Measuring the Completeness of Theories

10/15/2019 ∙ by Drew Fudenberg, et al.

We use machine learning to provide a tractable measure of the amount of predictable variation in the data that a theory captures, which we call its "completeness." We apply this measure to three problems: assigning certain equivalents to lotteries, initial play in games, and human generation of random sequences. We discover considerable variation in the completeness of existing models, which sheds light on whether to focus on developing better models with the same features or instead to look for new features that will improve predictions. We also illustrate how and why completeness varies with the experiments considered, which highlights the role played in choosing which experiments to run.




1 Problem and Approach

1.1 Prediction Problems

In a prediction problem, there is an outcome whose realization is of interest, and features that are statistically related to the outcome. The goal is to predict the outcome given the observed features. Some examples include predicting an individual’s future wage based on childhood covariates (city of birth, family income, quality of education, etc.), or predicting a criminal defendant’s flight risk based on their past record and properties of the crime (Kleinberg et al., 2017). We focus here on three prediction problems that emerge from experimental economics:

Example 1 (Risk Preferences).

Can we predict the valuations that people assign to various money lotteries?

Example 2 (Predicting Play in Games).

Can we predict how people will play the first time they encounter a given simultaneous-move game?

Example 3 (Human Generation of Random Sequences).

Given a target random process—for example, a Bernoulli random sequence—can we predict the errors that a human makes while mimicking this process?

Formally, suppose that the observable features belong to some space $\mathcal{X}$ and the outcome belongs to $\mathcal{Y}$. A map $f: \mathcal{X} \to \mathcal{Y}$ from features to outcomes is a (point) prediction rule.[1] Many economic models can be described as a parametric family of prediction rules $\mathcal{F}_\Theta = \{f_\theta\}_{\theta \in \Theta}$. For example, if our model class imposes a linear relationship between the outcome and a set of features, then the parameter $\theta$ would define a vector of weights applied to each of the features. In the application we study in Section 2.1, the expected utility class describes a family of utility functions over dollar amounts, and the parameter $\theta$ reflects the degree of risk aversion.

[1] Note that a prediction of a probability distribution over $\mathcal{Y}$ can be cast as the prediction of a point in the space of distributions on $\mathcal{Y}$.

1.2 Accuracy and Completeness

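We suppose that our prediction problem comes with a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, where $\ell(y', y)$ is the error assigned to a prediction of $y'$ when the realized outcome is $y$. The commonly used loss functions mean-squared error and classification loss correspond to $\ell(y', y) = (y' - y)^2$ and $\ell(y', y) = \mathbb{1}(y' \neq y)$ respectively.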


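The expected error (or risk) of prediction rule $f$ on a new observation $(x, y)$ generated according to the joint distribution $P$ is[2]

$$\mathcal{E}_P(f) = \mathbb{E}_P\left[\ell(f(x), y)\right].$$

The prediction rule in the class $\mathcal{F}_\Theta$ that minimizes the expected prediction error is the one associated with the parameter value

$$\theta^* = \arg\min_{\theta \in \Theta} \mathcal{E}_P(f_\theta).$$

The expected error of this “best” rule in $\mathcal{F}_\Theta$ is $\mathcal{E}_P(f_{\theta^*})$.

[2] Different loss functions are typically used when predicting distributions, see e.g. Gneiting and Raftery (2007).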
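As a concrete illustration, the two loss functions and the sample analogue of the expected error can be sketched in a few lines of Python. The function names are hypothetical and chosen only for this example; the average over a finite sample is an estimate of the expected error, not the expected error itself.

```python
from typing import Callable, Hashable, Sequence, Tuple


def squared_loss(y_pred: float, y_true: float) -> float:
    """Mean-squared-error loss for a single prediction."""
    return (y_pred - y_true) ** 2


def classification_loss(y_pred: Hashable, y_true: Hashable) -> float:
    """Classification (0-1) loss: 1 if the prediction is wrong, 0 otherwise."""
    return 0.0 if y_pred == y_true else 1.0


def average_loss(
    rule: Callable,
    data: Sequence[Tuple[object, object]],
    loss: Callable[[object, object], float],
) -> float:
    """Average loss of a prediction rule over (feature, outcome) pairs."""
    if not data:
        raise ValueError("data must contain at least one observation")
    return sum(loss(rule(x), y) for x, y in data) / len(data)
```

For instance, a rule that always predicts 0 on the sample `[(None, 1.0), (None, -1.0)]` has an average squared loss of 1.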

In Section 1.3.1, we discuss how to estimate $\mathcal{E}_P(f_{\theta^*})$ on finite data; here we discuss how to interpret it. To understand a model’s error, it is helpful to distinguish between two very different error sources.

First, if the conditional distribution $P(y \mid x)$ is not degenerate, then even the ideal prediction rule

$$f^*(x) = \arg\min_{y' \in \mathcal{Y}} \mathbb{E}_P\left[\ell(y', y) \mid x\right]$$

does not predict perfectly.

The irreducible error in the prediction problem is the expected error $\mathcal{E}_P(f^*)$ of the ideal rule on a new test observation.

The irreducible error is an upper bound on how well we can predict using the features $x$: no prediction rule based on these features has a lower expected error.

A different source of prediction error is the specification of which prediction rules are in the class $\mathcal{F}_\Theta$. Typically the best possible model will not be an element of $\mathcal{F}_\Theta$; that is, most sets of models are at least slightly misspecified. If $\mathcal{F}_\Theta$ leaves out an important regularity in the data, there may exist models outside of $\mathcal{F}_\Theta$ that give much better predictions on this domain.[3]

[3] On the other hand, expanding the model class risks overfitting, so more parsimonious model classes can lead to more accurate predictions when data is scarce (Hastie et al., 2009). As we discuss in Sections 1.3.2 and 1.4, all of the data sets we consider here are large relative to the number of features.

These two sources of prediction error have very different implications for how to improve prediction in the domain. If the achieved performance of the model is substantially lower than the best feasible performance, then it may be possible to achieve large improvements without seeking additional inputs, for example by identifying new regularities in behavior. On the other hand, if the achieved prediction error is close to the best achievable level of prediction for our feature set, then only marginal gains are feasible from identification of new structure. This encourages consideration of prediction rules based on some larger feature space $\mathcal{X}'$.

We propose the ratio of reduction in prediction error achieved by the model, compared to the achievable reduction, as a measure of how close the model comes to the best achievable performance. We call this ratio the model’s completeness. To operationalize this measure, let $f_{\text{naive}}$ be a naive rule suited to the prediction problem; this rule, such as “predict uniformly at random,” is meant to represent a baseline level of performance that any reasonable model should exceed.

The completeness for the model class $\mathcal{F}_\Theta$ is

$$\frac{\mathcal{E}_P(f_{\text{naive}}) - \mathcal{E}_P(f_{\theta^*})}{\mathcal{E}_P(f_{\text{naive}}) - \mathcal{E}_P(f^*)}. \tag{2}$$


Note that the completeness measure depends on the underlying distribution $P$. We expect the conditional distribution $P(y \mid x)$ to be a fixed distribution describing the true dependence of the outcome on the features, but the marginal distribution over the feature space $\mathcal{X}$ is frequently a choice variable of the analyst (e.g. which lotteries or games to run in an experiment). As we show in Section 2.2, when we change this marginal distribution, we obtain different measures of completeness for the same model. Ideally, we would like the chosen distribution over features to be the one that is most economically relevant, but in practice we may not know what that is.
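The ratio is straightforward to compute once the three errors are available. The following is a minimal sketch with a hypothetical function name; it assumes the naive error is strictly larger than the ideal error, so that there is some achievable improvement to normalize by.

```python
def completeness(naive_error: float, model_error: float, ideal_error: float) -> float:
    """Share of the achievable error reduction that the model attains.

    Returns 0 when the model does no better than the naive rule and 1 when
    it matches the ideal rule. Values outside [0, 1] are possible on finite
    data, for example if the model does worse than the naive rule.
    """
    achievable = naive_error - ideal_error
    if achievable <= 0:
        raise ValueError("naive error must exceed ideal error")
    return (naive_error - model_error) / achievable
```

With a naive error of 1.0, an ideal error of 0.2, and a model error of 0.6, the model closes half of the achievable gap, so its completeness is 0.5.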

1.3 Evaluating Completeness on Finite Data

Neither the true joint distribution $P$ over features and outcomes nor the derived quantities $\mathcal{E}_P(f_{\text{naive}})$, $\mathcal{E}_P(f_{\theta^*})$, and $\mathcal{E}_P(f^*)$ are directly observable, but they can be estimated from data. We describe below an approach (tenfold cross-validation) that is standard for estimating expected prediction errors, and describe an algorithm, Table Lookup, for approximating the ideal prediction rule $f^*$.

1.3.1 Cross-Validated Prediction Errors

To evaluate the predictive accuracy of a model class $\mathcal{F}_\Theta$ on a finite data set, we first choose between prediction rules in $\mathcal{F}_\Theta$ based on how well they predict a sample of training observations. Then we evaluate the trained rule on a new set of test observations.

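Formally, for any integer $n$ let $\mathcal{Z}^n = (\mathcal{X} \times \mathcal{Y})^n$ be the space corresponding to $n$ observations of $(x, y)$, and suppose the analyst has access to a data set $Z \in \mathcal{Z}^n$. Using the procedure of $K$-fold cross-validation, this data is randomly split into $K$ equally-sized disjoint subsets $Z_1, \dots, Z_K$. In each iteration of the procedure, the subset $Z_k$ is identified as the test data and the remaining subsets $Z_{-k} = \bigcup_{j \neq k} Z_j$ are used as training data. The $k$-th parameter estimate is the one that minimizes average loss when predicting the $k$-th training set:

$$\hat{\theta}_k = \arg\min_{\theta \in \Theta} \frac{1}{|Z_{-k}|} \sum_{(x, y) \in Z_{-k}} \ell(f_\theta(x), y).$$

(The naive model prediction rule does not depend on the training data, and is always $f_{\text{naive}}$.) The out-of-sample error of the estimated rule $f_{\hat{\theta}_k}$ on the test set $Z_k$ is

$$\hat{\mathcal{E}}_{\Theta, k} = \frac{1}{|Z_k|} \sum_{(x, y) \in Z_k} \ell(f_{\hat{\theta}_k}(x), y). \tag{3}$$

If the data in $Z$ are drawn i.i.d. from $P$, the average out-of-sample error

$$\hat{\mathcal{E}}_\Theta = \frac{1}{K} \sum_{k=1}^{K} \hat{\mathcal{E}}_{\Theta, k} \tag{4}$$

is a consistent estimator for the expected error $\mathcal{E}_P(f_{\theta^*})$. The display in (4) is known as the $K$-fold cross-validated prediction error. In the main text, we will more simply refer to it as the prediction error of the model class $\mathcal{F}_\Theta$, understanding that it is a finite-data estimate.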

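Below we write $\hat{\mathcal{E}}_{\text{naive}}$ for the cross-validated prediction error for the naive rule and $\hat{\mathcal{E}}_\Theta$ for the cross-validated prediction error for the model class $\mathcal{F}_\Theta$. These are respectively our estimates for $\mathcal{E}_P(f_{\text{naive}})$ and $\mathcal{E}_P(f_{\theta^*})$.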
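The cross-validation procedure can be sketched as follows. This is an illustrative implementation rather than the one used for the reported results: the function name, the `fit` callback (which maps training data to a prediction rule), and the way folds are formed by striding over a shuffled index list are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple


def k_fold_cv_error(
    data: Sequence[Tuple[object, object]],
    fit: Callable[[list], Callable],
    loss: Callable[[object, object], float],
    k: int = 10,
    seed: int = 0,
) -> Tuple[float, List[float]]:
    """Return (cross-validated error, per-fold out-of-sample errors)."""
    if len(data) < k:
        raise ValueError("need at least k observations")
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)          # random split into folds
    folds = [indices[i::k] for i in range(k)]     # k near-equal disjoint subsets
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [data[i] for i in indices if i not in held_out]
        test = [data[i] for i in test_idx]
        rule = fit(train)                         # estimate on training folds only
        errors = [loss(rule(x), y) for x, y in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k, fold_errors
```

A naive rule fits into the same interface by passing a `fit` function that ignores its training data and always returns the same rule.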

1.3.2 Table Lookup Benchmark

To estimate the expected error of the ideal rule $f^*$, we apply a Table Lookup algorithm to each iteration of cross-validation. Formally, let

$$\hat{f}_k = \arg\min_{f: \mathcal{X} \to \mathcal{Y}} \frac{1}{|Z_{-k}|} \sum_{(x, y) \in Z_{-k}} \ell(f(x), y)$$

be the function that minimizes prediction error on the training data $Z_{-k}$ (the $k$-th training set), where we search across the complete (unrestricted) class of mappings from $\mathcal{X}$ to $\mathcal{Y}$. Then define the cross-validated Table Lookup error as in (3) and (4). This measure, which we will denote $\hat{\mathcal{E}}_{\text{TL}}$, is a consistent estimator for the irreducible error $\mathcal{E}_P(f^*)$. How good of an approximation it is depends on a comparison between the size of the data and the “effective” size of the feature set $\mathcal{X}$, by which we mean the number of unique feature vectors that appear in the data.[4]

[4] Table Lookup predicts well when we have a large number of observations for each unique feature vector $x$. This requires either that the feature space $\mathcal{X}$ is finite (as in our application in Section 2.3, where $\mathcal{X} = \{H, T\}^7$), or that the data-generating measure has finite support over $\mathcal{X}$ (as in our two applications in Sections 2.1 and 2.2). In some settings with a continuum of possible features there may be very few observations for a given feature vector. In these cases, we cannot directly use Table Lookup to approximate the ideal performance, and should instead use approaches that make assumptions on how outcomes are related at “nearby” features, e.g. kernel regression.
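For the two loss functions used in this paper, the unrestricted minimizer on the training data has a simple closed form: the mean outcome per feature vector under squared loss, and the modal outcome per feature vector under classification loss. The sketch below assumes hashable feature vectors (e.g. tuples). The fallback for feature vectors that never appear in the training data is an assumption of this example; the text does not say how such cases are handled.

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Callable, Hashable, List, Tuple


def fit_table_lookup_mean(train: List[Tuple[Hashable, float]]) -> Callable:
    """Table Lookup for squared loss: predict the training mean per feature vector."""
    groups = defaultdict(list)
    for x, y in train:
        groups[x].append(y)
    table = {x: mean(ys) for x, ys in groups.items()}
    fallback = mean([y for _, y in train])  # assumption: overall mean if x is unseen
    return lambda x: table.get(x, fallback)


def fit_table_lookup_mode(train: List[Tuple[Hashable, Hashable]]) -> Callable:
    """Table Lookup for classification loss: predict the training mode per feature vector."""
    groups = defaultdict(Counter)
    for x, y in train:
        groups[x][y] += 1
    table = {x: counts.most_common(1)[0][0] for x, counts in groups.items()}
    fallback = Counter(y for _, y in train).most_common(1)[0][0]  # assumption, as above
    return lambda x: table.get(x, fallback)
```

Either function can be passed as the training step of a cross-validation loop to obtain the Table Lookup benchmark error.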

One way to evaluate the accuracy of the Table Lookup estimate is to look at the standard error of the cross-validated prediction errors across folds. We report these standard errors for each of our applications and model classes. It turns out that for each of the applications we look at, and we suspect for other data sets as well, the Table Lookup standard errors are relatively small. (See Appendix A.1 for more detail.) As another test, we compare the performance of Table Lookup with a different machine learning algorithm that is better suited to smaller data sets (bagged decision trees), and find that Table Lookup performs comparably and in fact better in all of our applications (see Appendix A.2). These analyses suggest that the Table Lookup performance is indeed a reasonable approximation for the best achievable performance in each of our applications.
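A standard error across folds can be computed as in the sketch below. It uses one common convention, the sample standard deviation of the per-fold errors divided by the square root of the number of folds; this is an assumption of the example and may differ in normalization from the formula the authors use.

```python
from math import sqrt
from statistics import stdev
from typing import Sequence


def cv_standard_error(fold_errors: Sequence[float]) -> float:
    """Standard error of the mean of the per-fold out-of-sample errors."""
    k = len(fold_errors)
    if k < 2:
        raise ValueError("need at least two folds")
    return stdev(fold_errors) / sqrt(k)
```

A small standard error relative to the gap between benchmark and model errors suggests that the cross-validated estimates are stable across folds.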

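In place of the ideal completeness measure described in (2), we compute the following ratio from our data:

$$\frac{\hat{\mathcal{E}}_{\text{naive}} - \hat{\mathcal{E}}_\Theta}{\hat{\mathcal{E}}_{\text{naive}} - \hat{\mathcal{E}}_{\text{TL}}},$$

where $\hat{\mathcal{E}}_{\text{naive}}$, $\hat{\mathcal{E}}_\Theta$, and $\hat{\mathcal{E}}_{\text{TL}}$ are the cross-validated prediction errors of the naive rule, the model class, and Table Lookup.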

This is the ratio of reduction in cross-validated prediction error achieved by the model (relative to the naive baseline) compared to the reduction achieved by Table Lookup (again relative to the naive baseline).

1.4 Relationship to Literature

Irreducible error is an old concept in statistics and machine learning, and a large amount of work has focused on further decomposing the remaining (reducible) error into bias (reflecting error due to the specification of the model class) and variance (reflecting sensitivity of the estimated rule to the randomness in the training data). Depending on the quantity of data available to the analyst, it may be preferable to trade off bias for variance or vice versa.[5]

[5] For example, given small quantities of data, we may prefer to work with models that have fewer free parameters, leading to higher bias but potentially substantially lower variance. This paper abstracts from these concerns, as well as the related concern of overfitting. We work exclusively with data sets where the quantity of data is large enough that the most predictive model is approximately the most complex one, i.e. Table Lookup (see Appendix A).

A related literature compares the performance of specific machine learning algorithms to that of existing economic models. These algorithms are themselves potentially incomplete relative to the best achievable level, and thus provide a lower bound for the best achievable level, where the degree to which they are incomplete is a priori unknown. The closest of these papers to our work is Peysakhovich and Naecker (2017), which studies choices under uncertainty and under ambiguity, and constructs a benchmark based on regularized regression algorithms.[6] The improvements achieved by the algorithms are sometimes modest, perhaps due to intrinsic noise, as Bourgin et al. (2019) point out. We show how this noise can be quantified.

[6] In addition, Ori Plonsky (2017), Noti et al. (2016), and Plonsky et al. (2019) develop algorithmic models for predicting choice, Camerer et al. (2018) use machine learning to predict disagreements in bargaining, and Aaron Bodoh-Creed and Hickman (2019) use random forests to predict pricing variation.

Erev et al. (2007) define a model’s equivalent number of observations as the number of prior observations such that the mean of a data set of random observations has the same prediction error as the model. We expect that models with larger numbers of equivalent observations will be more complete by our metric.

Finally, an alternative measure of a model’s performance is the proportion of the variance in the outcome that it explains, that is the model’s $R^2$. This measure is not well suited to the question of the model’s completeness, because the best achievable $R^2$ cannot be directly inferred from the $R^2$ of any existing model.[7]

[7] We could, however, develop a notion of completeness based on comparing the achieved $R^2$ with the best achievable $R^2$, analogous to what we do here.

2 Three Applications

2.1 Domain #1: Assigning Certain Equivalents to Lotteries

Background and Data.

An important question in economics is how individuals evaluate risk. In addition to the Expected Utility models (von Neumann and Morgenstern, 1944; Savage, 1954; Samuelson, 1952), one of the most influential models of decision-making under risk in the last few decades has been Cumulative Prospect Theory (Tversky and Kahneman, 1992). This model provides a flexible family of risk preferences that accommodates certain behavioral anomalies, including reference-dependent preferences and nonlinear probability weighting.

A standard experimental paradigm for eliciting risk preferences, and thus for evaluating these models, is to ask subjects to report certainty equivalents for lotteries, i.e. the lowest certain payment that the individual would prefer over the lottery. We consider a data set from Bruhin et al. (2010), which includes 8906 certainty equivalents elicited from 179 subjects, all of whom were students at the University of Zurich or the Swiss Federal Institute of Technology Zurich. Subjects reported certainty equivalents for the same 50 two-outcome lotteries, half over positive outcomes (i.e. gains) and half over negative outcomes (i.e. losses).

Prediction Task and Models.

In this data set, the outcomes are the reported certainty equivalents for a given lottery, and the features are the lottery’s two possible monetary prizes $z_1$ and $z_2$, and the probability $p$ of the first prize. A prediction rule is any function that maps the tuple $(z_1, z_2, p)$ into a prediction for the certainty equivalent, i.e. a function $f: \mathbb{R} \times \mathbb{R} \times [0, 1] \to \mathbb{R}$.

We evaluate two prediction rules that are based on established models from the literature. Our Expected Utility (EU) rule sets the agent’s utility function to be $u_\alpha(z)$, where $\alpha$ is a free parameter that we train. The predicted certainty equivalent is $u_\alpha^{-1}\left(p\, u_\alpha(z_1) + (1 - p)\, u_\alpha(z_2)\right)$.

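Second, our Cumulative Prospect Theory (CPT) rule predicts $v^{-1}\left(w(p)\, v(z_1) + (1 - w(p))\, v(z_2)\right)$ for each lottery, where $w$ is a probability weighting function and $v$ is a value function. We follow the literature (see e.g. Bruhin et al. (2010)) in assuming the functional forms:

$$v(z) = \begin{cases} z^{\alpha} & \text{if } z \geq 0 \\ -(-z)^{\beta} & \text{if } z < 0 \end{cases} \qquad\qquad w(p) = \frac{\delta p^{\gamma}}{\delta p^{\gamma} + (1 - p)^{\gamma}} \tag{5}$$

This model has four free parameters $(\alpha, \beta, \gamma, \delta)$.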

Finally, as a naive benchmark, we predict the expected value of the lottery, which is $p z_1 + (1 - p) z_2$.[8]

[8] This naive benchmark is arguably less naive than the naive benchmarks we use for the other prediction problems. Replacing our naive benchmark with, for example, an unconditional mean, would result in even higher completeness for CPT than we already find in Table 2.
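The three prediction rules can be sketched in Python as below. The functional forms are assumptions made for illustration: a power value function with separate curvature for gains and losses, and a two-parameter probability weighting function of the kind used in Bruhin et al. (2010). The EU rule is written here as the special case with no probability weighting and a single curvature parameter, which is one possible reading rather than a confirmed specification. All function names are hypothetical.

```python
def expected_value(z1: float, z2: float, p: float) -> float:
    """Naive benchmark: the lottery's expected value."""
    return p * z1 + (1 - p) * z2


def value(z: float, alpha: float, beta: float) -> float:
    """Assumed power value function: z^alpha for gains, -(-z)^beta for losses."""
    return z ** alpha if z >= 0 else -((-z) ** beta)


def value_inverse(t: float, alpha: float, beta: float) -> float:
    """Inverse of `value`, mapping a valuation back to a certain amount."""
    return t ** (1 / alpha) if t >= 0 else -((-t) ** (1 / beta))


def weight(p: float, gamma: float, delta: float) -> float:
    """Assumed probability weighting function with curvature gamma, elevation delta."""
    num = delta * p ** gamma
    return num / (num + (1 - p) ** gamma)


def cpt_certainty_equivalent(z1, z2, p, alpha, beta, gamma, delta) -> float:
    """CPT prediction for a two-outcome lottery paying z1 with probability p."""
    wp = weight(p, gamma, delta)
    t = wp * value(z1, alpha, beta) + (1 - wp) * value(z2, alpha, beta)
    return value_inverse(t, alpha, beta)


def eu_certainty_equivalent(z1, z2, p, alpha) -> float:
    """EU prediction, assumed here to be CPT with w(p) = p and one curvature parameter."""
    return cpt_certainty_equivalent(z1, z2, p, alpha, alpha, 1.0, 1.0)
```

With all parameters set to 1, both model rules reduce to the expected value; with `alpha = 0.5`, a 50-50 lottery over 100 and 0 has a predicted certainty equivalent of 25, below its expected value of 50, reflecting risk aversion over gains.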

Performance Metric.

For a given test set of $n$ observations $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i = (z_1^i, z_2^i, p^i)$ is the lottery shown in observation $i$ and $y_i$ is the reported certainty equivalent, we evaluate the prediction error of prediction rule $f$ using

$$\frac{1}{n} \sum_{i=1}^{n} \left(f(x_i) - y_i\right)^2.$$

This loss function, mean-squared error, penalizes the squared distance between the predicted and actual response, and is minimized when $f(x)$ is the mean response for lottery $x$.

To conduct out-of-sample tests of the models described above, we follow the standard approach of tenfold cross-validation described in Section 1.3.1, estimating the free parameters of the model on training data and evaluating how well the estimated model predicts choices in a test set.


The following table reveals that both models are predictive, improving upon the Expected Value benchmark:[9]

                    Error
Naive Benchmark     103.81
Expected Utility    99.67
CPT                 67.38
Table 1: Both models are predictive.

[9] The parameter estimate for EU is , and the parameter estimates for CPT are and .

The improvement of CPT over the naive benchmark is larger than that of Expected Utility, but the CPT performance is substantially worse than perfect prediction. It is not surprising that these models do not achieve perfect prediction, as we expect different subjects to report different certainty equivalents for the same lottery, and thus a model that provides the same prediction for each input cannot possibly predict every reported certainty equivalent.

But another source for prediction error is the functional form assumptions that we made in (5). Could a different (potentially more complex) specification for the value function or probability weighting function lead to large gains in prediction? Moreover, might there be other features of risk evaluation, yet unmodelled, which lead to even larger improvements in prediction?

To separate these sources of error, we need to understand how the CPT performance compares to the best achievable performance for this data. For this evaluation, we construct an ideal benchmark using a Table Lookup procedure. The lookup table’s rows correspond to the 50 unique lotteries in our data, and the predicted certainty equivalent for each lottery is the mean response for that lottery in the training data. Given sufficiently many reports for each lottery, the lookup table prediction approximates the actual mean responses in the test data, and its error approximates the best possible error that is achievable by any prediction rule that takes $(z_1, z_2, p)$ as its input. We report this benchmark below in Table 2:

                    Error     Completeness
Naive Benchmark     103.81    0%
Expected Utility    99.67     11%
CPT                 67.38     95%
Table Lookup        65.58     100%
Table 2: CPT is nearly complete for prediction of our data.
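The completeness figures in Table 2 follow directly from the reported errors. As a worked check, taking each model's reduction in error relative to the naive benchmark and dividing by the reduction achieved by Table Lookup gives:

```latex
\text{EU:}\quad \frac{103.81 - 99.67}{103.81 - 65.58} = \frac{4.14}{38.23} \approx 0.11,
\qquad
\text{CPT:}\quad \frac{103.81 - 67.38}{103.81 - 65.58} = \frac{36.43}{38.23} \approx 0.95.
```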

The Table Lookup benchmark shows that no prediction rule based on $(z_1, z_2, p)$ can improve more than slightly over CPT on this data, because CPT obtains 95% of the feasible improvement in prediction.[10] This tells us that to make substantially better predictions, we would need to expand the set of variables on which the model depends. For example, as we discuss in Section 11, we could group subjects using auxiliary data such as their evaluations of other lotteries or response times, and make separate predictions for each group.

[10] From this data it is hard to know whether the high completeness of CPT (in the specified functional form) comes from its good match to actual behavior or because it is flexible enough to mimic Table Lookup on many data sets. We leave exploration of this question to future work.

We note that our completeness measure does not imply that in general CPT is a nearly-complete model for predicting certainty equivalents, since the completeness measure we obtain is determined from a specific data set, and thus its generalizability depends on the extent to which that data is representative. Indeed, the data from Bruhin et al. (2010) has certain special features; for example, all lotteries in the data are over two possible outcomes. It would be an interesting exercise to evaluate the completeness of CPT using observations on lotteries with more complex supports.

2.2 Domain #2: Initial Play in Games

Background and Data.

In many game theory experiments, equilibrium analysis has been shown to be a poor predictor of the choices that people make when they encounter a new game. This has led to models of initial play that depart from equilibrium theory, for example the level-$k$ models of Stahl and Wilson (1994) and Nagel (1995), the Poisson Cognitive Hierarchy model (Camerer et al., 2004), and the related models surveyed in Crawford et al. (2013). These models represent improvements over the equilibrium predictions, but we do not know how substantial these improvements are. Are there important regularities in play that have not yet been modeled?

To study this question, we use a data set from Fudenberg and Liang (2018) consisting of 23,137 total observations of initial play from 486 matrix games.[11][12] As in the previous section, we pool observations across all of the subjects and games.

[11] This data is an aggregate of three data sets: the first is a meta data set of play in 86 games, collected from six experimental game theory papers by Kevin Leyton-Brown and James Wright, see Wright and Leyton-Brown (2014); the second is a data set of play in 200 games with randomly generated payoffs, which were gathered on MTurk for Fudenberg and Liang (2018); the final is a data set of play in 200 games that were “algorithmically designed” for a certain model (level 1) to perform poorly, again from Fudenberg and Liang (2018).

[12] There was no learning in these experiments: subjects were randomly matched to opponents, were not informed of their partners’ play, and did not learn their own payoffs until the end of the session.

Prediction Task, Performance Metric, and Models.

In the prediction problem we consider here, the outcome is the action that is chosen by the row player in a given instance of play, and the features are the 18 entries of the payoff matrix. A prediction rule is thus any map from payoff matrices to row player actions.

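For each prediction rule $f$ and test set of $n$ observations $\{(g_i, a_i)\}_{i=1}^{n}$, where $g_i$ is the payoff matrix in observation $i$ and $a_i$ is the observed row player action, we evaluate error using the misclassification rate

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left(f(g_i) \neq a_i\right).$$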

This is the fraction of observations where the predicted action was not the observed action.

As a naive baseline, we consider guessing uniformly at random for all games, which yields an expected misclassification rate of $2/3$. Additionally, we consider a prediction rule based on the Poisson Cognitive Hierarchy Model (PCHM), which supposes that there is a distribution over players of differing levels of sophistication: the level-0 player is maximally unsophisticated and randomizes uniformly over his available actions, while the level-1 player best responds to level-0 play (Stahl and Wilson, 1994, 1995; Nagel, 1995). Camerer et al. (2004) defines the play of level-$k$ players, $k \geq 2$, to be best responses to a perceived distribution

$$g_k(h) = \frac{\pi_\tau(h)}{\sum_{l=0}^{k-1} \pi_\tau(l)}, \qquad h \in \{0, 1, \dots, k-1\},$$

over (lower) opponent levels, where $\pi_\tau$ is the Poisson distribution with rate parameter $\tau$.[13] A predicted distribution over actions is derived by supposing that the proportion of level-$k$ players in the population is proportional to $\pi_\tau(k)$. Assuming this is the true distribution of play, the misclassification rate is minimized by predicting the mode of this distribution, and this is what we set as the PCHM prediction.

[13] Throughout, we take $\tau$ to be a free parameter and estimate it from the training data.
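The PCHM prediction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: levels are truncated at a hypothetical `max_level`, ties in best responses are broken in favor of the lowest action index, and payoffs are passed as two matrices indexed by (row action, column action).

```python
from math import exp, factorial
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def poisson_pmf(k: int, tau: float) -> float:
    return exp(-tau) * tau ** k / factorial(k)


def best_response(own_payoffs: Matrix, opponent_mix: Sequence[float]) -> List[float]:
    """Pure best response (as a degenerate mixture); own_payoffs[a][b] is the
    payoff from own action a against opponent action b."""
    expected = [sum(q * u for q, u in zip(opponent_mix, row)) for row in own_payoffs]
    best = max(range(len(expected)), key=expected.__getitem__)  # ties: lowest index
    return [1.0 if a == best else 0.0 for a in range(len(own_payoffs))]


def pchm_row_distribution(row_payoffs: Matrix, col_payoffs: Matrix,
                          tau: float, max_level: int = 10) -> List[float]:
    """Predicted distribution over row actions under Poisson Cognitive Hierarchy."""
    n_row, n_col = len(row_payoffs), len(row_payoffs[0])
    # Re-index the column player's payoffs by (own action, opponent action).
    col_own = [[col_payoffs[i][j] for i in range(n_row)] for j in range(n_col)]
    w = [poisson_pmf(k, tau) for k in range(max_level + 1)]
    row_levels = [[1.0 / n_row] * n_row]   # level 0 randomizes uniformly
    col_levels = [[1.0 / n_col] * n_col]
    for k in range(1, max_level + 1):
        total = sum(w[:k])                 # normalizes beliefs over levels 0..k-1
        col_mix = [sum(w[h] * col_levels[h][j] for h in range(k)) / total
                   for j in range(n_col)]
        row_mix = [sum(w[h] * row_levels[h][i] for h in range(k)) / total
                   for i in range(n_row)]
        row_levels.append(best_response(row_payoffs, col_mix))
        col_levels.append(best_response(col_own, row_mix))
    total = sum(w)
    return [sum(w[k] * row_levels[k][i] for k in range(max_level + 1)) / total
            for i in range(n_row)]


def pchm_prediction(row_payoffs: Matrix, col_payoffs: Matrix, tau: float) -> int:
    """Index of the modal row action, which minimizes the misclassification rate."""
    dist = pchm_row_distribution(row_payoffs, col_payoffs, tau)
    return max(range(len(dist)), key=dist.__getitem__)
```

In a cross-validation loop, the training step would choose the value of `tau` whose modal predictions give the lowest misclassification rate on the training games.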

As in Section 2.1, we estimate the free parameter $\tau$ on training data, and evaluate the out-of-sample prediction of the estimated model. All reported prediction errors are tenfold cross-validated.


Because we use the classification loss as the loss function, the best attainable classification error will differ across games: in games where all subjects choose the same action, a perfect zero-error prediction is feasible, but when play is close to uniform over the actions, it will be hard to improve over random guessing. This means that the same level of predictive accuracy should potentially be evaluated quite differently, depending on what kinds of games are being predicted.

We illustrate this by comparing predictions for two subsets of our data: Data Set A consists of the 16,660 observations of play from the 359 games with no strictly dominated actions.[14] Data Set B consists of the 7,860 observations of play from the 161 games in which the action profile with the highest sum of player payoffs is outside of the support of level-$k$ actions,[15] and moreover the difference in the payoff sums is large (at least 20% of the largest row player payoff in the game). For example, the following game is included in Data Set B:

[14] Specifically, we consider games where no pure action is strictly dominated by another pure action.

[15] Here we use the classic definition from Stahl and Wilson (1995) and Nagel (1995), where each level-$k$ action is the best response to the level-$(k-1)$ action.

In this game, one action is level 1, since it yields the highest expected payoff against uniform play, and a second action is level 2, since it is a best response against play of the level-1 action. Because the profile in which both players choose their level-2 actions is a pure-strategy Nash equilibrium, the level-2 action is then level-$k$ for all $k \geq 2$. The highest possible payoff sum achieved by playing either the level-1 or the level-2 action is 120, but another action profile yields a higher payoff sum of 160. The difference, 40, is 50% of the max row player payoff in this game, 80.

In both data sets, a range of values for the free parameter $\tau$ generate the same predicted modal action, and so have the same cross-validated prediction error. For all of the games in our data, this mode is simply the level-1 action. But as Table 3 shows, PCHM improves upon the naive benchmark by a larger amount for prediction of play in Data Set B, compared to Data Set A. Using perfect prediction as the benchmark, this would imply that PCHM is a more complete model of play for games in Data Set B.[16]

[16] Instead of our task of predicting each action, Fudenberg and Liang (2018) studies the task of predicting the modal action in each game; the ideal prediction for that task always has no error at all. Correspondingly for that prediction task, Fudenberg and Liang (2018) also used a different cross-validation procedure: instead of dividing the data into folds at random as described above, it split the set of games so that the games in the training set were not used for testing. This alternative is relevant for the study of how well we can extrapolate from one game to another, which is not the question of interest here.

                    Data Set A    Data Set B
Naive Benchmark     0.66          0.66
PCHM                0.49          0.44
                    (0.004)       (0.009)
Table 3: PCHM improves upon the naive baseline by a larger amount for prediction of play in Data Set B.

But the amount of irreducible error in the two data sets may be quite different, leading to different predictive limits. Thus we need to understand how the prediction errors compare to the best achievable error for the two data sets. We can again gain insight into this by building a lookup table. The rows of the table are the different games, and the associated predictions are the modal actions (observed for those games) in the training data. Given sufficiently many observations, the modal action in the training data will also be the action most likely to be played in the test data, thus minimizing classification error.

Below we report the Table Lookup performance and completeness measures relative to this performance.

                    Data Set A                Data Set B
                    Error     Completeness    Error     Completeness
Naive Benchmark     0.66      0%              0.66      0%
PCHM                0.49      68%             0.44      67%
                    (0.006)                   (0.009)
Table Lookup        0.41      100%            0.34      100%
                    (0.005)                   (0.006)
Table 4: PCHM achieves roughly the same completeness for both data sets.

Although PCHM achieves a smaller absolute improvement over the naive baseline for Data Set A, the achievable improvement is also lower. Thus, relative to the appropriate benchmarks, the completeness of PCHM is in fact roughly equivalent for the two data sets (and marginally lower for Data Set B). This comparison illustrates how prediction accuracy can be misleading without an accompanying benchmark.

Our exercise here is not special to the two sets of games we have examined; indeed, we can repeat the analysis for other subsets of the data, and determine completeness measures for each of these. For example, Table 5 reports prediction errors for a data set consisting of the 9,243 observations of play from the 175 games where the level-1 action’s expected payoff against uniform play is much higher than the expected payoff of the next best action (specifically, it is larger by at least of the max row player payoff in the game).

                    Error     Completeness
Naive Benchmark     0.66      0%
PCHM                0.28      97%
Table Lookup        0.27      100%
Table 5: Prediction errors for games in which the level-1 action is much better against uniform play than the next best action.

The Table Lookup error is much lower for this set of games, revealing that for these games, play is much more concentrated on a single action. Thus we would hope for our models to also achieve higher predictive accuracies, and indeed we find that PCHM predicts an incorrect action only 28% of the time. For a more exhaustive inquiry into when PCHM succeeds and fails, we could elicit completeness measures for different subsets of the data, and identify those games where PCHM is most incomplete.

2.3 Domain #3: Human Generation of Random Sequences

Background and Data.

Extensive experimental and empirical evidence suggests that humans misperceive randomness, expecting for example that sequences of coin flips “self-correct” (too many Heads in a row must be followed by a Tails) and are balanced (the proportions of Heads and Tails are approximately the same) (Bar-Hillel and Wagenaar, 1991; Tversky and Kahneman, 1971). These misperceptions are significant not only for their basic psychological interest, but also for the ways in which misperception of randomness manifests itself in a variety of contexts: for example, investors’ judgment of sequences of (random) stock returns (Barberis et al., 1998), professional decision-makers’ reluctance to choose the same (correct) option multiple times in succession (Chen et al., 2016), and people’s execution of a mixed strategy in a game (Batzilis et al., 2016).

A common experimental framework in this area is to ask human participants to generate fixed-length strings of $k$ (pseudo-)random coin flips, for some small value of $k$, and then to compare the produced distribution over length-$k$ strings to the output of a Bernoulli process that generates realizations from $\{H, T\}$ independently and uniformly at random (Rapaport and Budescu, 1997; Nickerson and Butler, 2009). Following in this tradition, we use the platform Mechanical Turk to collect a large dataset of human-generated strings designed to simulate the output of a Bernoulli(0.5) process, in which each symbol in the string is generated from $\{H, T\}$ independently and uniformly at random. To incentivize effort, we told subjects that payment would be approved only if their (set of) strings could not be identified as human-generated with high confidence.[17][18] Following removal of subjects who were clearly not attempting to mimic a random process, our final data set consisted of 21,975 strings generated by 471 subjects.[19]

[17] In one experiment, 537 subjects each produced 50 binary strings of length eight. In a second experiment, an additional 101 subjects were asked to each generate 25 binary strings of length eight.

[18] Subjects were informed: “To encourage effort in this task, we have developed an algorithm (based on previous Mechanical Turkers) that detects human-generated coin flips from computer-generated coin flips. You are approved for payment only if our computer is not able to identify your flips as human-generated with high confidence.”

[19] Our initial data set consists of 29,375 binary strings. We chose to remove all subjects who repeated any string in more than five rounds. This cutoff was selected by looking at how often each subject generated any given string, and finding the average “highest frequency” across subjects. This turned out to be 10% of the strings, or five strings. Thus, our selection criterion removes all subjects whose highest frequency was above average. This selection eliminated 167 subjects and 7,400 strings, yielding a final dataset with 471 subjects and 21,975 strings. We check that our main results are not too sensitive to this selection criterion by considering two alternative choices in Appendix C.2: first, keeping only the initial 25 strings generated by all subjects, and then, removing the subjects whose strings are “most different” from a Bernoulli process under a $\chi^2$-test. We find very similar results under these alternative criteria.

Prediction Task, Performance Metric, and Models.

We consider the problem of predicting the probability that the eighth entry in a string is $H$ given its first seven elements. Thus the outcome here is a number in $[0,1]$, that is, a distribution on $\{H,T\}$, and the feature space is $\{H,T\}^7$. (Note that, as in the previous examples, we fit a representative-agent model and do not treat the identity of the subject as a feature.)

Given a test dataset $\{(s_i^1,\dots,s_i^8)\}_{i=1}^n$ of $n$ binary strings of length 8, we evaluate the error of the prediction rule $f$ using mean-squared error

$$\frac{1}{n}\sum_{i=1}^{n}\left(s_i^8 - f(s_i^1,\dots,s_i^7)\right)^2,$$

where $f(s_i^1,\dots,s_i^7)$ is the predicted probability that the eighth flip is ‘H’ given the observed initial seven flips $(s_i^1,\dots,s_i^7)$, and $s_i^8$ is the actual eighth flip (coded as 1 for ‘H’ and 0 for ‘T’). [20] Note that the naive baseline of unconditionally guessing 0.5 guarantees a mean-squared prediction error of 0.25. Moreover, if the strings in the test set were truly generated via a Bernoulli(0.5) process, then no prediction rule could improve in expectation upon the naive error. [21] We expect that the presence of behavioral errors in the generation process will make it possible to improve upon the naive baseline, but do not know how much it is possible to improve upon 0.25.

[20] Alternatively we could have defined the outcome to be an individual realization of $H$ or $T$, so that prediction rules are maps $f:\{H,T\}^7 \to \{H,T\}$, and then evaluated error using the misclassification rate (i.e. the fraction of instances where the predicted outcome was not the realized outcome). We do not take a stand on which method is better, but note that the completeness measure can depend on which one is used. In Appendix C.1 we show that the completeness measures are very similar using this alternative formulation.

[21] Due to the convexity of the loss function, it is possible to do worse than the naive baseline, for example by predicting 1 unconditionally.

In this task, the natural naive baseline is the rule that unconditionally guesses that the probability the final flip is ‘H’ is 0.5. We compare this to prediction rules based on Rabin (2002) and Rabin and Vayanos (2010), both of which predict generation of negatively autocorrelated sequences. [22] Our prediction rule based on Rabin (2002) supposes that subjects generate sequences by drawing sequentially without replacement from an urn containing $N/2$ ‘1’ balls and $N/2$ ‘0’ balls. The urn is “refreshed” (meaning the composition is returned to its original) every period with independent probability $p$. This model has two free parameters: $N$ and $p$.

[22] Although both of these frameworks are models of mistaken inference from data, as opposed to human attempts to generate random sequences, they are easily adapted to our setting, as the papers explain.

Our prediction rule based on Rabin and Vayanos (2010) assumes that the first flip $s_1 \sim \mathrm{Bernoulli}(0.5)$, while each subsequent flip $s_k$ is distributed

$$s_k \sim \mathrm{Bernoulli}\left(0.5 - \alpha \sum_{j=0}^{k-2} \delta^{j}\,(2 s_{k-j-1} - 1)\right),$$

where the parameter $\delta$ reflects the (decaying) influence of past flips, and the parameter $\alpha$ measures the strength of negative autocorrelation. [23]

[23] We make a small modification to the Rabin and Vayanos (2010) model, allowing $\alpha, \delta \in \mathbb{R}_+$ instead of $\alpha, \delta \in [0,1)$.
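To make the role of the two parameters concrete, the following is a minimal sketch of a prediction rule in the spirit of Rabin and Vayanos (2010). It assumes that the probability of a ‘1’ equals 0.5 minus a strength parameter times a geometrically discounted sum of past flips recoded as +1/-1. The function name `rv_predict`, the 1/0 coding, and the clipping to [0, 1] are illustrative assumptions, not the implementation used for the results reported here.

```python
def rv_predict(history, alpha, delta):
    """Assumed Rabin-Vayanos-style rule: probability that the next flip is 1.

    history: past flips coded 1 (Heads) / 0 (Tails), oldest first.
    alpha:   strength of negative autocorrelation.
    delta:   decay rate of the influence of past flips.
    """
    if not history:
        return 0.5  # the first flip is treated as a fair coin
    # The most recent flip gets weight delta**0, the one before delta**1, ...
    pressure = sum(
        (delta ** j) * (2 * s - 1)
        for j, s in enumerate(reversed(history))
    )
    p = 0.5 - alpha * pressure
    return min(1.0, max(0.0, p))  # clipping is a safeguard added in this sketch


# A run of Heads pushes the predicted probability of Heads below 0.5.
print(rv_predict([1, 1, 1, 1, 1, 1, 1], alpha=0.05, delta=0.5))
# An alternating history leaves the prediction close to 0.5.
print(rv_predict([0, 1, 0, 1, 0, 1, 0], alpha=0.05, delta=0.5))
```

With `alpha = 0` the rule collapses to the naive baseline of 0.5, so fitting `alpha` and `delta` to training data can only help in-sample.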


Table 6 shows that both prediction rules improve upon the naive baseline. The need for a benchmark for achievable prediction is starkest in this application, however, as the best improvement is only 0.0008, while the gap between the best prediction error and a perfect zero is large. This is not surprising, as we expect substantial variation in the eighth flip following the same initial seven flips because we asked subjects to mimic a fair coin.

                           Error
Naive Benchmark            0.25
Rabin (2002)               0.2494 (0.0007)
Rabin and Vayanos (2010)   0.2492 (0.0007)
Table 6: Both models improve upon naive guessing, but the absolute improvement is small.

For this problem, the lookup table’s rows correspond to the unique initial seven-flip sequences, and we associate each such string to the empirical frequency with which that string is followed by ‘H’ in the training data. Given a sufficiently large training set, we can approximate the true continuation frequency for each initial sequence, and hence approximate the best achievable error. We note here that although there are $2^7 = 128$ unique initial sequences, with approximately 21,000 strings in our data set, we have (on average) 164 observations per initial sequence.
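The lookup table described above can be sketched in a few lines. This is an illustrative sketch rather than the code behind the reported results: strings are assumed to be 8-character Python strings over "H" and "T", the function names are hypothetical, and the fallback of 0.5 for prefixes never seen in training is an added assumption.

```python
from collections import defaultdict


def fit_table_lookup(train_strings):
    """Map each 7-flip prefix to the empirical frequency of 'H' as flip 8."""
    counts = defaultdict(lambda: [0, 0])  # prefix -> [num H, num total]
    for s in train_strings:
        prefix, last = s[:7], s[7]
        counts[prefix][0] += last == "H"
        counts[prefix][1] += 1
    return {p: h / n for p, (h, n) in counts.items()}


def predict(table, prefix, default=0.5):
    # Unseen prefixes fall back to the naive guess (an assumption of this sketch).
    return table.get(prefix, default)


def mse(table, test_strings):
    """Mean-squared error, with the realized eighth flip coded H=1, T=0."""
    errs = [(predict(table, s[:7]) - (s[7] == "H")) ** 2 for s in test_strings]
    return sum(errs) / len(errs)


train = ["HTHTHTHT", "HTHTHTHH", "HHHHHHHT", "HHHHHHHT"]
table = fit_table_lookup(train)
print(table["HTHTHTH"])  # 0.5
print(table["HHHHHHH"])  # 0.0
```

In practice the table would be fit on the training folds and `mse` evaluated on the held-out fold, so that the reported error is out of sample.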

                           Error             Completeness
Naive Benchmark            0.25              0%
Rabin (2002)               0.2494 (0.0007)   10%
Rabin & Vayanos (2010)     0.2492 (0.0007)   14%
Table Lookup               0.2441 (0.0006)   100%
Table 7: The Table Lookup benchmark permits a more accurate representation of the completeness of these models.

We find that Table Lookup achieves a prediction error of 0.2441, so that naively comparing achieved prediction error against perfect prediction (which would suggest a completeness measure of at most 0.4%) grossly misrepresents the performance of the models. Relative to the Table Lookup benchmark, the existing models produce up to 14% of the achievable improvement in prediction error. This suggests that although negative autocorrelation is indeed present in the human-generated strings, and explains a sizable part of the deviation from a Bernoulli(0.5) process, there is additional structure that could yet be exploited for prediction.
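The completeness figures in Table 7 follow from a simple ratio: the error reduction a model achieves over the naive rule, divided by the reduction achieved by Table Lookup. A short sketch of that arithmetic, using the errors reported in Table 7 (the function name `completeness` is chosen here for illustration):

```python
def completeness(model_error, naive_error, best_error):
    """Share of the achievable error reduction that a model attains."""
    return (naive_error - model_error) / (naive_error - best_error)


naive, lookup = 0.25, 0.2441  # errors reported in Table 7
models = [("Rabin (2002)", 0.2494), ("Rabin and Vayanos (2010)", 0.2492)]
for name, err in models:
    print(name, round(100 * completeness(err, naive, lookup)), "%")

# Measured against a perfect-prediction benchmark of zero error instead,
# the same model appears to capture well under 1% of the variation.
print(round(100 * completeness(0.2492, naive, 0.0), 2), "%")
```

The contrast between the two benchmarks is the point of the exercise: the denominator, not the model, changes by more than an order of magnitude.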

3 Extensions

3.1 Subject Heterogeneity

So far we’ve focused on evaluating representative agent models that implement a single prediction across all subjects. When we evaluate models that include subject heterogeneity, the question of what is the best achievable level of accuracy is still relevant, and the suitable analogue of Table Lookup—with subject type added as an additional feature—can again help us to determine this. The exact implementation of Table Lookup will depend on how the groups are determined. As a simple illustration, we return to our first domain—evaluation of risk—and demonstrate how to construct a predictive bound for certain models with subject heterogeneity.

The models that we consider extend the Expected Utility and Cumulative Prospect Theory models introduced in Section 2.1 by allowing for three groups of subjects. To test the models, we randomly select 71 (out of 171) subjects to be test subjects, and 45 (out of 50) lotteries to be test lotteries. All other data (the 100 training subjects’ choices in all lotteries, as well as the test subjects’ choices in the 5 training lotteries) are used for training the models.

In more detail, we first use the training subjects’ responses in the training lotteries to develop a clustering algorithm for separating subjects into three groups. [24] This algorithm can be used to assign a group number to any new subject based on their choices in the five training lotteries. Second, we use each group’s training subjects’ responses in the test lotteries to estimate free model parameters, that is, the single free parameter of the EU model and the four free parameters of CPT. This yields three versions of EU and CPT, one per group.

[24] We use a simple algorithm, $k$-means, which minimizes the Euclidean distance between the vectors of reported certainty equivalents for subjects within the same group.

Out of sample, we first use the clustering algorithm to assign groups to the test subjects, and then use the associated models to predict their certainty equivalents in the test lotteries. We measure accuracy using mean-squared error, as in Section 2.1, and we again report the Expected Value prediction as a naive baseline.

Prediction Error
Naive Benchmark 104.17
Expected Utility 86.68
CPT 57.14
Table 8: Prediction Errors Achieved by Models with Subject Heterogeneity

What we find from Table 8 is very similar to what we observed in Section 2.1: Both models improve upon the naive baseline, but we do not know how complete these improvements are. To better evaluate the achieved improvements, we need a benchmark that tells us the best feasible prediction.

Our approach here for constructing an upper bound is to learn the mean response of training subjects in each group for each lottery, and predict those means. With sufficiently many training subjects, this method approximates the best possible accuracy. We find that although the CPT error is substantially different from zero, the model is again nearly complete.

Prediction Error Completeness
Naive Benchmark 104.17 0%
Expected Utility 86.68 36%
CPT 57.14 96%
Table Lookup 55.45 100%
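The group-level benchmark used here (predict, for each group and lottery, the mean response of that group’s training subjects) can be sketched as follows. The sketch takes group labels as given, so any clustering step that assigns a label to each subject could precede it; the row format, the function names, and the toy numbers are hypothetical.

```python
from collections import defaultdict


def fit_group_means(train_rows):
    """train_rows: iterable of (group, lottery_id, certainty_equivalent)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, lottery, ce in train_rows:
        sums[(group, lottery)] += ce
        counts[(group, lottery)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def group_lookup_mse(means, test_rows):
    """Mean-squared error of predicting each test response by its cell mean."""
    errs = [(means[(g, lot)] - ce) ** 2 for g, lot, ce in test_rows]
    return sum(errs) / len(errs)


# Toy data: two groups, one lottery (all values hypothetical).
train = [(0, "L1", 40.0), (0, "L1", 44.0), (1, "L1", 60.0), (1, "L1", 64.0)]
test = [(0, "L1", 43.0), (1, "L1", 61.0)]
means = fit_group_means(train)
print(means[(0, "L1")], means[(1, "L1")])  # 42.0 62.0
print(group_lookup_mse(means, test))       # 1.0
```

Because the table has one row per (group, lottery) pair, it needs enough training subjects in every group for the cell means to be reliable.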

Because the same clustering method is used in all of the approaches, the gap between Table Lookup and the existing models does not shed light on how much predictions could be improved by better ways of grouping the subjects. The comparison of Table Lookup’s performance here, 55.45, with its performance from Section 2.1, 65.58, sheds light on the size of predictive gains achieved by the present method for clustering.

3.2 Comparing Feature Sets

In addition to evaluating the predictive limits of a feature set and the completeness of existing models, Table Lookup can be used to compare the predictive power of different feature sets. We illustrate this potential comparison by revisiting our problem from Section 2.3 (predicting human generation of randomness) and evaluating the predictive value of certain features. To do this, we consider “compressed” Table Lookup algorithms built on different properties of the string, where strings of the same type are bucketed into the same row, and focus on the predictive value of two properties: number of Heads, and flips 4-7. Our compressed Table Lookup based on the number of Heads partitions the set of length-7 strings depending on the total number of ‘H’ flips in the string, and learns a prediction for each partition element; similarly, our compressed Table Lookup based on flips 4-7 partitions the set of strings depending only on outcomes including and after flip 4. Just as our original Table Lookup algorithm returned an approximation of the highest level of predictive accuracy using the full structure of initial flip data, these compressed Table Lookup algorithms approximate the highest level of predictive accuracy that is achievable using a particular kind of structure in the strings.

                    Error             Completeness
Naive Benchmark     0.25              0%
Flips 4-7           0.2478 (0.0010)   36%
Number of Heads     0.2464 (0.0009)   59%
Full Table Lookup   0.2441            100%
Table 9: Comparison of the value of various feature sets.

We find that these simple features achieve large fractions of the achievable improvement over the naive rule of always predicting that the probability of ‘H’ is 0.5. For example, using only the number of Heads as a feature achieves 59% of the improvement of full Table Lookup. Using only the most recent four flips (flips 4-7) achieves 36% of the predictive improvement that is achieved by using all seven initial flips; the fact that this improvement is not complete demonstrates that there is predictive content in flips 1-3 beyond what is captured in flips 4-7.
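A compressed Table Lookup differs from the full one only in how rows are keyed: instead of the full seven-flip prefix, each row corresponds to a value of some feature of the prefix. A minimal sketch, assuming strings are 8-character Python strings over "H" and "T" (function names and toy data are illustrative):

```python
from collections import defaultdict


def fit_compressed_lookup(train_strings, feature):
    """Table Lookup whose rows are values of feature(prefix), not full prefixes."""
    counts = defaultdict(lambda: [0, 0])  # feature value -> [num H, num total]
    for s in train_strings:
        key = feature(s[:7])
        counts[key][0] += s[7] == "H"
        counts[key][1] += 1
    return {k: h / n for k, (h, n) in counts.items()}


def num_heads(prefix):
    return prefix.count("H")


def flips_4_to_7(prefix):
    return prefix[3:]  # flips 4, 5, 6, 7 (1-indexed)


def full_prefix(prefix):
    return prefix      # recovers the uncompressed table


train = ["HHHTTTTH", "TTTHHHTT", "HTHTHTTH", "THTHHTHT"]
by_heads = fit_compressed_lookup(train, num_heads)
by_tail = fit_compressed_lookup(train, flips_4_to_7)
print(by_heads)
print(by_tail)
```

Number of Heads yields at most 8 rows and flips 4-7 yields at most 16, compared with 128 rows for the full table, which is why the compressed tables can be estimated from far less data.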

The feature set corresponding to our “Full Table Lookup” is itself partial relative to an even richer feature set. It is interesting to consider what might constitute a set of unmeasured features of the human participant’s behavior that would significantly improve predictive accuracy, for example the speed at which the strings were entered. The exercise in Section 3.1, in which we used subject types (determined based on choices in auxiliary problems), constitutes yet another way to expand the table. As we have shown above, the application and comparison of Table Lookup for different feature sets is one potentially useful approach for measuring the predictive value of those features. [25]

[25] Note that the value of individual features will in general depend on what other features are available.

4 Conclusion

When evaluating the predictive performance of a theory, it is important to know not just whether the theory is predictive, but also how complete its predictive performance is. We propose the use of Table Lookup as a way to measure the best achievable predictive performance for a given problem, and the completeness of a model as a measure of how close it comes to this bound. We demonstrate three domains in which completeness can help us to evaluate the performance of existing models.

The present paper has focused on the criterion of predictiveness. When we take other criteria into account, such as the interpretability or generality of the model, then we may prefer models that are not 100% complete by the measure proposed here—for example, we may prefer to sacrifice some predictive power in return for higher explainability, as in Fudenberg and Liang (2018).

Finally, we note that all the tests mentioned so far involve training and testing models on data drawn from the same domain. A question for future work would be how to compare the transferability of models across domains. Indeed, we may expect that economic models that are outperformed by machine learning models in a given domain have higher transfer performance outside of the domain. In this sense, within-domain completeness may provide an incomplete measure of the “overall completeness” of the model, and we leave development of such notions to future work.


  • Aaron Bodoh-Creed and Hickman (2019) Aaron Bodoh-Creed, J. B. and B. Hickman (2019): “Using Machine Learning to Explain Price Dispersion,” .
  • Bar-Hillel and Wagenaar (1991) Bar-Hillel, M. and W. Wagenaar (1991): “The Perception of Randomness,” Advances in Applied Mathematics.
  • Barberis et al. (1998) Barberis, N., A. Shleifer, and R. Vishny (1998): “A Model of Investor Sentiment,” Journal of Financial Economics.
  • Batzilis et al. (2016) Batzilis, D., S. Jaffe, S. Levitt, J. A. List, and J. Picel (2016): “How Facebook Can Deepen our Understanding of Behavior in Strategic Settings: Evidence from a Million Rock-Paper-Scissors Games,” Working Paper.
  • Bourgin et al. (2019) Bourgin, D. D., J. C. Peterson, D. Reichman, T. L. Griffiths, and S. J. Russell (2019): “Cognitive Model Priors for Predicting Human Decisions,” CoRR, abs/1905.09397.
  • Bruhin et al. (2010) Bruhin, A., H. Fehr-Duda, and T. Epper (2010): “Risk and Rationality: Uncovering Heterogeneity in Probability Distortion,” Econometrica.
  • Camerer et al. (2004) Camerer, C. F., T.-H. Ho, and J.-K. Chong (2004): “A cognitive hierarchy model of games,” The Quarterly Journal of Economics, 119, 861–898.
  • Camerer et al. (2018) Camerer, C. F., G. Nave, and A. Smith (2018): “Dynamic unstructured bargaining with private information: theory, experiment, and outcome prediction via machine learning,” Management Science.
  • Chen et al. (2016) Chen, D., K. Shue, and T. Moskowitz (2016): “Decision-Making under the Gambler’s Fallacy: Evidence from Asylum Judges, Loan Officers, and Baseball Umpires,” Quarterly Journal of Economics.
  • Crawford et al. (2013) Crawford, V. P., M. A. Costa-Gomes, and N. Iriberri (2013): “Structural models of nonequilibrium strategic thinking: Theory, evidence, and applications,” Journal of Economic Literature, 51, 5–62.
  • Domingos (2000) Domingos, P. (2000): “A Unified Bias-Variance Decomposition and its Applications,” Proc. 17th International Conf. on Machine Learning.
  • Erev et al. (2007) Erev, I., A. E. Roth, R. L. Slonim, and G. Barron (2007): “Learning and equilibrium as useful approximations: Accuracy of prediction on randomly selected constant sum games,” Economic Theory, 33, 29–51.
  • Fudenberg and Liang (2018) Fudenberg, D. and A. Liang (2018): “Predicting and Understanding Initial Play,” Working Paper.
  • Gneiting and Raftery (2007) Gneiting, T. and A. E. Raftery (2007): “Strictly Proper Scoring Rules, Prediction, and Estimation,” Journal of the American Statistical Association.
  • Hastie et al. (2009) Hastie, T., R. Tibshirani, and J. Friedman (2009): The Elements of Statistical Learning, Springer.
  • Kleinberg et al. (2017) Kleinberg, J., H. Lakkaraju, J. Leskovec, J. Ludwig, and S. Mullainathan (2017): “Human Decisions and Machine Predictions,” The Quarterly Journal of Economics.
  • Nagel (1995) Nagel, R. (1995): “Unraveling in Guessing Games: An Experimental Study,” American Economic Review, 85, 1313–1326.
  • Nickerson and Butler (2009) Nickerson, R. and S. Butler (2009): “On Producing Random Sequences,” American Journal of Psychology.
  • Noti et al. (2016) Noti, G., E. Levi, Y. Kolumbus, and A. Daniely (2016): “Behavior-Based Machine-Learning: A Hybrid Approach for Predicting Human Decision Making,” CoRR, abs/1611.10228.
  • Ori Plonsky (2017) Plonsky, O., I. Erev, T. Hazan, and M. Tennenholtz (2017): “Psychological forest: Predicting human behavior,” AAAI Conference on Artificial Intelligence.
  • Peysakhovich and Naecker (2017) Peysakhovich, A. and J. Naecker (2017): “Using Methods from Machine Learning to Evaluate Behavioral Models of Choice Under Risk and Ambiguity,” Journal of Economic Behavior and Organization.
  • Plonsky et al. (2019) Plonsky, O., R. Apel, E. Ert, M. Tennenholtz, D. Bourgin, J. C. Peterson, D. Reichman, T. L. Griffiths, S. J. Russell, E. C. Carter, J. F. Cavanagh, and I. Erev (2019): “Predicting human decisions with behavioral theories and machine learning,” CoRR, abs/1904.06866.
  • Rabin (2000) Rabin, M. (2000): “Risk Aversion and Expected-utility Theory: A Calibration Theorem,” Econometrica, 68, 1281–1292.
  • Rabin (2002) ——— (2002): “Inference by Believers in the Law of Small Numbers,” The Quarterly Journal of Economics.
  • Rabin and Vayanos (2010) Rabin, M. and D. Vayanos (2010): “The Gambler’s and Hot-Hand Fallacies: Theory and Applications,” Review of Economic Studies.
  • Rapaport and Budescu (1997) Rapaport, A. and D. Budescu (1997): “Randomization in Individual Choice Behavior,” Psychological Review.
  • Samuelson (1952) Samuelson, P. (1952): “Probability, Utility, and the Independence Axiom,” Econometrica.
  • Savage (1954) Savage, L. (1954): The Foundations of Statistics, J. Wiley.
  • Stahl and Wilson (1994) Stahl, D. O. and P. W. Wilson (1994): “Experimental evidence on players’ models of other players,” Journal of Economic Behavior and Organization, 25, 309–327.
  • Stahl and Wilson (1995) ——— (1995): “On players’ models of other players: Theory and experimental evidence,” Games and Economic Behavior, 10, 218–254.
  • Tversky and Kahneman (1971) Tversky, A. and D. Kahneman (1971): “The Belief in the Law of Small Numbers,” Psychological Bulletin.
  • Tversky and Kahneman (1992) ——— (1992): “Advances in Prospect Theory: Cumulative Representation of Uncertainty,” Journal of Risk and Uncertainty, 5, 297–323.
  • von Neumann and Morgenstern (1944) von Neumann, J. and O. Morgenstern (1944): Theory of Games and Economic Behavior, Princeton University Press.
  • Wright and Leyton-Brown (2014) Wright, J. R. and K. Leyton-Brown (2014): “Level-0 meta-models for predicting human behavior in games,” Proceedings of the fifteenth ACM conference on Economics and computation, 857–874.

Appendix A Is Table Lookup the most predictive algorithm for our data?

In the main text, we use the performance of Table Lookup as an approximation of the best possible accuracy. Below we investigate whether the data sets we study are large enough for this to be a good approximation.

We first review some results from the machine learning and statistics literatures, which explain how the cross-validated standard errors that we report in the main text can be used as a measure for how well the Table Lookup error approximates the irreducible error (Section A.1).

In Section A.2, we compare Table Lookup’s performance with that of bagged decision trees, an algorithm that scales better to smaller quantities of data. We find that in each of our prediction problems, the two prediction errors are similar, and Table Lookup weakly outperforms bagged decision trees. Finally, in Section A.3, we study the sensitivity of the Table Lookup performance to the quantity of data. The predictive accuracies achieved using our full data sets are very close to those achieved using, for example, just 70% of the data. This again suggests that only minimal improvements in predictive accuracy are feasible from further increases in data size.

A.1 Cross-Validated Standard Error

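Suppose that the loss function is mean-squared error: $\ell(y', y) = (y' - y)^2$. (Similar arguments apply for the misclassification rate; see e.g. Domingos (2000).) Let

$$f^*(x) = \mathbb{E}_P[y \mid x]$$

be the ideal prediction rule discussed in Section 1.2, which assigns to each $x$ its expected outcome under distribution $P$. Write $\hat{f}$ for the random Table Lookup prediction rule that has been estimated from a set of $n$ i.i.d. training observations. The expected mean-squared error of $\hat{f}$ on a new observation $(x, y)$ can be decomposed as follows (Hastie et al., 2009):

$$\mathbb{E}\left[(\hat{f}(x) - y)^2\right] = \mathbb{E}\left[(f^*(x) - y)^2\right] + \mathbb{E}\left[\left(\mathbb{E}[\hat{f}(x) \mid x] - f^*(x)\right)^2\right] + \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x) \mid x]\right)^2\right],$$

where the expectation is both over the realization of the training data used to train Table Lookup, and also over the realization of the test observation $(x, y)$.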

The first component is the irreducible noise introduced in (1). The second component, bias, is the mean-squared difference between the expected Table Lookup prediction and the prediction of the ideal prediction rule. The final component, sampling error or variance, is the variance of the Table Lookup prediction (reflecting the sensitivity of the algorithm to the training data).

Since Table Lookup is an unbiased estimator, the second component is zero. Thus, irreducible noise is the difference between the expected Table Lookup error and the sampling error of the Table Lookup predictor. As described in Section 1.3, we follow standard procedures of using the cross-validated prediction error to estimate the expected Table Lookup error, and using the variance of the cross-validated prediction errors to estimate the sampling error (Hastie et al., 2009). That is,

$$\text{Sampling Error} \approx \frac{1}{K}\,\widehat{\mathrm{Var}}(CV_1, \dots, CV_K),$$

where $K$ is the number of cross-validation iterations, $\widehat{\mathrm{Var}}$ denotes the sample variance, and $CV_k$ is the prediction error for the $k$-th iteration of cross-validation. The right-hand side of the display is the square of the cross-validated standard errors reported in the main text; thus, we have from Tables 2, 4, and 7:

Table Lookup Error Sampling Error
Risk Preferences 65.58 9
Predicting Initial Play, Data Set A 0.41 0.0001
Predicting Initial Play, Data Set B 0.34 0.0001
Human Generation of Random Sequences 0.2441 0.0001
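The cross-validated error and its standard error can be computed along the following lines. This is a generic sketch, not the authors’ code: the helper names, the fold construction, and in particular the normalization of the standard error (sample variance of the fold errors divided by the number of folds) are assumptions.

```python
import math
import random


def kfold_errors(data, fit, loss, k=10, seed=0):
    """Return the k held-out mean losses from k-fold cross-validation."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    errors = []
    for i, test in enumerate(folds):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        predict = fit(train)
        errors.append(sum(loss(predict, r) for r in test) / len(test))
    return errors


def mean_and_se(errors):
    k = len(errors)
    mean = sum(errors) / k
    var = sum((e - mean) ** 2 for e in errors) / (k - 1)  # sample variance
    return mean, math.sqrt(var / k)  # standard error of the mean (assumed form)


# Toy check: fair-coin outcomes scored against a constant 0.5 prediction.
rng = random.Random(1)
outcomes = [rng.randint(0, 1) for _ in range(1000)]
errs = kfold_errors(outcomes, fit=lambda train: 0.5,
                    loss=lambda p, y: (p - y) ** 2)
print(mean_and_se(errs))  # (0.25, 0.0): every squared error equals 0.25
```

Squaring the second element returned by `mean_and_se` gives the quantity reported in the "Sampling Error" column under this assumed normalization.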

A.2 Comparison with Scalable Machine Learning Algorithms

Another way to evaluate whether our Table Lookup algorithm approximates the best possible prediction accuracy is to compare it with the performance of other machine learning algorithms. Below we compare its performance with bagged decision trees (also known as bootstrap-aggregated decision trees). This algorithm creates several bootstrapped data sets from the training data by sampling with replacement, and then trains a decision tree on each bootstrapped training set. Decision trees are nonlinear prediction models that recursively partition the feature space and learn a (best) constant prediction for each partition element. The prediction of the bagged decision tree algorithm is an aggregation of the predictions of individual decision trees. When the loss function is mean-squared error, the decision tree ensemble predicts the average of the predictions of the individual trees. When the loss function is misclassification rate, the decision tree ensemble predicts based on a majority vote across the ensemble of trees.

Table 10 shows that for each prediction problem, the error of the bagged decision tree algorithm is comparable to and slightly worse than that of the Table Lookup algorithm. These results again suggest that the Table Lookup error is a reasonable approximation for the best achievable error.

Risk Games A Games B Sequences
Bagged Decision Trees 65.65 0.45 0.36 0.2442
(0.10) (0.004) (0.005) (0.0005)
Table Lookup 65.58 0.41 0.34 0.2441
(3.00) (0.005) (0.006) (0.0006)
Table 10: Table Lookup outperforms Bagged Decision Trees in each of our prediction problems.

A.3 Performance of Table Lookup on Smaller Samples

Finally, we report the Table Lookup cross-validated performance on random samples of $x$% of our data, where $x \in \{10, 20, \dots, 100\}$. For each $x$, we repeat the procedure 1000 times, and report the average performance across iterations. We find that the Table Lookup performance flattens out for larger values of $x$, suggesting that the quantity of data we have is indeed large enough that further increases in the data size will not substantially improve predictive performance.

% Risk Games A Games B Sequences
10% 69.47 0.4191 0.3473 0.2592
(11.13) (0.012) (0.018) (0.0034)
20% 67.13 0.4183 0.3476 0.2504
(7.95) (0.0018) (0.024) (0.0018)
30% 66.28 0.4178 0.3472 0.2479
(6.51) (0.0022) (0.0029) (0.0014)
40% 66.25 0.4169 0.3470 0.2464
(5.65) (0.0024) (0.0032) (0.0011)
50% 65.68 0.4157 0.3459 0.2458
(4.59) (0.0025) (0.0036) (0.0010)
60% 65.68 0.4141 0.3449 0.2452
(4.24) (0.0027) (0.0040) (0.0008)
70% 65.68 0.4131 0.3435 0.2448
(3.95) (0.0031) (0.0045) (0.0007)
80% 65.68 0.4119 0.3427 0.2445
(3.95) (0.0034) (0.0046) (0.0007)
90% 65.66 0.4109 0.3416 0.2443
(3.71) (0.0034) (0.0047) (0.0007)
100% 65.58 0.4100 0.3404 0.2441
(3.00) (0.0036) (0.0051) (0.0006)
Table 11: Performance of Table Lookup using $x$% of the data, averaged over 100 iterations for each $x$.
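The subsampling exercise behind Table 11 can be sketched as follows for the random-sequence problem. The sketch is illustrative only: the function names, the number of repeats, and the simulated fair-coin data are assumptions, and prefixes unseen in training are given the naive prediction of 0.5.

```python
import random
from collections import defaultdict


def lookup_cv_error(strings, k=10, seed=0):
    """k-fold cross-validated MSE of Table Lookup on 8-flip strings."""
    rows = list(strings)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    fold_errors = []
    for i, test in enumerate(folds):
        counts = defaultdict(lambda: [0, 0])  # prefix -> [num H, num total]
        for j, fold in enumerate(folds):
            if j != i:
                for s in fold:
                    counts[s[:7]][0] += s[7] == "H"
                    counts[s[:7]][1] += 1
        sq = []
        for s in test:
            h, n = counts.get(s[:7], (1, 2))  # unseen prefix -> predict 0.5
            sq.append((h / n - (s[7] == "H")) ** 2)
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / k


def learning_curve(strings, fractions, repeats=20, seed=0):
    """Average cross-validated error on random subsamples of each size."""
    rng = random.Random(seed)
    curve = {}
    for frac in fractions:
        m = int(frac * len(strings))
        runs = [lookup_cv_error(rng.sample(strings, m), seed=r)
                for r in range(repeats)]
        curve[frac] = sum(runs) / repeats
    return curve


# Simulated fair-coin strings (hypothetical data, not the experimental data).
gen = random.Random(0)
data = ["".join(gen.choice("HT") for _ in range(8)) for _ in range(2000)]
curve = learning_curve(data, [0.1, 0.5, 1.0], repeats=5)
print(curve)
```

On simulated fair-coin data there is nothing to learn, so the error should approach 0.25 from above as the sample grows; the gap at small samples is pure sampling error, which is the quantity the exercise is designed to reveal.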

Appendix B Experimental Instructions for Section 2.3

Subjects on Mechanical Turk were presented with the following introduction screen:

Appendix C Supplementary Material to Section 2.3

C.1 Robustness

Here we check how our results in Section 2.3 change when the outcome space and error function are changed so that prediction functions are maps $f:\{H,T\}^7 \to \{H,T\}$ and the error for predicting the test data set $\{(s_i^1,\dots,s_i^8)\}_{i=1}^n$ is defined to be

$$\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left(f(s_i^1,\dots,s_i^7) \neq s_i^8\right),$$

i.e. the misclassification rate. We use as a naive benchmark the prediction rule that guesses $H$ and $T$ uniformly at random; this is guaranteed an expected misclassification rate of 0.50.

For this problem, the Table Lookup benchmark learns the modal continuation for each sequence in $\{H,T\}^7$. We find that the completeness of Rabin (2002) and Rabin and Vayanos (2010) relative to the Table Lookup benchmark are respectively 19% and 9%.

                           Error           Completeness
Naive Benchmark            0.50            0%
Rabin (2002)               0.45 (0.003)    19%
Rabin & Vayanos (2010)     0.475 (0.01)    9%
Table Lookup               0.23 (0.002)    100%
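Under this alternative formulation, the lookup table stores the most common eighth flip for each prefix rather than a frequency, and error is the share of wrong guesses. A minimal sketch, with hypothetical function names and an arbitrary default guess for prefixes that never appear in training:

```python
from collections import Counter, defaultdict


def fit_modal_lookup(train_strings):
    """Map each 7-flip prefix to its most common eighth flip in training data."""
    conts = defaultdict(Counter)
    for s in train_strings:
        conts[s[:7]][s[7]] += 1
    return {p: c.most_common(1)[0][0] for p, c in conts.items()}


def misclassification_rate(table, test_strings, default="H"):
    # `default` for unseen prefixes is an arbitrary choice in this sketch.
    wrong = sum(table.get(s[:7], default) != s[7] for s in test_strings)
    return wrong / len(test_strings)


train = ["HHHHHHHT", "HHHHHHHT", "HHHHHHHH", "TTTTTTTH"]
table = fit_modal_lookup(train)
print(table["HHHHHHH"])                                         # T
print(misclassification_rate(table, ["HHHHHHHT", "TTTTTTTT"]))  # 0.5
```

Completeness is then computed exactly as before, with misclassification rates in place of mean-squared errors.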

C.2 Different Cuts of the Data

Initial strings only.

We repeat the analysis in Section 2.3 using data from all subjects, but only their first 25 strings. This selection accounts for potential fatigue in generation of the final strings, and leaves a total of 638 subjects and 15,950 strings. Prediction results for our main exercise are shown below using this alternative selection.

                           Error             Completeness
Naive Benchmark            0.25              0%
Rabin & Vayanos (2010)     0.2491 (0.0008)   5%
Table Lookup               0.2326 (0.0030)   100%

Removing the least random subjects.

For each subject, we conduct a Chi-squared test for the null hypothesis that their strings were generated under a Bernoulli process. We order subjects by $p$-values and remove the 100 subjects with the lowest $p$-values (subjects whose generated strings were most different from what we would expect under a Bernoulli process). This leaves a total of 538 subjects and 24,550 strings. Prediction results for our main exercise are shown below using this alternative selection.

                           Error             Completeness
Naive Benchmark            0.25              0%
Rabin & Vayanos (2010)     0.2491 (0.0005)   12%
Table Lookup               0.2427 (0.0016)   100%