Human behavior can vary widely across individuals. For instance, due to varying risk preferences, some people arrive extremely early at an airport, while others arrive at the last minute. Being able to infer these latent psychological traits, such as risk preferences or discount factors for future rewards, is of broad multi-disciplinary interest across psychology, behavioral economics, and machine learning. As machine learning finds broader societal usage, understanding users’ latent preferences is crucial to personalizing these data-driven systems to individual users.
In order to infer such psychological traits, which require cognitive rather than physiological assessment (e.g. blood tests), we need an interactive environment to engage users and elicit their behavior. Approaches to do so have ranged from questionnaires Cohen et al. (1983); Kroenke et al. (2001); Diener et al. (2010); Russell (1996) to games Bechara et al. (1994); Dombrovski et al. (2010); Moustafa et al. (2008); McGuire and Kable (2012) that involve planning and decision making. It is this latter, game-based approach that we consider in this paper. However, there has been some recent criticism of such manually-designed games Buelow and Suhr (2009); Charness et al. (2013); Crosetto and Filippin (2016). In particular, a game is said to be effective, or behavior diagnostic, if the differing latent traits of players can be accurately inferred from their game play behavior. Manually-designed games, by contrast, are typically specified using heuristics that may not always be reliable or efficient for distinguishing human traits from game play behavior.
As a motivating example, consider a game environment where the player can choose to stay or moveRight on a Path. Each state on the Path has a reward. The player accumulates the reward as they move to a state. Suppose players have different preferences (e.g. some might magnify positive reward and not care too much about negative reward, while others might do the opposite), but are otherwise rational, so that they choose optimal strategies given their respective preferences. If we want to use this game to tell such players apart, how should we assign reward to each state in the Path? Heuristically, one might suggest there should be positive reward to lure gain-seeking players and negative reward to discourage the loss-averse ones, as shown in Figure 0(a). However, as shown in Figure 0(b) and 1(a), the induced behavior (either policy or sampled trajectories) is similar for players with different loss preferences, and consequently not helpful in distinguishing them based on their game play behavior. In Figure 0(c), an alternative reward design is shown, which elicits more distinguishable behavior (see Figure 0(d) and 1(b)). This example illustrates that it is nontrivial to design effective games based on intuitive heuristics, and a more systematic approach is needed.
Using these rewards, the policies of different players are shown on the right. For each player, its policy specifies the probability of taking an action (stay or moveRight) at each state. For example, Loss-Neutral’s policy in 0(d) shows that it is more likely to choose stay than moveRight at the first (i.e. left-most) state, while in the second to fifth states, choosing moveRight has a higher probability.
In this work, we formalize this task of behavior diagnostic game design, introducing a general framework consisting of a player space, game space, and interaction model. We use mutual information to quantitatively capture game effectiveness, and present a practical algorithm that optimizes a variational lower bound. We instantiate our framework by setting the player space using prospect theory Kahneman and Tversky (1979), and setting the game space and interaction model using Markov Decision Processes Howard (1960). Empirically, our quantitative optimization-based approach designs games that are more effective at inferring players’ latent traits, outperforming manually-designed games by a large margin. In addition, we study how the assumptions in our instantiation affect the effectiveness of the designed games, showing that they are robust to modest misspecification of the assumptions.
2 Behavior Diagnostic Game Design
We consider the problem of designing interactive games that are informative in the sense that a player’s type can be inferred from their play. A game-playing process contains three components: a player $z$, a game $\mathcal{M}$, and an interaction model $\Psi$. Here, we assume each player (which is represented by its latent trait) lies in some player space $\mathcal{Z}$. We also denote by $Z$ the random variable corresponding to a randomly selected player from $\mathcal{Z}$ with respect to some prior (e.g. uniform) distribution $p_Z$ over $\mathcal{Z}$. Further, we assume there is a family of parameterized games $\{\mathcal{M}_\theta : \theta \in \Theta\}$. Given a player $z \sim p_Z$ and a game $\mathcal{M}_\theta$, the interaction model $\Psi$ describes how a behavioral observation $x$ from some observation space $\mathcal{X}$ is generated. Specifically, each round of game play generates behavioral observations $x \in \mathcal{X}$ as $x \sim \Psi(z, \mathcal{M}_\theta)$, where the interaction model $\Psi(z, \mathcal{M}_\theta)$ is some distribution over the observation space $\mathcal{X}$. In this work, we assume $p_Z$ and $\Psi$ are fixed and known. Our goal is to design a game $\mathcal{M}_\theta$ such that the generated behavioral observations $x$ are most informative for inferring the player $z$.
2.1 Maximizing Mutual Information
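Our problem formulation introduces a probabilistic model over the players $Z$ (as specified by the prior distribution $p_Z$) and the behavioral observations $X$ (as specified by $\Psi$), so that

$$Z \sim p_Z, \qquad X \mid Z = z \sim \Psi(z, \mathcal{M}_\theta). \tag{1}$$

Our goal can then be interpreted as maximizing the information on $Z$ contained in $X$, which can be captured by the mutual information between $Z$ and $X$:

$$I(\theta) = I(Z; X) = H(Z) - H(Z \mid X), \tag{2}$$

so that the mutual information is a function of the game parameters $\theta$.
(Behavior Diagnostic Game Design) Given a player space $\mathcal{Z}$, a family of parameterized games $\{\mathcal{M}_\theta : \theta \in \Theta\}$, and an interaction model $\Psi$, our goal is to find:

$$\theta^* = \arg\max_{\theta \in \Theta} I(\theta). \tag{3}$$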
2.2 Variational Mutual Information Maximization
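It is difficult to directly optimize the mutual information objective in Eq (2), as it requires access to a posterior $p(z \mid x)$ that does not have a closed analytical form. Following the derivations in Chen et al. (2016) and Agakov (2004), we opt to maximize a variational lower bound of the mutual information objective. Letting $q(z \mid x)$ denote any variational distribution that approximates $p(z \mid x)$, and $H(Z)$ denote the marginal entropy of $Z$, we can bound the mutual information as:

$$\begin{aligned}
I(\theta) &= H(Z) + \mathbb{E}_{x}\big[\mathbb{E}_{z \sim p(\cdot \mid x)}[\log p(z \mid x)]\big] \\
&= H(Z) + \mathbb{E}_{x}\big[D_{\mathrm{KL}}\big(p(\cdot \mid x)\,\|\,q(\cdot \mid x)\big) + \mathbb{E}_{z \sim p(\cdot \mid x)}[\log q(z \mid x)]\big] \\
&\geq H(Z) + \mathbb{E}_{z \sim p_Z,\; x \sim \Psi(z, \mathcal{M}_\theta)}[\log q(z \mid x)], \qquad (6)
\end{aligned}$$

so that the expression in Eq (6) forms a variational lower bound for the mutual information $I(\theta)$. The inequality holds because the KL divergence is non-negative, and it is tight when $q(z \mid x)$ equals the true posterior $p(z \mid x)$.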
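The bound can be checked numerically on a toy problem. The following is a minimal sketch, assuming a discrete player space with two players and a binary observation; the prior, the likelihood table, and all function names are hypothetical and chosen only for illustration. It computes the exact mutual information and the variational bound for two choices of the variational distribution.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def joint(prior, likelihood):
    """joint[z][x] = prior[z] * likelihood[z][x]."""
    return [[pz * lx for lx in row] for pz, row in zip(prior, likelihood)]


def exact_posterior(prior, likelihood):
    """post[x][z] = p(z | x), obtained with Bayes' rule."""
    pzx = joint(prior, likelihood)
    post = []
    for x in range(len(likelihood[0])):
        px = sum(pzx[z][x] for z in range(len(prior)))
        post.append([pzx[z][x] / px for z in range(len(prior))])
    return post


def variational_bound(prior, likelihood, q):
    """H(Z) + E_{z,x}[log q(z | x)], where q[x][z] approximates p(z | x)."""
    pzx = joint(prior, likelihood)
    expected_log_q = sum(
        pzx[z][x] * math.log(q[x][z])
        for z in range(len(prior))
        for x in range(len(likelihood[0]))
        if pzx[z][x] > 0
    )
    return entropy(prior) + expected_log_q


def mutual_information(prior, likelihood):
    """Exact I(Z; X): the bound evaluated at the true posterior."""
    return variational_bound(prior, likelihood, exact_posterior(prior, likelihood))


# Hypothetical toy numbers: two player types, binary observation.
prior = [0.5, 0.5]
likelihood = [[0.9, 0.1], [0.2, 0.8]]  # likelihood[z][x] = p(x | z)

mi = mutual_information(prior, likelihood)
loose = variational_bound(prior, likelihood, [[0.6, 0.4], [0.4, 0.6]])
```

With the exact posterior the bound coincides with the mutual information, while a cruder approximation such as the second one gives a strictly smaller value. This is why improving the variational distribution and improving the game can be done jointly by ascending the same objective.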
3 Instantiation: Prospect Theory based MDP Design
Our framework in Section 2 provides a systematic view of behavior diagnostic game design, and each of its components can be chosen based on contexts and applications. We present one instantiation by setting the player space $\mathcal{Z}$, the game $\mathcal{M}_\theta$, and the interaction model $\Psi$. For the player space $\mathcal{Z}$, we use prospect theory Kahneman and Tversky (1979) to describe how players perceive or distort values. We model the game $\mathcal{M}_\theta$ as a Markov Decision Process Howard (1960). Finally, a (noisy) value iteration is used to model players’ planning and decision-making, which is part of the interaction model $\Psi$. In the next subsection, we provide a brief background of these key ingredients.
Prospect theory Kahneman and Tversky (1979) describes the phenomenon that different people can perceive the same numerical values differently. For example, people who are averse to loss (e.g. those for whom not losing \$5 matters more than winning \$5) magnify the negative reward or penalty. Following Ratliff et al. (2017), we use the following distortion function to describe how people shrink or magnify numerical values,
where $r_{\text{ref}}$ is the reference point that people compare numerical values against, and $\xi_{\text{pos}}$ and $\xi_{\text{neg}}$ are the amounts of distortion applied to the positive and negative amounts of the reward with respect to the reference point.
We use this framework of prospect theory to specify our player space. Specifically, we represent a player $z$ by their personalized distortion parameters, so that $z = (\xi_{\text{pos}}, \xi_{\text{neg}})$. In this work, unless we specify otherwise, we assume that $r_{\text{ref}}$ is set to zero. Given these distortion parameters, the players perceive a distorted value of any reward in the game, as we detail in the discussion of the interaction model subsequently.
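To make the role of the distortion parameters concrete, here is a minimal sketch of one distortion function consistent with the description above. The piecewise-linear form, the function name `distort`, and the example parameter values are all assumptions made for illustration; they are not taken from the paper.

```python
def distort(reward, xi_pos, xi_neg, r_ref=0.0):
    """Perceived value of `reward` for a player with the given distortion parameters.

    Assumed form: gains relative to the reference point are scaled by xi_pos,
    losses relative to the reference point are scaled by xi_neg.
    """
    gap = reward - r_ref
    return xi_pos * gap if gap >= 0 else xi_neg * gap


# Hypothetical players (parameter values are illustrative only).
loss_averse = {"xi_pos": 1.0, "xi_neg": 2.0}   # magnifies penalties
gain_seeking = {"xi_pos": 2.0, "xi_neg": 0.5}  # magnifies gains, shrinks penalties

perceived_by_loss_averse = [distort(r, **loss_averse) for r in (-5.0, 5.0)]
perceived_by_gain_seeking = [distort(r, **gain_seeking) for r in (-5.0, 5.0)]
```

Under this assumed form, the same pair of rewards (-5, 5) is perceived as (-10, 5) by the first player and as (-2.5, 10) by the second, which is the kind of difference a well-designed game should turn into visibly different behavior.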
Markov Decision Process
A Markov Decision Process (MDP) $\mathcal{M}$ is defined by $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. For each state-action pair $(s, a)$, $T(\cdot \mid s, a)$ is a probability distribution over the next state. $R$ specifies a reward function, and $\gamma$ is a discount factor. We assume that both states and actions are discrete and finite. For all $s \in \mathcal{S}$, a policy $\pi(\cdot \mid s)$ defines a distribution over actions to take at state $s$. A policy for an MDP $\mathcal{M}$ is denoted as $\pi_{\mathcal{M}}$.
Given a Markov Decision Process $\mathcal{M}$, value iteration is an algorithm that can be written as a simple update operation, which combines one sweep of policy evaluation and one sweep of policy improvement Sutton and Barto (2018),
Value iteration converges to an optimal policy for discounted finite MDPs Sutton and Barto (2018).
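As an illustration of the update, the following is a minimal sketch of tabular value iteration on a small Path environment. It assumes that the reward depends only on the state being entered, that moveRight at the last state leaves the player in place, and that the greedy policy is read off at the end; the reward values and function names are hypothetical, and the noisy policy used in the interaction model is not reproduced here.

```python
def q_values(transition, reward, values, gamma):
    """Q(s, a) = sum over s' of T(s' | s, a) * (R(s') + gamma * V(s'))."""
    return [
        [
            sum(p * (reward[s2] + gamma * values[s2]) for s2, p in enumerate(row))
            for row in actions
        ]
        for actions in transition
    ]


def value_iteration(transition, reward, gamma=0.9, tol=1e-10, max_sweeps=10_000):
    """Repeat the Bellman optimality update until the values stop changing."""
    values = [0.0] * len(transition)
    for _ in range(max_sweeps):
        new_values = [max(q_s) for q_s in q_values(transition, reward, values, gamma)]
        delta = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if delta < tol:
            break
    q = q_values(transition, reward, values, gamma)
    greedy = [max(range(len(q_s)), key=q_s.__getitem__) for q_s in q]
    return values, greedy


# Hypothetical 4-state Path: action 0 = stay, action 1 = moveRight (deterministic).
n = 4
transition = []
for s in range(n):
    stay = [1.0 if s2 == s else 0.0 for s2 in range(n)]
    right = [1.0 if s2 == min(s + 1, n - 1) else 0.0 for s2 in range(n)]
    transition.append([stay, right])
reward = [0.0, -1.0, 0.0, 5.0]  # illustrative values, not taken from the paper

values, greedy = value_iteration(transition, reward)
```

In this toy Path the greedy policy is to moveRight in the first three states, since the discounted reward at the end outweighs the penalty in the second state. Replacing `reward` by a distorted version would give the policy of a player with the corresponding distortion parameters.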
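In this instantiation, we consider the game $\mathcal{M}_\theta = (\mathcal{S}, \mathcal{A}, T_\theta, R_\theta, \gamma)$, and design the reward function $R_\theta$ and the transition probabilities $T_\theta$ of the game by learning the parameters $\theta$. We assume that each player $z$ behaves according to a noisy near-optimal policy $\pi$ defined by value iteration on an MDP with distorted reward $v(R_\theta)$. The game play has $K$ steps in total. The interaction model $\Psi(z, \mathcal{M}_\theta)$ is then the distribution of the trajectories $x = (s_1, a_1, \ldots, s_K, a_K)$, where the player always starts at $s_1$, and at each state $s_t$, we sample an action $a_t$ using the Gumbel-max trick Gumbel (1954) with a noise parameter $\lambda$. Specifically, let the probability over actions be $\pi(\cdot \mid s_t)$ and $g_a$ be independent samples from $\mathrm{Gumbel}(0, 1)$; a sampled action can be defined as

$$a_t = \arg\max_{a \in \mathcal{A}} \big[\log \pi(a \mid s_t) + \lambda\, g_a\big]. \tag{10}$$

When $\lambda = 1$, there is no noise and $a_t$ distributes according to $\pi(\cdot \mid s_t)$. The amount of noise increases as $\lambda$ increases. Similarly, we sample the next state $s_{t+1}$ from $T_\theta(\cdot \mid s_t, a_t)$ using Gumbel-max, with $\lambda$ always set to one.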
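The sampling step can be sketched as follows. This is a minimal illustration, assuming one particular parameterization in which the Gumbel perturbation is scaled by the noise parameter (called `noise` here) before taking the argmax, so that a value of one recovers exact sampling from the policy. The function names and the example policy are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample by inverse transform."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); rng.random() lies in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_action(probs, noise, rng):
    """Return argmax_a [log probs[a] + noise * g_a]; probs must be strictly positive."""
    scores = [math.log(p) + noise * sample_gumbel(rng) for p in probs]
    return max(range(len(probs)), key=scores.__getitem__)


def action_frequencies(probs, noise, n_samples, seed=0):
    """Empirical action distribution over repeated Gumbel-max draws."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(n_samples):
        counts[gumbel_max_action(probs, noise, rng)] += 1
    return [c / n_samples for c in counts]


policy = [0.7, 0.2, 0.1]  # hypothetical action probabilities at one state
exact = action_frequencies(policy, noise=1.0, n_samples=20_000)
noisy = action_frequencies(policy, noise=5.0, n_samples=20_000)
```

Under this parameterization the sampled action follows a distribution proportional to the policy probabilities raised to the power of one over the noise parameter. A value of one therefore reproduces the policy, and larger values flatten it toward uniform, which matches the qualitative behavior described above.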
Our goal in behavior diagnostic game design then reduces to solving the optimization in Eq (3), with the player space $\mathcal{Z}$ consisting of distortion parameters $(\xi_{\text{pos}}, \xi_{\text{neg}})$, the game space parameterized as $\mathcal{M}_\theta$, and an interaction model $\Psi(z, \mathcal{M}_\theta)$ of trajectories generated by noisy value iteration using the distorted reward $v(R_\theta)$.
As discussed in the previous section, for computational tractability, we optimize the variational lower bound from Eq (6) on this mutual information. As the variational distribution $q(z \mid x)$, we use a factored Gaussian, with means parameterized by a Recurrent Neural Network Hochreiter and Schmidhuber (1997). The input at each step is the concatenation of a one-hot (or softmax approximated) encoding of state and action $(s_t, a_t)$. We optimize this variational bound via stochastic gradient descent Kingma and Ba (2014). In order for the objective to be end-to-end differentiable, during trajectory sampling, we use the Gumbel-softmax trick (Jang et al., 2016; Maddison et al., 2017), which uses the softmax function as a continuous approximation to the argmax operator in Eq (10).
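The relaxation replaces the hard argmax with a softmax over the same perturbed scores. Below is a minimal sketch in plain Python, assuming a temperature parameter that controls how close the output is to a one-hot vector; the function name, the temperature values, and the fixed Gumbel samples are hypothetical and used only so that the example is deterministic. No gradients are computed here; in practice an automatic differentiation library would be used.

```python
import math


def gumbel_softmax(probs, gumbels, temperature):
    """Softmax of (log probs + gumbels) / temperature: a relaxed one-hot sample."""
    logits = [(math.log(p) + g) / temperature for p, g in zip(probs, gumbels)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = [0.5, 0.3, 0.2]    # hypothetical action probabilities
gumbels = [0.0, 1.0, 0.0]  # fixed perturbations standing in for Gumbel(0, 1) draws

# The hard Gumbel-max choice picks the index with the largest perturbed score.
scores = [math.log(p) + g for p, g in zip(probs, gumbels)]
hard_choice = max(range(len(scores)), key=scores.__getitem__)

sharp = gumbel_softmax(probs, gumbels, temperature=0.1)
smooth = gumbel_softmax(probs, gumbels, temperature=10.0)
```

At a low temperature the relaxed sample is nearly one-hot on the same index that the hard argmax selects, while at a high temperature it is close to uniform. The relaxed vector can be fed to the recurrent network as the softmax-approximated encoding mentioned above.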
4.1 Learning to Design Games
We learn to design games by maximizing the mutual information objective in Eq (6), with known player prior $p_Z$ and interaction model $\Psi$. We study how the degrees of freedom in games $\mathcal{M}_\theta$ affect the mutual information. In particular, we consider environments that are a Path or a Grid of various sizes. The Path environment has two actions, stay and moveRight, and the Grid has one additional action, moveUp. Besides learning the reward $R_\theta$, we also consider learning part of the transition $T_\theta$. To be more specific, we learn the probability that the action moveRight actually stays in the same state. Therefore moveRight becomes a probabilistic action that at each state $s$,
We experiment with Path of length 6 and Grid of size 3 by 6.
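The probabilistic moveRight action can be represented as a transition table. The following is a minimal sketch for a Path of length 6, assuming that moveRight at the last state keeps the player in place; the per-state stay probabilities are hypothetical placeholders for the learned parameters, and the function name is illustrative.

```python
def path_transition(n_states, stay_prob):
    """Build transition[s][a][s'] for a Path, with a = 0 (stay) and a = 1 (moveRight).

    stay_prob[s] is the probability that moveRight taken at state s
    leaves the player in state s instead of moving to s + 1.
    """
    transition = []
    for s in range(n_states):
        stay_row = [0.0] * n_states
        stay_row[s] = 1.0
        move_row = [0.0] * n_states
        target = min(s + 1, n_states - 1)  # the last state has nowhere to go
        move_row[s] += stay_prob[s]
        move_row[target] += 1.0 - stay_prob[s]
        transition.append([stay_row, move_row])
    return transition


# Hypothetical stay probabilities for a Path of length 6.
stay_prob = [0.3, 0.0, 0.5, 0.1, 0.0, 0.2]
transition = path_transition(6, stay_prob)
```

Each row of the table is a valid distribution over next states, so the same structure can be passed to a planner such as value iteration. Learning the transition then amounts to adjusting one number per state.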
The player prior is uniform over .
For the baseline, we manually design the reward $R$ for each state.
Footnote 1: We have also experimented with different manually designed baseline reward functions. Their performances are similar to the presented baselines, both in terms of mutual information loss and player classifier accuracy. The performance of randomly generated rewards is worse than the manually designed baselines.
The intuition behind this design is that the positive reward at the end of the Path (or Grid) will encourage players to explore the environment, while the negative reward in the middle will discourage players that are averse to loss but not affect gain-seeking players, hence eliciting distinctive behavior.
In Table 1, mutual information optimization losses for different settings are listed. The baselines have higher mutual information loss than other learnable settings. When only the reward is learnable, the Grid setting achieves slightly better mutual information than the Path one. However, when both reward and transition are learnable, the Grid setting significantly outperforms the others. This shows that our optimization-based approach can effectively search through the large game space and find the ones that are more informative.
| Baselines: Path (1 x 6) | Baselines: Grid (3 x 6) | Learn Reward Only: Path (1 x 6) | Learn Reward Only: Grid (3 x 6) | Learn Reward and Transition: Path (1 x 6) | Learn Reward and Transition: Grid (3 x 6) |
|---|---|---|---|---|---|
4.2 Player Classification Using Designed Games
To further quantify the effectiveness of learned games, we create downstream classification tasks. In the classification setting, each data instance consists of a player label $y$ and its behavior (i.e. trajectory) $x$. We assume that the player attribute $z$ is sampled from a mixture of Gaussians.
The label $y$ for each player corresponds to the mixture component from which $z$ is drawn, and the trajectory $x$ is sampled from $\Psi(z, \mathcal{M}_{\hat{\theta}})$, where $\mathcal{M}_{\hat{\theta}}$ is a learned game. There are three types (i.e. components) of players, namely Loss-Neutral (), Gain-Seeking (), and Loss-Averse (). We simulate 1000 data instances for training, and 100 each for validation and testing. We use a model similar to the one used for $q(z \mid x)$, except for the last layer, which now outputs a categorical distribution. The optimization is run for 20 epochs and five rounds with different random seeds. The validation set is used to select the model whose test set performance is reported. Mean test accuracy and its standard deviation are reported in Table 2.
| Baselines: Path (1 x 6) | Baselines: Grid (3 x 6) | Learn Reward Only: Path (1 x 6) | Learn Reward Only: Grid (3 x 6) | Learn Reward and Transition: Path (1 x 6) | Learn Reward and Transition: Grid (3 x 6) |
|---|---|---|---|---|---|
| 0.442 (0.056) | 0.482 (0.052) | 0.678 (0.044) | 0.658 (0.066) | 0.686 (0.044) | 0.822 (0.027) |
Similar to the trend in mutual information, baseline methods have the lowest accuracies (0.442 and 0.482), up to about 35% lower in relative terms than the learned games (0.658 or higher). Grid with learned reward and transition outperforms the other methods by a large margin, a relative improvement of about 20% in accuracy over the next-best setting (0.822 versus 0.686).
To get a more concrete understanding of learned games and the differences among them, we visualize two examples below. In Figure 2(a), the learned reward for each state in a Path environment is shown via heatmap. The learned reward is similar to the manually designed one (highest at the end and negative in the middle), but with an important twist: there are also smaller positive rewards at the beginning of the Path. This twist successfully induces distinguishable behavior from Loss-Averse players. The induced policy (Figure 2(b)) and sampled trajectories (Figure 2(c)) are very different between Loss-Averse and Gain-Seeking. However, Loss-Neutral and Loss-Averse players still behave quite similarly in this game.
In Figure 3(c) and 3(d), we show induced policies and sampled trajectories in a Grid environment where both reward and transition are learned. The learned game elicits distinctive behavior from different types of players. Loss-Averse players choose moveUp at the beginning and then always stay. Loss-Neutral players explore along the states in the bottom row, while Gain-Seeking players choose moveUp early on and explore the middle row. The learned reward and transition are visualized in Figure 3(a) and 3(b). The middle row is particularly interesting. Its states have very mixed rewards: some are relatively high and some are the lowest. We conjecture that the possibility of staying in place (when taking moveRight) at some states with high reward (e.g. the first and third states from the left in the middle row) makes Gain-Seeking players behave differently from Loss-Neutral ones.
We also consider the case where the interaction model has noise, as described in Eq (10), when generating trajectories for classification data. In practice, it is unlikely that one interaction model describes all players, since players have a variety of individual differences. Hence it is important to study how effective the learned games are when the interaction model of the downstream task is noisy and deviates from the assumption. In Table 3, classification accuracy on the test set is shown at different noise levels. We consider three designs here: the baseline method defined above, which uses a manually designed reward in Path; a Path environment with learned reward; and a Grid environment with both learned reward and transition. Interestingly, while adding noise decreases the classification performance of learned games, the manually designed game (i.e. the baseline method) achieves higher accuracy in the presence of noise. Nevertheless, the learned Grid outperforms the others, though its relative margin over the next-best design decreases from about 20% to about 12%.
|Path (1 x 6, Baseline)||0.442 (0.056)||0.510 (0.053)||0.482 (0.041)|
|Path (1 x 6, Learn Reward Only)||0.678 (0.044)||0.678 (0.039)||0.650 (0.048)|
|Grid (3 x 6, Learn Reward and Transition)||0.822 (0.027)||0.778 (0.044)||0.730 (0.061)|
In Figure 5, we visualize the trajectories when the noise in the interaction model is . This provides intuition for why the classification performance decreases, as the boundary between the behavior of Loss-Neutral and Gain-Seeking is blurred.
4.3 Ablation Study
Lastly, we consider using different distributions for player prior on , which could be agnostic of the downstream tasks or not. We compare the classification performance when is uniform or biased towards the distribution of player types. We consider two cases: Full and Diagonal. In Full, the player prior is uniform over . In Diagonal, is uniform over the union of and , which is a strict subset of the Full case and arguably closer to the player types distribution in the classification task. In Table 4, we show performance of games learned with Full or Diagonal player prior.
| Method | Environment | Mutual Information Loss (Full) | Mutual Information Loss (Diagonal) | Classification Accuracy (Full) | Classification Accuracy (Diagonal) |
|---|---|---|---|---|---|
| Reward Only | Path | 0.108 | 0.043 | 0.678 (0.044) | 0.658 (0.034) |
| Reward Only | Grid | 0.099 | 0.039 | 0.658 (0.066) | 0.662 (0.060) |
| Reward and Transition | Path | 0.107 | 0.043 | 0.686 (0.044) | 0.668 (0.048) |
| Reward and Transition | Grid | 0.078 | 0.036 | 0.822 (0.027) | 0.712 (0.051) |
Across all methods, using the Diagonal prior achieves lower mutual information loss than using the Full one. However, this trend does not carry over to classification: the Diagonal prior lowers accuracy in three of the four settings, most notably for Grid with learned reward and transition (0.822 with Full versus 0.712 with Diagonal). We visualize the sampled trajectories in Figure 6. As we can see, Loss-Neutral no longer has its own distinctive behavior, unlike with the Full prior (see Figure 3(d)). It seems that the learned game is more likely to overfit the Diagonal prior, which leads to poor generalization on downstream tasks. Therefore, using a player prior agnostic to the downstream task might be preferred.
5 Conclusion and Discussion
We consider designing games for the purpose of distinguishing among different types of players. We propose a general framework and use mutual information to quantify the effectiveness of a game. Compared with games designed by heuristics, our optimization-based designs elicit more distinctive behaviors. Our behavior-diagnostic game design framework can be applied to various applications, with the player space, game space and interaction model instantiated by domain-specific ones. For example, Nielsen et al. (2015) study how to generate games for the purpose of differentiating players using player performance instead of behavior trajectory as the observation space. In addition, we have considered the case when the player traits inferred from their game playing behavior are stationary. However, as pointed out by Tekofsky et al. (2013); Yee et al. (2011); Canossa et al. (2013), there can be complex relationships between players’ in-game and outside-game personality traits. In future work, we look forward to addressing this distribution shift.
W.C. acknowledges the support of Google. L.L. and P.R. acknowledge the support of ONR via N000141812861. Z.L. acknowledges the support of the AI Ethics and Governance Fund. This research was supported in part by a grant from J. P. Morgan.
- Agakov (2004). The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems 16, pp. 201.
- Bechara et al. (1994). Insensitivity to future consequences following damage to human prefrontal cortex. Cognition 50 (1-3), pp. 7–15.
- Buelow and Suhr (2009). Construct validity of the Iowa Gambling Task. Neuropsychology Review 19 (1), pp. 102–114.
- Canossa et al. (2013). Give me a reason to dig Minecraft and psychology of motivation. In Conference on Computational Intelligence in Games (CIG).
- Charness et al. (2013). Experimental methods: eliciting risk preferences. Journal of Economic Behavior & Organization 87, pp. 43–51.
- Chen et al. (2016). InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
- Cohen et al. (1983). A global measure of perceived stress. Journal of Health and Social Behavior, pp. 385–396.
- Crosetto and Filippin (2016). A theoretical and experimental appraisal of four risk elicitation methods. Experimental Economics 19 (3), pp. 613–641.
- Diener et al. (2010). New well-being measures: short scales to assess flourishing and positive and negative feelings. Social Indicators Research 97 (2), pp. 143–156.
- Dombrovski et al. (2010). Reward/punishment reversal learning in older suicide attempters. American Journal of Psychiatry 167 (6), pp. 699–707.
- Gumbel (1954). Statistical theory of extreme values and some practical applications: a series of lectures. Vol. 33, US Government Printing Office.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Howard (1960). Dynamic programming and Markov processes.
- Jang et al. (2016). Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.
- Kahneman and Tversky (1979). Prospect theory: an analysis of decision under risk. Econometrica 47 (2), pp. 263–292.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kroenke et al. (2001). The PHQ-9: validity of a brief depression severity measure. Journal of General Internal Medicine 16 (9), pp. 606–613.
- Levine (2018). Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909.
- Maddison et al. (2017). The concrete distribution: a continuous relaxation of discrete random variables. In International Conference on Learning Representations.
- McGuire and Kable (2012). Decision makers calibrate behavioral persistence on the basis of time-interval experience. Cognition 124 (2), pp. 216–226.
- Moustafa et al. (2008). A role for dopamine in temporal decision making and reward maximization in parkinsonism. Journal of Neuroscience 28 (47), pp. 12294–12304.
- Nielsen et al. (2015). Towards generating arcade game rules with VGDL. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pp. 185–192.
- Ratliff et al. (2017). Risk-sensitive inverse reinforcement learning via gradient methods. arXiv preprint arXiv:1703.09842.
- Russell (1996). UCLA Loneliness Scale (version 3): reliability, validity, and factor structure. Journal of Personality Assessment 66 (1), pp. 20–40.
- Sutton and Barto (2018). Reinforcement learning: an introduction. MIT Press.
- Tekofsky et al. (2013). PsyOps: personality assessment through gaming behavior. In International Conference on the Foundations of Digital Games.
- Yee et al. (2011). Introverted elves & conscientious gnomes: the expression of personality in World of Warcraft. In Conference on Human Factors in Computing Systems (CHI).
- Ziebart (2010). Modeling purposeful adaptive behavior with the principle of maximum causal entropy.