Game Design for Eliciting Distinguishable Behavior

Fan Yang, et al. ∙ Carnegie Mellon University ∙ December 12, 2019

The ability to infer latent psychological traits from human behavior is key to developing personalized human-interacting machine learning systems. Approaches to infer such traits range from surveys to manually-constructed experiments and games. However, these traditional games are limited because they are typically designed based on heuristics. In this paper, we formulate the task of designing behavior diagnostic games that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. Our framework is instantiated by using prospect theory to model varying player traits, and Markov Decision Processes to parameterize the games. We validate our approach empirically, showing that our designed games can successfully distinguish among players with different traits, outperforming manually-designed ones by a large margin.

1 Introduction

Human behavior can vary widely across individuals. For instance, due to varying risk preferences, some people arrive extremely early at an airport, while others arrive at the last minute. Being able to infer these latent psychological traits, such as risk preferences or discount factors for future rewards, is of broad multi-disciplinary interest, within psychology, behavioral economics, as well as machine learning. As machine learning finds broader societal usage, understanding users’ latent preferences is crucial to personalizing these data-driven systems to individual users.

In order to infer such psychological traits, which require cognitive rather than physiological assessment (e.g., blood tests), we need an interactive environment to engage users and elicit their behavior. Approaches to do so have ranged from questionnaires Cohen et al. (1983); Kroenke et al. (2001); Diener et al. (2010); Russell (1996) to games Bechara et al. (1994); Dombrovski et al. (2010); Moustafa et al. (2008); McGuire and Kable (2012) that involve planning and decision making. It is this latter approach of games that we consider in this paper. However, there has been some recent criticism of such manually-designed games Buelow and Suhr (2009); Charness et al. (2013); Crosetto and Filippin (2016). In particular, a game is said to be effective, or behavior diagnostic, if the differing latent traits of players can be accurately inferred from their game play behavior. Manually-designed games, however, are typically specified using heuristics that may not be reliable or efficient for distinguishing human traits given game play behavior.

As a motivating example, consider a game environment where the player can choose to stay or moveRight on a Path. Each state on the Path has a reward, which the player accumulates as they move to that state. Suppose players have different preferences (e.g., some might magnify positive rewards and not care much about negative rewards, while others might be the opposite), but are otherwise rational, so that they choose optimal strategies given their respective preferences. If we want to use this game to tell such players apart, how should we assign a reward to each state in the Path? Heuristically, one might suggest placing a positive reward to lure gain-seeking players and a negative reward to discourage loss-averse ones, as shown in Figure 1(a). However, as shown in Figures 1(b) and 2(a), the induced behavior (either policies or sampled trajectories) is similar for players with different loss preferences, and consequently not helpful for distinguishing them based on their game play behavior. In Figure 1(c), an alternative reward design is shown, which elicits more distinguishable behavior (see Figures 1(d) and 2(b)). This example illustrates that it is nontrivial to design effective games based on intuitive heuristics, and a more systematic approach is needed.

(a) Reward designed by heuristics
(b) Policies by different players (induced by the reward in Figure 1(a))
(c) Reward designed by optimization
(d) Policies by different players (induced by the reward in Figure 1(c))
Figure 1: Comparing reward designed by heuristics and by optimization. The game is a 6-state Markov Decision Process. Each state is represented as a square (see 1(a) or 1(c)), and the player can choose stay or moveRight. The goal is to design a reward for each state such that different types of players (Loss-Neutral, Gain-Seeking, Loss-Averse) behave differently. We show the reward designed by heuristics in 1(a) and by optimization in 1(c). Using these rewards, the policies of the different players are shown on the right. For each player, its policy specifies the probability of taking an action (stay or moveRight) at each state. For example, Loss-Neutral's policy in 1(d) shows that it is more likely to choose stay than moveRight at the first (i.e., left-most) state, while at the second to fifth states, choosing moveRight has a higher probability.
(a) Sampled trajectories using the policies in Figure 1(b)
(b) Sampled trajectories using the policies in Figure 1(d)
Figure 2: Comparing sampled trajectories using policies induced by different rewards. To further visualize how each type of player behaves given different rewards, we sample trajectories using their induced policies. Given the reward designed by heuristics (Figure 1(a)), all players behave similarly by traversing all the states (see 2(a)). However, given the reward designed by optimization (Figure 1(c)), Gain-Seeking and Loss-Averse players behave differently. In particular, Loss-Averse chooses stay most of the time (see 2(b)), since the first state has a relatively large reward. Hence, the reward designed by optimization is more effective at eliciting distinctive behaviors.

In this work, we formalize this task of behavior diagnostic game design, introducing a general framework consisting of a player space, game space, and interaction model. We use mutual information to quantitatively capture game effectiveness, and present a practical algorithm that optimizes a variational lower bound. We instantiate our framework by setting the player space using prospect theory Kahneman and Tversky (1979), and setting the game space and interaction model using Markov Decision Processes Howard (1960). Empirically, our quantitative optimization-based approach designs games that are more effective at inferring players’ latent traits, outperforming manually-designed games by a large margin. In addition, we study how the assumptions in our instantiation affect the effectiveness of the designed games, showing that they are robust to modest misspecification of the assumptions.

2 Behavior Diagnostic Game Design

We consider the problem of designing interactive games that are informative in the sense that a player's type can be inferred from their play. A game-playing process contains three components: a player $z$, a game $g$, and an interaction model $\mathcal{M}$. Here, we assume each player (represented by their latent trait) lies in some player space $\mathcal{Z}$. We also denote $Z$ as the random variable corresponding to a randomly selected player from $\mathcal{Z}$ with respect to some prior (e.g., uniform) distribution $p(z)$ over $\mathcal{Z}$. Further, we assume there is a family of parameterized games $\{g_\theta : \theta \in \Theta\}$. Given a player $z$ and a game $g_\theta$, the interaction model $\mathcal{M}$ describes how a behavioral observation $x$ from some observation space $\mathcal{X}$ is generated. Specifically, each round of game play generates a behavioral observation as $x \sim \mathcal{M}(x \mid z, g_\theta)$, where the interaction model is some distribution over the observation space $\mathcal{X}$. In this work, we assume $\mathcal{Z}$ and $\mathcal{M}$ are fixed and known. Our goal is to design a game $g_\theta$ such that the generated behavioral observations are most informative for inferring the player $z$.
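To make the framework concrete, the following sketch (illustrative only; names such as sample_player, Game, and interaction_model are our own placeholders, not the authors' code) shows the three components as simple Python objects. Section 3 instantiates them with prospect-theory players and an MDP.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_player(n=1):
    """Prior p(z): here an assumed uniform draw of two latent trait parameters."""
    return rng.uniform(0.0, 2.0, size=(n, 2))

class Game:
    """A game g_theta: in the simplest case, a per-state reward vector."""
    def __init__(self, rewards):
        self.rewards = np.asarray(rewards, dtype=float)

def interaction_model(z, game, horizon=8):
    """M(x | z, g_theta): placeholder that returns a random state trajectory.

    A concrete instantiation (prospect-theory players planning in an MDP)
    is described in Section 3."""
    n_states = len(game.rewards)
    return rng.integers(0, n_states, size=horizon)

# One round of game play: sample a player, then sample an observation.
z = sample_player()[0]
g = Game(rewards=[0.0, -1.0, 0.0, 0.0, 0.0, 1.0])
x = interaction_model(z, g)
```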

2.1 Maximizing Mutual Information

Our problem formulation introduces a probabilistic model over the players (as specified by the prior distribution $p(z)$) and the behavioral observations (as specified by $\mathcal{M}$), so that $p(z, x) = p(z)\,\mathcal{M}(x \mid z, g_\theta)$. Our goal can then be interpreted as maximizing the information about $Z$ contained in $X$, which can be captured by the mutual information between $Z$ and $X$:

$I_\theta(Z; X) = H(Z) - H(Z \mid X)$ (1)
$\qquad\quad\;\, = \mathbb{E}_{z \sim p(z),\, x \sim \mathcal{M}(x \mid z, g_\theta)}\left[\log p(z \mid x)\right] + H(Z),$ (2)

so that the mutual information is a function of the game parameters $\theta$.

Definition 2.1.

(Behavior Diagnostic Game Design) Given a player space $\mathcal{Z}$ with prior $p(z)$, a family of parameterized games $\{g_\theta : \theta \in \Theta\}$, and an interaction model $\mathcal{M}$, our goal is to find:

$\theta^\star = \arg\max_{\theta \in \Theta} I_\theta(Z; X).$ (3)

2.2 Variational Mutual Information Maximization

It is difficult to directly optimize the mutual information objective in Eq. (2), as it requires access to the posterior $p(z \mid x)$, which does not have a closed analytical form. Following the derivations in Chen et al. (2016) and Barber and Agakov (2004), we opt to maximize a variational lower bound of the mutual information objective. Letting $q(z \mid x)$ denote any variational distribution that approximates $p(z \mid x)$, and $H(Z)$ denote the marginal entropy of $Z$, we can bound the mutual information as:

$I_\theta(Z; X) = \mathbb{E}_{x}\left[\mathbb{E}_{z \sim p(z \mid x)}[\log p(z \mid x)]\right] + H(Z)$ (4)
$\qquad\quad\;\, = \mathbb{E}_{x}\left[D_{\mathrm{KL}}\!\left(p(\cdot \mid x) \,\|\, q(\cdot \mid x)\right) + \mathbb{E}_{z \sim p(z \mid x)}[\log q(z \mid x)]\right] + H(Z)$ (5)
$\qquad\quad\;\, \geq \mathbb{E}_{z \sim p(z),\, x \sim \mathcal{M}(x \mid z, g_\theta)}\left[\log q(z \mid x)\right] + H(Z),$ (6)

so that the expression in Eq. (6) forms a variational lower bound on the mutual information $I_\theta(Z; X)$.
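A minimal sketch of how the bound in Eq. (6) can be estimated by Monte Carlo, assuming a differentiable simulator of the interaction model (simulate) and a network q_net that outputs the mean and log standard deviation of a factored Gaussian q(z|x). Both names, and the uniform prior range, are placeholders rather than the authors' code.

```python
import torch

def mi_lower_bound(q_net, simulate, n_samples=256, z_dim=2):
    """Monte Carlo estimate of E_{z ~ p(z), x ~ M(x|z,g_theta)}[log q(z|x)]."""
    # Sample players z from an assumed uniform prior over [0, 2]^z_dim.
    z = torch.rand(n_samples, z_dim) * 2.0
    # Roll out the (differentiable) interaction model to get observations x.
    x = simulate(z)
    # q(z|x): a factored Gaussian whose mean / log-std are predicted from x.
    mean, log_std = q_net(x)
    q = torch.distributions.Normal(mean, log_std.exp())
    log_q = q.log_prob(z).sum(dim=-1)      # log q(z|x), summed over dimensions
    # H(Z) is constant w.r.t. the game parameters theta, so it is dropped.
    return log_q.mean()

# Training maximizes this bound (minimizes its negative) with a stochastic
# gradient optimizer over the game parameters theta and the weights of q_net.
```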

3 Instantiation: Prospect Theory based MDP Design

Our framework in Section 2 provides a systematic view of behavior diagnostic game design, and each of its components can be chosen based on context and application. We present one instantiation by specifying the player space $\mathcal{Z}$, the game $g_\theta$, and the interaction model $\mathcal{M}$. For the player space $\mathcal{Z}$, we use prospect theory Kahneman and Tversky (1979) to describe how players perceive or distort values. We model the game $g_\theta$ as a Markov Decision Process Howard (1960). Finally, a (noisy) value iteration is used to model players' planning and decision-making, which is part of the interaction model $\mathcal{M}$. In the next subsection, we provide brief background on these key ingredients.

3.1 Background

Prospect Theory

Prospect theory Kahneman and Tversky (1979) describes the phenomenon that different people can perceive the same numerical values differently. For example, people who are averse to loss, e.g., those for whom avoiding a $5 loss matters more than winning $5, magnify negative rewards or penalties. Following Ratliff et al. (2017), we use the following distortion function to describe how people shrink or magnify numerical values,

$f_z(r) = \begin{cases} z^{+}\,(r - r_0) & \text{if } r \ge r_0 \\ z^{-}\,(r - r_0) & \text{if } r < r_0, \end{cases}$ (7)

where $r_0$ is the reference point that people compare numerical values against, and $z^{+}$ and $z^{-}$ are the amounts of distortion applied to the positive and negative parts of the reward with respect to the reference point.

We use this framework of prospect theory to specify our player space. Specifically, we represent a player by their personalized distortion parameters, so that $z = (z^{+}, z^{-}) \in \mathcal{Z}$. In this work, unless we specify otherwise, we assume that the reference point $r_0$ is set to zero. Given these distortion parameters, the players perceive a distorted version $f_z(R(s))$ of any reward $R(s)$ in the game, as we detail in the discussion of the interaction model subsequently.
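As an illustration, the sketch below implements a piecewise-linear distortion with reference point r0 = 0, where z_plus scales gains and z_minus scales losses. The functional form and parameter names here are our assumptions for exposition, not a verbatim transcription of the paper's distortion function.

```python
import numpy as np

def distort(reward, z_plus, z_minus, r0=0.0):
    """Reward as perceived by a player with distortion parameters z = (z_plus, z_minus)."""
    reward = np.asarray(reward, dtype=float)
    return np.where(reward >= r0,
                    z_plus * (reward - r0),
                    z_minus * (reward - r0))

# Example: a loss-averse player (z_minus > z_plus) magnifies the -1 penalty
# and shrinks the +0.5 gain.
print(distort([-1.0, 0.5], z_plus=0.8, z_minus=1.8))   # [-1.8  0.4]
```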

Markov Decision Process

A Markov Decision Process (MDP) is defined by a tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. For each state-action pair $(s, a)$, $T(\cdot \mid s, a)$ is a probability distribution over the next state. $R$ specifies a reward function, and $\gamma$ is a discount factor. We assume that both states and actions are discrete and finite. For all $s \in \mathcal{S}$, a policy defines a distribution over actions to take at state $s$; a policy for an MDP is denoted as $\pi(a \mid s)$.

Value Iteration

Given a Markov Decision Process $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, value iteration is an algorithm that can be written as a simple update operation, which combines one sweep of policy evaluation and one sweep of policy improvement Sutton and Barto (2018),

$V(s) \leftarrow \max_{a} \sum_{s'} T(s' \mid s, a)\left[R(s') + \gamma V(s')\right],$ (8)

and computes a value function $V(s)$ for each state $s$. A probabilistic policy can be defined, similar to the maximum entropy policy Ziebart (2010); Levine (2018), based on the value function, i.e.,

$\pi(a \mid s) \propto \exp\!\left(\sum_{s'} T(s' \mid s, a)\left[R(s') + \gamma V(s')\right]\right).$ (9)

Value iteration converges to an optimal policy for discounted finite MDPs Sutton and Barto (2018).
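The sketch below is an illustrative NumPy implementation of the update in Eq. (8) and the softmax policy in Eq. (9), assuming per-state rewards R(s) and a transition tensor T[s, a, s'] as used in this instantiation; it is not the authors' code.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, n_iters=100):
    """T: (S, A, S) transition probabilities; R: (S,) per-state rewards."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Q[s, a] = sum_{s'} T(s'|s,a) * (R(s') + gamma * V(s'))
        Q = np.einsum("sap,p->sa", T, R + gamma * V)
        V = Q.max(axis=1)
    return V, Q

def softmax_policy(Q):
    """pi(a|s) proportional to exp(Q(s,a))."""
    Q = Q - Q.max(axis=1, keepdims=True)   # subtract max for numerical stability
    expQ = np.exp(Q)
    return expQ / expQ.sum(axis=1, keepdims=True)
```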

3.2 Instantiation

In this instantiation, we consider the game $g_\theta$ and design the reward function $R$ and the transition probabilities $T$ of the game by learning the parameters $\theta$. We assume that each player behaves according to a noisy near-optimal policy $\pi_z$ defined by value iteration on an MDP with distorted reward $f_z(R)$. The game play has $N$ steps in total. The interaction model $\mathcal{M}$ is then the distribution of the trajectories $x = (s_1, a_1, s_2, a_2, \ldots, s_N)$, where the player always starts at the same initial state $s_1$, and at each state $s_t$, we sample an action using the Gumbel-max trick Gumbel (1954) with a noise parameter $\epsilon$. Specifically, let the probability over actions be $\pi = \pi_z(\cdot \mid s_t)$ and $g_1, \ldots, g_{|\mathcal{A}|}$ be independent samples from $\mathrm{Gumbel}(0, 1)$; a sampled action can be defined as

$a_t = \arg\max_{i}\left(\log \pi_i + \epsilon\, g_i\right).$ (10)

When $\epsilon = 1$, there is no noise and $a_t$ is distributed according to $\pi$. The amount of noise increases as $\epsilon$ increases. Similarly, we sample the next state from $T(\cdot \mid s_t, a_t)$ using Gumbel-max, with $\epsilon$ always set to one.
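A small sketch of this noisy Gumbel-max sampler, under the reading that ε = 1 recovers exact sampling from π and larger ε flattens the action distribution toward uniform; this is illustrative code, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(probs, epsilon=1.0):
    """Sample an index from `probs`, with extra noise controlled by epsilon."""
    gumbels = rng.gumbel(loc=0.0, scale=1.0, size=len(probs))
    return int(np.argmax(np.log(probs) + epsilon * gumbels))

# With epsilon = 1 the empirical frequencies match pi; with larger epsilon the
# samples drift toward a uniform distribution over actions.
pi = np.array([0.7, 0.2, 0.1])
samples = [gumbel_max_sample(pi, epsilon=1.0) for _ in range(10000)]
print(np.bincount(samples, minlength=3) / 10000)   # roughly [0.7, 0.2, 0.1]
```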

Our goal in behavior diagnostic game design then reduces to solving the optimization in Eq. (3), with the player space $\mathcal{Z}$ consisting of distortion parameters $z = (z^{+}, z^{-})$, the game space parameterized as $\theta = (R, T)$, and an interaction model of trajectories generated by noisy value iteration using the distorted reward $f_z(R)$.

As discussed in the previous section, for computational tractability, we optimize the variational lower bound from Eq. (6) on this mutual information. As the variational distribution $q(z \mid x)$, we use a factored Gaussian, with means parameterized by a Recurrent Neural Network Hochreiter and Schmidhuber (1997). The input at each step is the concatenation of a one-hot (or softmax-approximated) encoding of the state and action. We optimize this variational bound via stochastic gradient descent Kingma and Ba (2014). In order for the objective to be end-to-end differentiable, during trajectory sampling, we use the Gumbel-softmax trick (Jang et al., 2016; Maddison et al., 2017), which uses the softmax function as a continuous approximation to the argmax operator in Eq. (10).
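For intuition, the following sketch shows the Gumbel-softmax relaxation that replaces the hard argmax in Eq. (10) during training, using PyTorch's gumbel_softmax; the temperature value is an illustrative choice, not a reported hyperparameter.

```python
import torch
import torch.nn.functional as F

def soft_sample_action(action_probs, temperature=0.5):
    """Return a relaxed one-hot action vector that gradients can flow through."""
    logits = torch.log(action_probs + 1e-20)
    return F.gumbel_softmax(logits, tau=temperature, hard=False)

# Example: the relaxed sample is a simplex vector that can be fed into a soft
# (probability-weighted) transition instead of indexing a single next state.
probs = torch.tensor([0.7, 0.2, 0.1], requires_grad=True)
soft_action = soft_sample_action(probs)
print(soft_action, soft_action.sum())   # components sum to 1
```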

4 Experiments

4.1 Learning to Design Games

We learn to design games by maximizing the mutual information objective in Eq. (6), with known player prior $p(z)$ and interaction model $\mathcal{M}$. We study how the degrees of freedom in the game $g_\theta$ affect the mutual information. In particular, we consider environments that are Paths or Grids of various sizes. The Path environment has two actions, stay and moveRight, and the Grid has one additional action, moveUp. Besides learning the reward $R$, we also consider learning part of the transition $T$. To be more specific, we learn the probability $\delta_s$ that the action moveRight actually stays in the same state, so that moveRight becomes a probabilistic action with, at each state $s$,

$T(s' \mid s, \texttt{moveRight}) = \begin{cases} \delta_s & \text{if } s' = s \\ 1 - \delta_s & \text{if } s' = \mathrm{right}(s) \\ 0 & \text{otherwise.} \end{cases}$ (11)
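The sketch below shows one way to build this partially learnable transition for a Path of n states, given per-state stay probabilities delta[s] (e.g., sigmoids of unconstrained learnable parameters); the action indexing and helper names are illustrative assumptions.

```python
import numpy as np

def path_transitions(delta):
    """Return T with shape (S, 2, S) for actions [stay, moveRight]."""
    n = len(delta)
    T = np.zeros((n, 2, n))
    for s in range(n):
        T[s, 0, s] = 1.0                        # stay: deterministic
        right = min(s + 1, n - 1)               # last state has no right neighbor
        T[s, 1, s] += delta[s]                  # moveRight: stay put with prob delta[s]
        T[s, 1, right] += 1.0 - delta[s]        # ... otherwise move one step right
    return T

T = path_transitions(delta=np.full(6, 0.3))
assert np.allclose(T.sum(axis=-1), 1.0)         # every row is a valid distribution
```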

We experiment with a Path of length 6 and a Grid of size 3 by 6. The player prior $p(z)$ is uniform over the player space $\mathcal{Z}$. For the baseline, we manually design the reward for each state.¹ The intuition behind this design is that the positive reward at the end of the Path (or Grid) encourages players to explore the environment, while the negative reward in the middle discourages players that are averse to loss but does not affect gain-seeking players, hence eliciting distinctive behavior.

¹We have also experimented with different manually designed baseline reward functions. Their performance is similar to the presented baselines, both in terms of mutual information loss and player classification accuracy. The performance of randomly generated rewards is worse than the manually designed baselines.

In Table 1, the mutual information optimization losses for different settings are listed. The baselines have higher mutual information loss than all learnable settings. When only the reward is learnable, the Grid setting achieves slightly better mutual information than the Path one. However, when both reward and transition are learnable, the Grid setting significantly outperforms the others. This shows that our optimization-based approach can effectively search through the large game space and find games that are more informative.

                           Baselines                Learn Reward Only        Learn Reward and Transition
                           Path (1x6)   Grid (3x6)  Path (1x6)   Grid (3x6)  Path (1x6)   Grid (3x6)
Mutual information loss    0.111        0.115       0.108        0.099       0.107        0.078
Table 1: Mutual Information Optimization Loss (lower is better)

4.2 Player Classification Using Designed Games

To further quantify the effectiveness of the learned games, we create downstream classification tasks. In the classification setting, each data instance consists of a player label $y$ and the player's behavior (i.e., trajectory) $x$. We assume that the player attribute $z$ is sampled from a mixture of Gaussians,

$z \sim \sum_{k=1}^{3} w_k\, \mathcal{N}(\mu_k, \Sigma_k).$ (12)

The label $y$ for each player corresponds to the mixture component $k$, and the trajectory $x$ is sampled from $\mathcal{M}(x \mid z, g_{\theta^\star})$, where $g_{\theta^\star}$ is a learned game. There are three types (i.e., components) of players, namely Loss-Neutral ($z^{+} \approx z^{-}$), Gain-Seeking ($z^{+} > z^{-}$), and Loss-Averse ($z^{-} > z^{+}$). We simulate 1000 data instances for training, and 100 each for validation and test. We use a model similar to the one used for $q(z \mid x)$, except for the last layer, which now outputs a categorical distribution over player types. The optimization is run for 20 epochs and repeated for five rounds with different random seeds. The validation set is used for model selection, and mean test accuracy with its standard deviation is reported in Table 2.
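For concreteness, the following sketch generates labeled classification data as described above: a component (player type) is drawn, distortion parameters are sampled from that Gaussian component, and a trajectory is rolled out in the learned game. The component means, covariance, and play_game function are stand-ins, not the paper's reported values.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n, means, cov, play_game):
    """Return (labels, trajectories) for n simulated players."""
    labels, trajectories = [], []
    for _ in range(n):
        k = rng.integers(len(means))                  # player type: mixture component index
        z = rng.multivariate_normal(means[k], cov)    # distortion parameters for this player
        labels.append(k)
        trajectories.append(play_game(z))             # x ~ M(x | z, g_theta*)
    return np.array(labels), trajectories

# Illustrative components for Loss-Neutral, Gain-Seeking, Loss-Averse players
# (assumed values, chosen only to match the qualitative ordering z+ vs z-).
means = np.array([[1.0, 1.0], [1.5, 0.5], [0.5, 1.5]])
cov = 0.01 * np.eye(2)
```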

            Baselines                       Learn Reward Only               Learn Reward and Transition
            Path (1x6)     Grid (3x6)       Path (1x6)     Grid (3x6)       Path (1x6)     Grid (3x6)
Accuracy    0.442 (0.056)  0.482 (0.052)    0.678 (0.044)  0.658 (0.066)    0.686 (0.044)  0.822 (0.027)
Table 2: Classification Task Accuracy (mean and standard deviation over five runs)

Similar to the trend in mutual information, the baseline games have the lowest accuracies, which are about 35% lower than those of any learned game. The Grid with learned reward and transition outperforms the other methods by a large margin of about a 20% improvement in accuracy.

To get a more concrete understanding of the learned games and the differences among them, we visualize two examples below. In Figure 3(a), the learned reward for each state in a Path environment is shown as a heatmap. The learned reward is similar to the manually designed one (highest at the end and negative in the middle) but with an important twist: there are also smaller positive rewards at the beginning of the Path. This twist successfully induces distinguishable behavior from Loss-Averse players. The induced policies (Figure 3(b)) and sampled trajectories (Figure 3(c)) are very different between Loss-Averse and Gain-Seeking. However, Loss-Neutral and Loss-Averse players still behave quite similarly in this game.

(a) Learned reward for each state in a 1 x 6 Path
(b) Induced policies by different types of players, using the reward in 3(a)
(c) Sampled trajectories by different types of players, using the policies in 3(b)
Figure 3: A 1 x 6 Path with learned reward. Gain-Seeking and Loss-Averse behave distinctively.

In Figures 4(c) and 4(d), we show the induced policies and sampled trajectories in a Grid environment where both the reward and transition are learned. The learned game elicits distinctive behavior from different types of players. Loss-Averse players choose moveUp at the beginning and then always stay. Loss-Neutral players explore along the states in the bottom row, while Gain-Seeking players choose moveUp early on and explore the middle row. The learned reward and transition are visualized in Figures 4(a) and 4(b). The middle row is particularly interesting: its states have very mixed rewards, some relatively high and some the lowest. We conjecture that the possibility of staying in place (when taking moveRight) at some states with high reward (e.g., the first and third states from the left in the middle row) makes Gain-Seeking behave differently from Loss-Neutral.

(a) Learned reward for each state in a 3 x 6 Grid
(b) Learned transition at each state
(c) Induced policies by different types of players, using the reward in 4(a) and transition in 4(b)
(d) Sampled trajectories by different types of players, using the policies in 4(c) and transition in 4(b)
Figure 4: A 3 x 6 Grid with learned reward and transition (see 4(a) and 4(b)) elicits distinguishable behaviors from different types of players.

We also consider the case where the interaction model has noise, as described in Eq. (10), when generating trajectories for the classification data. In practice, it is unlikely that one interaction model describes all players, since players have a variety of individual differences. Hence it is important to study how effective the learned games remain when the downstream interaction model is noisy and deviates from our assumptions. In Table 3, classification accuracy on the test set is shown at different noise levels. We consider three designs here: the baseline Path with manually designed reward (as defined above), a Path environment with learned reward, and a Grid environment with both learned reward and transition. Interestingly, while adding noise decreases the classification performance of the learned games, the manually designed game (i.e., the baseline) achieves higher accuracy in the presence of noise. Nevertheless, the learned Grid outperforms the others, though the margin decreases from 20% to 12%.

                                            Noise level (increasing from left to right)
Path (1 x 6, Baseline)                      0.442 (0.056)   0.510 (0.053)   0.482 (0.041)
Path (1 x 6, Learn Reward Only)             0.678 (0.044)   0.678 (0.039)   0.650 (0.048)
Grid (3 x 6, Learn Reward and Transition)   0.822 (0.027)   0.778 (0.044)   0.730 (0.061)
Table 3: Classification Accuracy at Different Noise Levels

In Figure 5, we visualize the sampled trajectories when noise is added to the interaction model. This provides intuition for why the classification performance decreases: the boundary between the behavior of Loss-Neutral and Gain-Seeking is blurred.

Figure 5: Sampled trajectories with noise

4.3 Ablation Study

Lastly, we consider using different distributions for the player prior $p(z)$, which may or may not be agnostic to the downstream task. We compare classification performance when $p(z)$ is uniform over the full player space versus biased towards the distribution of player types. We consider two cases: Full and Diagonal. In Full, the player prior is uniform over the entire player space $\mathcal{Z}$. In Diagonal, $p(z)$ is uniform over the union of two subregions of $\mathcal{Z}$, a strict subset of the Full support that is arguably closer to the player-type distribution in the classification task. In Table 4, we show the performance of games learned with the Full or Diagonal player prior.

Method Mutual Information Loss Classification Accuracy
Full Diagonal Full Diagonal
Reward Only Path 0.108 0.043 0.678 (0.044) 0.658 (0.034)
Reward Only Grid 0.099 0.039 0.658 (0.066) 0.662 (0.060)
Reward and Transition Path 0.107 0.043 0.686 (0.044) 0.668 (0.048)
Reward and Transition Grid 0.078 0.036 0.822 (0.027) 0.712 (0.051)
Table 4: Comparison of Learned Games with Different Player Prior

Across all methods, using the Diagonal prior achieves lower mutual information loss than using the Full one. However, this trend does not carry over to classification: using the Diagonal prior hurts classification accuracy. We visualize the sampled trajectories in Figure 6. As we can see, Loss-Neutral no longer exhibits its own distinctive behavior, as it does under the Full prior (see Figure 4(d)). The learned game appears to overfit the Diagonal prior, which leads to poor generalization on downstream tasks. Therefore, using a player prior agnostic to the downstream task may be preferable.

Figure 6: Sampled trajectories using Diagonal player prior

5 Conclusion and Discussion

We consider designing games for the purpose of distinguishing among different types of players. We propose a general framework and use mutual information to quantify game effectiveness. Compared with games designed by heuristics, our optimization-based designs elicit more distinctive behaviors. Our behavior-diagnostic game design framework can be applied to various applications, with the player space, game space, and interaction model instantiated by domain-specific choices. For example, Nielsen et al. (2015) study how to generate games for the purpose of differentiating players using player performance, rather than behavior trajectories, as the observation space. In addition, we have considered the case where the player traits inferred from game playing behavior are stationary. However, as pointed out by Tekofsky et al. (2013); Yee et al. (2011); Canossa et al. (2013), there can be complex relationships between players' in-game and outside-game personality traits. In future work, we look forward to addressing this distribution shift.

Acknowledgement

W.C. acknowledges the support of Google. L.L. and P.R. acknowledge the support of ONR via N000141812861. Z.L. acknowledges the support of the AI Ethics and Governance Fund. This research was supported in part by a grant from J. P. Morgan.

References

  • D. Barber and F. Agakov (2004) The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems 16, pp. 201. Cited by: §2.2.
  • A. Bechara, A. R. Damasio, H. Damasio, and S. W. Anderson (1994) Insensitivity to future consequences following damage to human prefrontal cortex. Cognition 50 (1-3), pp. 7–15. Cited by: §1.
  • M. T. Buelow and J. A. Suhr (2009) Construct validity of the Iowa Gambling Task. Neuropsychology Review 19 (1), pp. 102–114. Cited by: §1.
  • A. Canossa, J. B. Martinez, and J. Togelius (2013) Give me a reason to dig: Minecraft and psychology of motivation. In Conference on Computational Intelligence in Games (CIG). Cited by: §5.
  • G. Charness, U. Gneezy, and A. Imas (2013) Experimental methods: eliciting risk preferences. Journal of Economic Behavior & Organization 87, pp. 43–51. Cited by: §1.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180. Cited by: §2.2.
  • S. Cohen, T. Kamarck, and R. Mermelstein (1983) A global measure of perceived stress. Journal of health and social behavior, pp. 385–396. Cited by: §1.
  • P. Crosetto and A. Filippin (2016) A theoretical and experimental appraisal of four risk elicitation methods. Experimental Economics 19 (3), pp. 613–641. Cited by: §1.
  • E. Diener, D. Wirtz, W. Tov, C. Kim-Prieto, D. Choi, S. Oishi, and R. Biswas-Diener (2010) New well-being measures: short scales to assess flourishing and positive and negative feelings. Social Indicators Research 97 (2), pp. 143–156. Cited by: §1.
  • A. Y. Dombrovski, L. Clark, G. J. Siegle, M. A. Butters, N. Ichikawa, B. J. Sahakian, and K. Szanto (2010) Reward/punishment reversal learning in older suicide attempters. American Journal of Psychiatry 167 (6), pp. 699–707. Cited by: §1.
  • E. J. Gumbel (1954) Statistical theory of extreme values and some practical applications: a series of lectures. Vol. 33, US Government Printing Office. Cited by: §3.2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.
  • R. A. Howard (1960) Dynamic programming and Markov processes. Cited by: §1, §3.
  • E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §3.2.
  • D. Kahneman and A. Tversky (1979) Prospect theory: an analysis of decision under risk. Econometrica 47 (2), pp. 263–292. Cited by: §1, §3.1, §3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
  • K. Kroenke, R. L. Spitzer, and J. B. Williams (2001) The PHQ-9: validity of a brief depression severity measure. Journal of General Internal Medicine 16 (9), pp. 606–613. Cited by: §1.
  • S. Levine (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909. Cited by: §3.1.
  • C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In International Conference on Learning Representations. Cited by: §3.2.
  • J. T. McGuire and J. W. Kable (2012) Decision makers calibrate behavioral persistence on the basis of time-interval experience. Cognition 124 (2), pp. 216–226. Cited by: §1.
  • A. A. Moustafa, M. X. Cohen, S. J. Sherman, and M. J. Frank (2008) A role for dopamine in temporal decision making and reward maximization in parkinsonism. Journal of Neuroscience 28 (47), pp. 12294–12304. Cited by: §1.
  • T. S. Nielsen, G. A. Barros, J. Togelius, and M. J. Nelson (2015) Towards generating arcade game rules with vgdl. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pp. 185–192. Cited by: §5.
  • L. J. Ratliff, E. Mazumdar, and T. Fiez (2017) Risk-sensitive inverse reinforcement learning via gradient methods. arXiv preprint arXiv:1703.09842. Cited by: §3.1.
  • D. W. Russell (1996) UCLA loneliness scale (version 3): reliability, validity, and factor structure. Journal of personality assessment 66 (1), pp. 20–40. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §3.1.
  • S. Tekofsky, J. Van Den Herik, P. Spronck, and A. Plaat (2013) Psyops: personality assessment through gaming behavior. In International Conference on the Foundations of Digital Games, Cited by: §5.
  • N. Yee, N. Ducheneaut, L. Nelson, and P. Likarish (2011) Introverted elves & conscientious gnomes: the expression of personality in World of Warcraft. In Conference on Human Factors in Computing Systems (CHI). Cited by: §5.
  • B. D. Ziebart (2010) Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University. Cited by: §3.1.