1 Introduction
As the ability to collect larger volumes of increasingly complex neural data has grown, so has the interest of neuroscientists in paradigms that investigate complex, naturalistic behavior [Pearson et al., 2014]. However, typical analysis methods in systems neuroscience often require that experimenters identify in advance specific events around which to collect and average data across trials. Thus, while much attention has been paid to the development of algorithms for flexibly analyzing large-scale brain data (e.g., [Freeman et al., 2014, Pnevmatikakis et al., 2016, Pandarinath et al., 2017]), much less effort has been devoted to the analysis of rich behavioral data, particularly multi-agent data. Here, we propose a new method for inferring latent goals from multi-agent behavioral data in the form of multivariate time series. Taking as our example movement data from a dynamic two-player task, we show that these latent goals can furnish a parsimonious account of players’ interactions while decoupling intentions from the control needed to execute them.
Most previous work in the decision science and psychology communities has used behavioral task designs that either (a) discretize the action and state spaces in order to reduce the dynamics to a discrete choice problem (e.g., Zheng et al. [2016]) or (b) focus on theoretically motivated rules that give rise to complex behaviors (e.g., Moussaïd et al. [2009]). Here, we focus on a different problem, one that edges closer to natural behavior: tasks that involve a continuous state space, controlled by continuous (joystick) inputs, and simultaneous movement by both players. This type of task presents a set of challenges that have received comparatively little treatment in the neuroscience literature: (1) unconstrained behavior that does not easily separate into discrete categories, (2) a lack of meaningful variables against which to align and average neural data, and (3) complexity that surpasses what simple, interpretable linear models can capture. Our approach strives to solve these problems, drawing inspiration from the inverse reinforcement learning literature [Ng et al., 2000, Abbeel and Ng, 2004, Dvijotham and Todorov, 2010], which attempts to use a model to replicate behaviors that match exemplars generated by an expert demonstrator.
We propose a model that learns policies based on the instantaneous goals of each player. These goals are in turn generated based on an underlying joint value function for the two players. In particular, we are interested in abstracting away details of motor planning (the control problem), focusing instead on modeling the evolution of each player’s goal, represented as a desired onscreen position. These goals are chosen at each instant based on both the current goal and state of the game, and their values drive joystick input via a simple control model. The resulting value functions are both parsimonious as a description of the game’s dynamics and useful as a means of understanding players’ choices, since in cases where goals are onscreen locations, they can be directly visualized. Furthermore, goals and value functions provide useful correlates for neural analysis, including trial likelihood, instantaneous expected value (for each player), and entropy.
The outline of the paper is as follows: In Section 2, we outline our assumptions about the data and control model. In Section 3, we describe our model for the latent goal states based on Gaussian mixtures. In Section 4, we describe our full generative model for the data along with a variational Bayes algorithm for performing approximate inference on the unknown model parameters. In Section 5, we describe our training procedure and compare the behavior of our model with the dynamics exhibited by real players in the same task. Section 6 references related work and details our novel contributions. Section 7 concludes and discusses potential applications of our model.
2 Data and control model
2.1 Task
Our data consist of 10 sessions ( bouts) of a simple two-player “penalty shot” video game played by pairs of rhesus macaques in a neuroscience experiment. Each player used a joystick to control either a “ball” able to move in the x and y directions or a “goalie” only able to move in the y direction (Figure 1(a); movie). The objective of the player controlling the ball was to move it across a goal line at the right-hand side of the screen, while the objective of the goalie was to intercept the ball. Each bout of play ended when one of these two outcomes obtained. This game can alternatively be thought of as a modified version of the Atari game “Pong”, with one player directly controlling the ball and no rebound from the paddle.
2.2 Model Intuition
The aim of our model is to use a set of observed movement trajectories from each player in the game to infer underlying intentions. That is, we assume that at each moment, each player has a latent goal, an onscreen position toward which he intends to move his avatar. These goals drive joystick inputs via a simple control model, with the goals of each player coevolving in time according to some joint dynamics. Control signals are translated by the “physics” of the task code into the onscreen movement of the player avatars, which then inform the players’ choices of goals at the next time step. The result is a generative model capable of producing entirely new behavior that captures the variability present in real opponents. For our case study, the data consist of onscreen trajectories of player-controlled avatars, concatenated into a multivariate time series, but we emphasize that our model is more generally applicable to any multivariate time series in which the assumption of an underlying time- and state-dependent setpoint for a control process is valid.
2.3 Data
Data consisted of onscreen positions of both avatars at regularly sampled time points for each bout of play (Figure 1(a)). We normalize each coordinate such that , and we also consider a state variable for the system on which agents’ behavior might be conditioned. For this work, we restrict this information to consist of instantaneous positions and velocities for both players: . We assume that the training set consists of a large number of bouts of play, with each bout constituting a single realization of each time series. (Footnote 1: In this work, we assume these realizations are iid, though future work might consider modeling changes in the value function of each agent during the course of play.) We further assume that control inputs are translated into onscreen variables via
(1) 
with a scaling between control and velocity, a sigmoid function reflecting saturation of control signals, and elementwise multiplication.
2.4 Control model
We assume that at each moment in time, each agent has a desired goal state (here an onscreen position) and a controller capable of achieving this state by minimizing the error . For simplicity, we assume that this minimization is performed by a proportional integral derivative (PID) controller, which takes the discrete form (for a single control variable)
(2) 
where is the proportional control constant, is the time step, and and are the integration and derivative time scales, respectively. More generally, we can write the change in control signal as a convolution: , with given by the filter defined by in (2). In what follows, we will also assume some Gaussian uncertainty in the control signal
(3) 
while keeping the relationship (1) between and deterministic. Here, we have written the update for a single variable. In the multivariate case, we assume control (and control noise) is independent for each dimension of the combined vector.
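The control model described above can be illustrated with a minimal sketch. It assumes the standard textbook velocity (incremental) form of a discrete PID controller, which may differ in detail from (2), and it uses tanh as the saturating function in (1). The names `pid_filter`, `track_goal`, `kp`, `tau_i`, `tau_d`, `v_max`, and `noise_sd`, and all default values, are hypothetical choices made for illustration only.

```python
import numpy as np

def pid_filter(kp, dt, tau_i, tau_d):
    """Weights on (e_t, e_{t-1}, e_{t-2}) in the velocity-form PID update.

    Assumed textbook form:
    u_t = u_{t-1} + kp * [(1 + dt/tau_i + tau_d/dt) e_t
                          - (1 + 2 tau_d/dt) e_{t-1} + (tau_d/dt) e_{t-2}]
    """
    return kp * np.array([
        1.0 + dt / tau_i + tau_d / dt,
        -(1.0 + 2.0 * tau_d / dt),
        tau_d / dt,
    ])

def track_goal(goal, y0=0.0, v_max=0.05, n_steps=200, kp=0.5, dt=1.0,
               tau_i=50.0, tau_d=0.1, noise_sd=0.0, seed=0):
    """Drive a single coordinate toward a fixed goal and return its path."""
    rng = np.random.default_rng(seed)
    coeffs = pid_filter(kp, dt, tau_i, tau_d)
    y, u = y0, 0.0
    e1 = e2 = 0.0  # errors at the two previous time steps
    path = []
    for _ in range(n_steps):
        e0 = goal - y  # current error between goal and position
        # Change in control is a short filter applied to recent errors,
        # plus optional Gaussian control noise (assumed additive).
        u = u + float(np.dot(coeffs, (e0, e1, e2))) + noise_sd * rng.normal()
        # Saturated control sets the velocity; position update is deterministic.
        y = y + v_max * np.tanh(u)
        e2, e1 = e1, e0
        path.append(y)
    return np.array(path)

path = track_goal(goal=0.5)
```

Because the update depends only on the three most recent errors, the same computation can be written as a convolution of the error series with the three-tap filter returned by `pid_filter`, which is the view taken in the text.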
3 Goal model
For the goal time series, we will assume a Markov process in which new goals are probabilistically selected at each time based on both the current goal and the current state of the system. That is,
(4) 
More specifically, we will assume that at each time point, there exists a function that captures the benefit in setting a particular goal at the next time step based on the current state of the system. That is, we want to increase as often as possible. However, we add as a regularization constraint the idea that there should be some cost to large changes in goals, which we take to be quadratic in the distance between successive points. Explicitly, let
(5) 
Here, is the predicted control and and govern the noise in the observations and goal diffusion, respectively. In what follows, we will also find it useful to define in order to write with
(6) 
This formulation admits multiple interpretations: In analogy with the path integral formulation of stochastic processes, it can be viewed as a model in which the “potential energy” trades off at each time step with the “kinetic energy” . In the limit of small /small /large , one is in either the low-temperature thermodynamic limit or the high-mass classical limit, and is a spatially-varying Gaussian process. Alternatively, in the limit of large , goals are simply chosen independently at each time point. In any case, we have made the strong assumption that dependence of on occurs only through a momentum term, requiring that the “static” term carries most of the weight of explanation.
Unfortunately, for general , the distribution implied by (6) is of the Boltzmann-Gibbs form and is, in general, intractable to sample efficiently. And while it is possible in principle to train a generative network to sample innovations in the time series directly, the goal of our inference is to model itself, since this captures the strategic interplay between the two opponents’ goals.
3.1 as a Gaussian Mixture Model
However, for any defining an absolutely continuous Boltzmann distribution it is possible to approximate the potential energy piece of (6) as a finite mixture of Gaussians [Park and Sandberg, 1991]:
(7) 
with . Now, sampling from is simply a matter of first sampling a mode from the mixture and subsequently taking from
(8) 
which is equivalent to
(9) 
That is, conditioned on previous goal position, , , and , the new goal is simply governed by a normal distribution. This in turn implies that the distribution of with marginalized out is itself a mixture of normals with the same weights . Thus, given that we can efficiently sample , we have a means of sampling entire goal trajectories.
We model each of the parameters , , and of our Gaussian Mixture Model with a multilayer perceptron that takes the state of the system as its input and outputs a parameter for each mode . This allows for our model to learn complex nonlinear relationships between state and value while remaining easy to sample from. Though we have defined the model with a full covariance structure, for the results below, we used a diagonal covariance . Likewise, we use a Gaussian mixture for the initial distribution over goals: .
4 Full model and inference
4.1 Full generative model
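As an illustration of the goal-sampling step from Section 3.1, the following is a minimal sketch under stated assumptions, not a reproduction of Algorithm 1. It assumes diagonal covariances and a scalar variance `lam` for the quadratic goal-change penalty, and it replaces the multilayer perceptrons with a fixed, hypothetical `toy_mixture` function. All function and parameter names are chosen here for illustration.

```python
import numpy as np

def combine_with_previous_goal(g_prev, mean_k, var_k, lam):
    """Multiply one diagonal Gaussian mode N(mean_k, var_k) by the
    quadratic penalty N(g_prev, lam); the product is again Gaussian."""
    precision = 1.0 / var_k + 1.0 / lam
    var = 1.0 / precision
    mean = var * (mean_k / var_k + g_prev / lam)  # precision-weighted mean
    return mean, var

def sample_next_goal(g_prev, weights, means, variances, lam, rng):
    """Pick a mode with the mixture weights, then draw the next goal
    from the normal distribution for that mode."""
    k = rng.choice(len(weights), p=weights)
    mean, var = combine_with_previous_goal(g_prev, means[k], variances[k], lam)
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape), k

def sample_goal_trajectory(g0, mixture_fn, n_steps, lam, rng):
    """Roll the goal process forward; mixture_fn stands in for the
    state-dependent networks that output the mixture parameters."""
    goals = [np.asarray(g0, dtype=float)]
    for _ in range(n_steps):
        weights, means, variances = mixture_fn(goals[-1])
        g_next, _ = sample_next_goal(goals[-1], weights, means, variances, lam, rng)
        goals.append(g_next)
    return np.stack(goals)

def toy_mixture(state):
    """Hypothetical two-mode mixture: aim high or aim low near the goal line."""
    weights = np.array([0.5, 0.5])
    means = np.array([[0.8, 0.5], [0.8, -0.5]])
    variances = np.full((2, 2), 0.01)
    return weights, means, variances

rng = np.random.default_rng(0)
trajectory = sample_goal_trajectory([0.0, 0.0], toy_mixture, n_steps=50, lam=0.05, rng=rng)
```

In this sketch a large `lam` makes the penalty negligible, so each goal is drawn almost independently from the chosen mode, while a small `lam` keeps successive goals close together.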
4.2 Approximate Bayesian inference
Given the observed system trajectory (equivalently, the observed control signal ), we need to make inferences about the underlying goal trajectory , and furthermore, the parameters of the value function and . In general, full Bayesian inference is intractable, but we employ a variational Bayes (VB) approach [Beal, 2003, Wainwright et al., 2008] that approximates this procedure. In brief, VB attempts to minimize the Kullback-Leibler divergence between a known generative model for which inference is intractable, with data and latent variables , and an approximating family of posterior distributions, . This is equivalent to maximizing an evidence lower bound (ELBO) given by
(10)
with the entropy of the approximating posterior. That is, inference is transformed into an optimization problem in the parameters of the approximate posterior , amenable to solution by gradient ascent. In our model, we make use of recently developed “black box” methods Ranganath et al. [2014], Kucukelbir et al. [2015], Rezende et al. [2014], Kingma and Welling [2013] in which the gradients of the ELBO are replaced with stochastic approximations derived by sampling from , avoiding often difficult computations of the expectation in (10). Thus our only requirement for is that it be straightforward to sample from.
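For reference, the standard form of the evidence lower bound can be written out explicitly. The symbols here are generic and assumed for this sketch: x denotes the data, z the latent variables, p(x, z) the generative model, and q(z | x) the approximate posterior.

```latex
\log p(x) \;\ge\; \mathcal{L}(q)
  = \mathbb{E}_{q(z \mid x)}\!\left[\log p(x, z)\right]
  + \mathbb{H}\!\left[q(z \mid x)\right],
\qquad
\log p(x) - \mathcal{L}(q)
  = \mathrm{KL}\!\left(q(z \mid x) \,\|\, p(z \mid x)\right) \;\ge\; 0 .
```

A single-sample stochastic estimate of the kind used by black box methods replaces the expectation with one draw from the approximate posterior:

```latex
\hat{\mathcal{L}} = \log p(x, \tilde{z}) + \mathbb{H}\!\left[q(z \mid x)\right],
\qquad \tilde{z} \sim q(z \mid x) .
```

Since the gap between the log evidence and the bound is exactly the Kullback-Leibler divergence, maximizing the bound over q is the same as minimizing that divergence.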
In our case, we begin with the generative model specified by (4) and detailed in Algorithm 1. For posteriors, we used the variational latent dynamical system (VLDS) from Archer et al. [2015], Gao et al. [2016] for . As in Rezende et al. [2014], Kingma and Welling [2013] we use samples from the posterior model in order to update both the parameters of the generative model (, , , and the weights of the neural networks parameterizing , , and ) and the parameters of the approximate posterior via gradient ascent. (Footnote 2: We do not explicitly model , though is normal at each time step. Since , this is effectively a delta function.)
5 Case study: Penalty Shot Game
5.1 Training
Training was done in an end-to-end fashion using variational inference. We trained using a single-trial black box approximation to the evidence lower bound [Ranganath et al., 2014, Kucukelbir et al., 2015]. This consisted of 1) drawing a random trial, 2) sampling inferred goals from , 3) using that sample to calculate the evidence lower bound (ELBO), and 4) jointly updating the parameters of our generative and posterior models using gradient ascent with ADAM [Kingma and Ba, 2014] with a learning rate of . The model was trained until a smoothed version of the ELBO reached convergence.
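Because single-trial ELBO estimates are noisy, a stopping rule based on a smoothed ELBO needs some smoothing scheme and tolerance. The sketch below shows one possible choice, an exponential moving average compared across two consecutive windows. The smoothing method, the names `ema` and `has_converged`, and the values of `alpha`, `window`, and `tol` are all assumptions for illustration, not the procedure used in this work.

```python
import numpy as np

def ema(values, alpha=0.05):
    """Exponential moving average of a sequence of noisy ELBO estimates."""
    values = np.asarray(values, dtype=float)
    smoothed = np.empty(len(values))
    s = values[0]
    for i, v in enumerate(values):
        s = (1.0 - alpha) * s + alpha * v
        smoothed[i] = s
    return smoothed

def has_converged(elbo_history, alpha=0.05, window=100, tol=1e-3):
    """Return True when the smoothed ELBO has stopped changing.

    Compares the mean of the smoothed ELBO over the last `window`
    iterations with the mean over the preceding `window` iterations.
    """
    if len(elbo_history) < 2 * window:
        return False  # not enough iterations to compare two windows
    s = ema(elbo_history, alpha)
    recent = s[-window:].mean()
    previous = s[-2 * window:-window].mean()
    return bool(abs(recent - previous) <= tol * max(1.0, abs(previous)))
```

In a training loop, `has_converged` would be called on the running list of per-iteration ELBO estimates after each parameter update.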
For the posterior on we used a three-layer neural network with 25 hidden units for each component of the mean and a network of the same structure for each nonzero entry of the Cholesky factor of the covariance as in Archer et al. [2015]. We achieved best performance by setting and including a regularization penalty for . Finally, to deal with nonidentifiability in the goal states arising from the tanh function applied to control, we incorporated a hinge loss on goals located outside the visible screen.
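The hinge loss on off-screen goals can be sketched as follows. Because tanh saturates, goals far beyond the screen edge produce nearly the same control as goals just beyond it, so the data alone cannot pin them down; a penalty that grows linearly with distance outside the screen removes this ambiguity. The screen bounds of [-1, 1], the function name `offscreen_hinge_penalty`, and the `weight` parameter are assumptions made for illustration.

```python
import numpy as np

def offscreen_hinge_penalty(goals, lower=-1.0, upper=1.0, weight=1.0):
    """Hinge penalty that is zero for goals inside [lower, upper] and
    grows linearly with the distance by which a coordinate falls outside."""
    goals = np.asarray(goals, dtype=float)
    excess = np.maximum(0.0, goals - upper) + np.maximum(0.0, lower - goals)
    return weight * float(excess.sum())

# Example: two time points of a 2-D goal; only the second lies off-screen.
penalty = offscreen_hinge_penalty([[0.0, 0.5], [1.2, -1.5]])
```

During training, a term of this kind would be subtracted from the ELBO so that gradient ascent pushes inferred goals back toward the visible region.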
For comparison, we also fit our data using a Deep Latent Gaussian Model (DLGM) [Rezende et al., 2014] as a model of and a Variational Recurrent Neural Network (VRNN) [Chung et al., 2015] as a model of the empirical time series. Our DLGM was also trained jointly with the control model using a Variational Bayes procedure. The generative model used 3 layers, with 25 hidden units, while the posterior model for the mean and variance of each generative layer had 2 layers and 25 hidden units. Rectified linear units were used as the nonlinearity for all hidden layers. The VRNN used code provided by the authors, which was modified to work for our data. These modifications included reducing the number of latent variables and hidden units, as well as allowing the model to take in extra inputs that it does not need to predict (i.e., we used velocity as an additional input, even though we only care about predicting position). Here, we fit the trajectories () directly, as opposed to the DLGM, which followed the control-goal (, ) formulation of the GMM.
5.2 Results
Our model captures the wide variability of trajectories exhibited by player data (Figure 2). The resemblance of our model’s generated data to real data, combined with the structured interpretability of the model, suggests that our assumptions are at least sufficient as an “as if” explanation of dynamics. Moreover, the generative nature of the model allows efficient calculation or approximation of quantities like momentary surprise or entropy of the goal mixture that are of interest as possible neural correlates. For comparison, we also present similar numbers of trials generated from the Deep Latent Gaussian Model (DLGM) of Rezende et al. [2014], a form of probabilistic variational autoencoder, and the Variational Recurrent Neural Network (VRNN) of Chung et al. [2015]. Both the DLGM (Figure 2(c)) and the VRNN (Figure 2(d)) suffer from mode collapse, whereas our GMM-based model (Figure 2(b)) reproduces all the major trajectory motifs of the real data (Figure 2(a)) while maintaining an interpretable structure. We ascribe the failure of the DLGM and VRNN to capture the variability in our data to the strong multimodality of trajectories at strategic decision points in the trial. In our experiments, despite initially producing a variety of trajectories, both models eventually succumbed to the mode collapse exhibited in Figures 2(c) and 2(d).
Based on the raw data (Figure 2(a)), it is clear that the players in our task had very distinct goals at the beginning of each trial. In order to demonstrate the effect of initial goal state on trial outcome, as well as gain insight into the interaction between agents in our model, we generated trials with specified initial goal states. These can be seen in Figure 3. Red Xs indicate the goalie’s initial goal, and blue Xs indicate the ball’s initial goal. Towards the beginning of the trial, the agents both strongly follow their initial goal, but as the trial progresses, the opponents’ trajectories begin to influence one another. In the case of Figure 3(a), the goalie (not shown), having initially guessed incorrectly, quickly turns around. At this point, the ball reaches a decision point, with some subsequent trajectories bending upward and some going down. In the case of Figure 3(b), however, the ball diverges from its initial goal more quickly because the goalie guesses correctly. At this point, the ball either tries to feint up and go down, or, less commonly, sharply turns up. These types of behaviors demonstrate the ability of our model to accurately capture the interaction inherent in the task, as opposed to simply memorizing trajectories.
5.3 Animations
In order to visualize the dynamics of our model in real time, we generated animations of individual trials (both real and generated from our model) with different visualizations.

Animation 1: A real trial, played by monkeys. (link)

Animation 2: The same real trial, with goal states inferred by our recognition model at each time point plotted with empty colored shapes corresponding to their agent. (link)

Animation 3: A trial generated from our model. (link)

Animation 4: The same generated trial, with corresponding goal states from our generative model. (link)
6 Related Work
There is a long history of investigating behavior in neuroscience from the viewpoint of control models, particularly for eye movements [Carpenter, 1988] and motor coordination [Todorov and Jordan, 2002]. Models based on recurrent neural networks [Sussillo and Abbott, 2009, Song et al., 2016] have been successful at generating patterns of activity described as multivariate time series, as have, more recently, spiking models [Thalmeier et al., 2016, Abbott et al., 2016]. However, the focus in the first case has been on the control process itself (much simplified in our formulation) and only secondarily on intentions, while in the second class of models, the focus has been on replicating output from a tutor signal rather than reproducing highly irregular real behavior. In this way, our work is closer in spirit to recent efforts to capture natural behaviors with simplified empirical models [Berman et al., 2016, Hong et al., 2015] that are more interpretable than pure black-box formulations.
Moreover, by formulating our model as a generative process, we receive the added benefit that training produces an artificial agent that replicates the tendencies and biases of real players, a key advantage for studies of interactive play. By adding player identity as an input to such a model, we should be able to capture and reproduce a variety of play styles. Finally, we can consider replacing our mixture model by a more structured temporal model like a hierarchical hidden Markov model, in which case our player can simply be viewed as a sequence of discrete (hierarchical) strategic choices with a complex observation model parameterized by a neural network. Our mixture model would then be equivalent to integrating out this HMM layer.
7 Conclusion
We have proposed a model of inverse reinforcement learning based on inferring latent trajectories for goal states through time. These goal states give rise to observed states via a control model, and their evolution is governed by a Gaussian process determined by a dynamic value function. The model combines two generative approaches: a variational autoencoder (for inferring posteriors over goals given observations) and a Gaussian mixture model (for approximating ).
Our model provides a structured method for understanding behavior in a dynamic, continuous-control task and allows for behavioral task complexity to scale up with the complexity of neural data. We have shown that such a model is able to reproduce the noisy and heterogeneous behavior of real agents engaged in a competitive video game, and that by modeling the value function explicitly, we are able to easily visualize the multimodal distribution over future strategies for any instantaneous configuration of the system. This constitutes a significant improvement over conventional models in psychology, animal behavior, and related fields that seek to model the decisions of agents in dynamic environments, as it dramatically increases model flexibility while retaining the interpretability of a cost function over latent goals. For our example data, the inferred goal states, as well as their associated value functions, give important clues as to the underlying strategies employed by players, as well as offering potential neural correlates of the decision process.
Acknowledgments
We would like to thank Michael Platt for sharing the data used in this work and Caroline Drucker and David Carlson for helpful discussions. This work was funded by NIH grant R01MH109728 (PI: Platt; JP Co-Investigator) and a BD2K career development award (K01ES025442) to JP.
References
 Pearson et al. [2014] John M Pearson, Karli K Watson, and Michael L Platt. Decision making: the neuroethological turn. Neuron, 82(5):950–965, 2014.
 Freeman et al. [2014] Jeremy Freeman, Nikita Vladimirov, Takashi Kawashima, Yu Mu, Nicholas J Sofroniew, Davis V Bennett, Joshua Rosen, Chao-Tsung Yang, Loren L Looger, and Misha B Ahrens. Mapping brain activity at scale with cluster computing. Nature methods, 11(9):941–950, 2014.
 Pnevmatikakis et al. [2016] Eftychios A Pnevmatikakis, Daniel Soudry, Yuanjun Gao, Timothy A Machado, Josh Merel, David Pfau, Thomas Reardon, Yu Mu, Clay Lacefield, Weijian Yang, et al. Simultaneous denoising, deconvolution, and demixing of calcium imaging data. Neuron, 89(2):285–299, 2016.
 Pandarinath et al. [2017] Chethan Pandarinath, Daniel J O’Shea, Jasmine Collins, Rafal Jozefowicz, Sergey D Stavisky, Jonathan C Kao, Eric M Trautmann, Matthew T Kaufman, Stephen I Ryu, Leigh R Hochberg, et al. Inferring single-trial neural population dynamics using sequential auto-encoders. bioRxiv, page 152884, 2017.
 Zheng et al. [2016] Stephan Zheng, Yisong Yue, and Jennifer Hobbs. Generating long-term trajectories using deep hierarchical networks. In Advances in Neural Information Processing Systems, pages 1543–1551, 2016.
 Moussaïd et al. [2009] Mehdi Moussaïd, Dirk Helbing, Simon Garnier, Anders Johansson, Maud Combe, and Guy Theraulaz. Experimental study of the behavioural mechanisms underlying self-organization in human crowds. Proceedings of the Royal Society of London B: Biological Sciences, 276(1668):2755–2762, 2009.
 Ng et al. [2000] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
 Abbeel and Ng [2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
 Dvijotham and Todorov [2010] Krishnamurthy Dvijotham and Emanuel Todorov. Inverse optimal control with linearly-solvable MDPs. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 335–342. machinelearning.wustl.edu, 2010.

 Park and Sandberg [1991] Jooyoung Park and Irwin W Sandberg. Universal approximation using radial-basis-function networks. Neural computation, 3(2):246–257, 1991.
 Beal [2003] Matthew James Beal. Variational algorithms for approximate Bayesian inference. University of London United Kingdom, 2003.
 Wainwright et al. [2008] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
 Ranganath et al. [2014] Rajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In AISTATS, pages 814–822, 2014.
 Kucukelbir et al. [2015] Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David Blei. Automatic variational inference in stan. In Advances in neural information processing systems, pages 568–576, 2015.
 Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-Encoding variational Bayes. 20 December 2013.
 Archer et al. [2015] Evan Archer, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. Black box variational inference for state space models. 23 November 2015.
 Gao et al. [2016] Yuanjun Gao, Evan Archer, Liam Paninski, and John P Cunningham. Linear dynamical neural population models through nonlinear embeddings. 26 May 2016.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 22 December 2014.
 Chung et al. [2015] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. 7 June 2015.
 Carpenter [1988] Roger HS Carpenter. Movements of the Eyes, 2nd Rev. Pion Limited, 1988.
 Todorov and Jordan [2002] Emanuel Todorov and Michael I Jordan. Optimal feedback control as a theory of motor coordination. Nature neuroscience, 5(11):1226–1235, 2002.
 Sussillo and Abbott [2009] David Sussillo and Larry F Abbott. Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4):544–557, 2009.
 Song et al. [2016] H Francis Song, Guangyu R Yang, and Xiao-Jing Wang. Training excitatory-inhibitory recurrent neural networks for cognitive tasks: A simple and flexible framework. PLoS computational biology, 12(2):e1004792, 2016.
 Thalmeier et al. [2016] Dominik Thalmeier, Marvin Uhlmann, Hilbert J Kappen, and Raoul-Martin Memmesheimer. Learning universal computations with spikes. PLoS computational biology, 12(6):e1004895, 2016.
 Abbott et al. [2016] LF Abbott, Brian DePasquale, and Raoul-Martin Memmesheimer. Building functional networks of spiking model neurons. Nature neuroscience, 19(3):350–355, 2016.
 Berman et al. [2016] Gordon J Berman, William Bialek, and Joshua W Shaevitz. Predictability and hierarchy in drosophila behavior. Proceedings of the National Academy of Sciences, 113(42):11943–11948, 2016.
 Hong et al. [2015] Weizhe Hong, Ann Kennedy, Xavier P Burgos-Artizzu, Moriel Zelikowsky, Santiago G Navonne, Pietro Perona, and David J Anderson. Automated measurement of mouse social behaviors using depth sensing, video tracking, and machine learning. Proceedings of the National Academy of Sciences, 112(38):E5351–E5360, 2015.