1 Introduction
An animal’s survival depends on effective planning for future costs and rewards. One of the most fundamental purposes of the brain is to create and execute such plans. However, these plans cannot be directly observed from behavior. To understand how the brain generates complex behaviors, and to learn how an animal builds a representation of its surrounding environment, it is valuable to construct hypotheses about the brain’s internal states that narrow the search space for neural implementations of planning. These hypotheses often come from models of the task implemented as artificial agents, whose internal state representations provide a latent space. However, differences between the model task and agent on the one hand, and the real task and animal on the other, create the potential for severe model mismatch, injecting unknown biases into scientific conclusions. Here we impute latent behavioral states directly from observed behavior, using a data-driven latent-variable analysis that is designed to match the dependency structure of agent-based models without enforcing parametric structure.
To understand the mechanisms underlying behaviors, it is crucial to study hard tasks that involve inferring latent variables, since only then will an animal need to create a mental model of the world; otherwise the animal could perform well simply by responding to its immediate sensory input. Naturalistic foraging is one such task, in which an agent has to choose among many difficult options in an uncertain environment. When foraging, an animal must take actions to procure rewards, and these actions have costs. How the animal schedules its actions determines the balance between total costs and rewards (Charnov and Orians, 2006). The animal’s goal in foraging is to use its energy resources for short-term and long-term sustenance. Decisions must be made continuously, and therefore time is a key ingredient in foraging: an animal benefits from tracking when reward is likely to be accessible at different locations. A natural way to represent such temporal quantities is in terms of dynamic event rates. For this reason, our work highlights the continuous-time aspects of decision problems.
Fig 1 illustrates our motivation for the foraging problem. An agent develops an internal model and takes an action, which may result in a reward. As a result, the agent updates its internal model in an attempt to learn the environmental dynamics. We explore the possibility that an animal’s internal states, evolving in continuous time, have measurable consequences for its behavior. To do so we use a switching hidden semi-Markov model, and we demonstrate the model’s applicability to inferring latent states in a foraging task.
2 Background
Behavior identification using computational models has a rich history and clear value: the ability to learn rich representations of behavioral constituents provides important insights into underlying neural processes, which can also be incorporated into the development of artificial agents (Anderson and Perona, 2014). Early behaviorists explored behavioral sequences in an attempt to learn the causal factors underlying behavior, aiming to explain effects such as when an agent switches to an alternate choice. These approaches are still common in animal ecology, where hidden Markov models (HMMs) of time series have been used to analyse animals’ internal states (Nathan et al., 2008; Langrock et al., 2012). Macdonald and Raubenheimer (1995) proposed using HMMs to capture causal structure in putative motivational states. However, they also observed that there is no one-to-one correspondence between the learned states and behavior, and Zucchini et al. (2008) found that behavior also influences internal states through feedback, challenging the dependency structure assumed by HMMs. To capture nonstationarity in behavior, Li and Bolker (2017) use temporally varying transition probabilities to model animal movement. However, behavior identification has struggled to produce more than a description of the behavior, with unknown relationships between the elicited latent states and the animal’s representations. These failures are less surprising once one recognizes that the behavior expressible by HMMs is incompatible with key characteristics of observed behavior.
In these works and others, an important question left unanswered is what kind of latent belief states can be inferred that represent not only belief dynamics but also the choices that an animal or an agent makes. We attempt to uncover latent belief states in a continuous-time model and apply it to a complex ecological process, foraging, which has multiple underlying subprocesses including satisfaction of needs, searching for alternatives, motivation, decision making, and control. We show that by generalizing HMMs to allow action-dependent transitions and more complex temporal dynamics, we can capture the expressivity of artificial agents designed for these domains, and extract highly interpretable representations from animal behavior.
3 Model
Ecological behavior in animals is often well characterized by quick transitions between discrete behavioral modes. These transitions are difficult to predict from external events, and instead reflect a shift of the animal’s internal state based on integrating events over a longer time scale. A process with quick transitions separated by long inter-event intervals can be approximated by a discrete-time hidden Markov process involving transition probabilities, but many of the probabilities (those for which the state is unchanged) will be close to one, while the remaining probabilities will be very small and will decrease with the discrete time step. Instead, we expect there will be advantages in treating these latent dynamics in continuous time, based on rates or time intervals between transitions and events.
A natural model to account for these point-like transitions in continuous time is the semi-Markov Jump Process (Rao and Teh, 2013). This process is a simple but powerful class of continuous-time dynamics featuring discrete states that transition according to a generator rate matrix, producing rich and flexible timing that is potentially better matched to animal behavior. In contrast, the times of transitions between states in a Markov process are exponentially distributed, which describes animal behavior poorly.
However, agents who control their environment affect transition rates through their actions, which means a single generator rate matrix is not sufficient to model behavior. An important example is the Belief MDP, which is a representation of a Partially Observable Markov Decision Process (POMDP; Kaelbling et al., 1998). POMDPs are a model for inference and control when sensory measurements provide only partial observations of the state of the world. Belief MDPs have distinct transition matrices that update beliefs differently for each action. Action-dependent transitions imply that a standard semi-Markov model with a single transition generator is not expressive enough to match action-dependent belief dynamics.
To allow for action-dependent belief dynamics, we propose a switching semi-Markov jump process (SMJP) model that matches an agent’s belief dynamics by switching its generator depending on the action. The latent state is discrete, and the generator is a rate matrix that can be interpreted as an instantaneous transition matrix. This generator defines a point process that jumps from one state to another at a given time according to the time-dependent matrix. The process can be implemented by sequentially sampling a time from the total rate leaving the current state, followed by sampling a new destination state according to the matrix evaluated at this sample time (Gillespie’s algorithm). An analogous process generates the observable events, through the emission generator matrix. The resulting process is similar to a simple Markov process, except that the time between transitions is stochastic and depends on the starting state (but not the end state), as illustrated in Fig 3. The animal’s behaviors and decision making are continuous, albeit only partially observable, at discrete recording times.
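The sampling procedure described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the generators are time-homogeneous (so holding times are exponential, whereas the semi-Markov model uses time-dependent rates), and the names `sample_jump`, `simulate`, `GENERATORS`, and the toy `policy` function are hypothetical.

```python
import random

def sample_jump(generator, state, rng):
    """One Gillespie step: draw a holding time, then a destination state."""
    # Off-diagonal entries of the current row are the exit rates.
    rates = [r if j != state else 0.0 for j, r in enumerate(generator[state])]
    total = sum(rates)                      # total rate of leaving `state`
    if total == 0.0:
        return float("inf"), state          # absorbing state: never leaves
    dwell = rng.expovariate(total)          # exponential holding time
    dest = rng.choices(range(len(rates)), weights=rates)[0]
    return dwell, dest

def simulate(generators, policy, state, t_end, rng):
    """Simulate a switching jump process; `policy(state)` picks the generator."""
    t, path = 0.0, [(0.0, state)]
    while True:
        action = policy(state)              # the action selects the generator
        dwell, dest = sample_jump(generators[action], state, rng)
        t += dwell
        if t >= t_end:
            return path
        state = dest
        path.append((t, state))

# Illustrative two-state generators (rows sum to zero); values are made up.
GENERATORS = {
    "stay": [[-1.0, 1.0], [0.5, -0.5]],
    "press": [[-2.0, 2.0], [2.0, -2.0]],
}
path = simulate(GENERATORS, lambda s: "stay" if s == 0 else "press",
                0, 10.0, random.Random(0))
```

Here the action is re-evaluated only at jump times, which is one possible design choice; an agent that can change its action between latent jumps would need the generator to be switched at action times as well.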
The Markov Jump Process extends discrete-time Markov processes to continuous time. Rao and Teh (2013) introduced Markov chain sampling methods that simplify these structures by introducing auxiliary variables. We adapt jump structures to provide a continuous-time representation for the free foraging task, in which the trajectory is generated using a generator matrix. The generator matrix has non-negative off-diagonal entries and negative diagonal entries, with each row summing to zero. We work with three matrices: the continuous-time transition matrix, the discrete-time transition matrix induced by uniformization, and the observation matrix.
Uniformization instantiates the Markov Jump Process as a sequence of discrete-time transition matrices, by introducing a latent sequence of random times that are adapted to the process generator but occur at a specified rate. For each interval we draw a random discretization vector of sampled times, and we impute sampled times for a trajectory. Using this notation, we sample the random times from a Poisson process with this rate as its intensity, and the states using the generator matrix. The hidden Markov model characterizes a sample path of a piecewise constant stochastic process over the ordered union of event times and randomly sampled discretized times. The chain can jump from a state to the same state or to any other state, while the emissions are observed only at certain specified times. Since we sample intervals with these virtual jump times, the constructed process represents the same chain.
To learn the discrete-time transition matrix and the emission matrix, we treat an ensemble of sample sequences of observed emissions as generated from an HMM, and update the matrices using an EM algorithm to best account for the available observations. However, if we sampled the discrete times only once, the estimates would be biased, so we resample latent trajectories repeatedly and randomly based on uniformization. The learned matrix is then used to update the generator matrix, using the relation between the two induced by uniformization while preserving the generator’s structure, and the random times are resampled to adapt to the modified generator. The resulting algorithm exploits uniformization to enable learning the generator via an EM algorithm, which is orders of magnitude more efficient than Gibbs sampling.
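As a concrete illustration of uniformization, the following sketch uses the standard textbook relation for a time-homogeneous generator: with generator `A` and a rate `omega` at least as large as the fastest exit rate, the induced discrete-time matrix is `B = I + A / omega`, and the generator is recovered as `A = omega * (B - I)`. The symbols `A`, `B`, `omega` and all function names are assumptions made for this example; the semi-Markov case with time-dependent rates is more involved.

```python
import random

def uniformize(generator, omega):
    """Discrete-time transition matrix B = I + A / omega."""
    n = len(generator)
    return [[(1.0 if i == j else 0.0) + generator[i][j] / omega
             for j in range(n)] for i in range(n)]

def generator_from(transition, omega):
    """Invert the relation: A = omega * (B - I)."""
    n = len(transition)
    return [[omega * (transition[i][j] - (1.0 if i == j else 0.0))
             for j in range(n)] for i in range(n)]

def poisson_times(omega, t_end, rng):
    """Candidate (possibly virtual) jump times from a Poisson process of rate omega."""
    times, t = [], rng.expovariate(omega)
    while t < t_end:
        times.append(t)
        t += rng.expovariate(omega)
    return times

# Illustrative 3-state generator: rows sum to zero, diagonal is negative.
A = [[-1.0, 1.0, 0.0],
     [0.5, -1.5, 1.0],
     [0.0, 2.0, -2.0]]
OMEGA = 2.0 * max(-A[i][i] for i in range(len(A)))  # must dominate every exit rate
B = uniformize(A, OMEGA)
```

Because `OMEGA` exceeds every exit rate, each row of `B` keeps positive mass on the diagonal; those self-transitions are the virtual jumps at which the state does not change.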
Belief MDPs are a convenient representation for POMDPs that treats current beliefs (posterior probabilities) over partially observable world states as fully observable. Agents following a Belief MDP transition between beliefs, take actions according to a policy, and expect observations according to their beliefs (Fig 3). The switching SMJP model matches the agent’s action-dependent belief dynamics by switching its generator conditional on the action. To infer the agent’s model from experimental observations, we develop an EM algorithm to estimate its parameters. When applied to our switching model, the forward, backward and update equations of the hidden Markov model (Rabiner, 1989) can be written as:
(1)
(2)
where the action switching indices are taken at times t and t+1 respectively. We adjust the model parameters to maximize the probability of the observation sequence given the model, and train using EM. Updates are made using the pairwise state probability, which is the probability of being in one state at one time step and in another state at the next, and is given as
(3)
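To make the action switching concrete, here is a minimal sketch of a scaled forward pass in the style of Rabiner (1989), in which the transition and emission matrices are selected by the action at each step. The names `switching_forward`, `PI`, `TRANS`, `EMIT`, and the convention that the action at step t-1 selects the transition into step t, are assumptions for illustration and may differ from the paper's equations.

```python
import math

def switching_forward(pi, trans, emit, actions, obs):
    """Scaled forward pass; returns the log-likelihood of `obs`.

    trans[a][i][j]: P(next state j | state i, action a)
    emit[a][j][o]:  P(observation o | state j, action a)
    actions[t] selects which matrices apply at step t (assumed convention).
    """
    n = len(pi)
    alpha = [pi[j] * emit[actions[0]][j][obs[0]] for j in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            a_prev = actions[t - 1]          # action that drove this transition
            e = emit[actions[t]]             # emission matrix for the current action
            alpha = [sum(alpha[i] * trans[a_prev][i][j] for i in range(n)) * e[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)                   # rescale to avoid numerical underflow
        loglik += math.log(scale)
        alpha = [x / scale for x in alpha]
    return loglik

# Illustrative two-state, two-action, two-observation model (values made up).
PI = [0.6, 0.4]
TRANS = [[[0.7, 0.3], [0.2, 0.8]],
         [[0.5, 0.5], [0.9, 0.1]]]
EMIT = [[[0.9, 0.1], [0.3, 0.7]],
        [[0.6, 0.4], [0.1, 0.9]]]
ACTIONS = [0, 1, 0]
OBS = [0, 1, 1]
loglik = switching_forward(PI, TRANS, EMIT, ACTIONS, OBS)
```

With a single action the function reduces to the ordinary HMM forward algorithm, which mirrors the special-case relationship between the usual and the switching model.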
The usual semi-Markov model is a special case of the switching semi-Markov model in which the generator remains the same, without action-dependent switching. Our model is a switching model that changes its rate, transition and emission matrices in accordance with the action taken by the agent.
We learn the model using an EM approach, updating model parameters given transition times sampled by the uniformization process, and resampling the transitions given the new model parameters. An overview of the algorithm is shown in Fig 4. The in-between times are sampled via a trajectory in latent space, providing a continuous time series that is then used for learning via a hidden Markov model. After learning the emission and transition matrices for a sampled set, we use a scaling factor to update the rate matrix while preserving its structure, resample to get a new continuous-time sequence, and learn the emission and transition matrices again. This process is repeated until the log-likelihood on held-out data stops changing within a small tolerance.
4 Experiment
We perform three experiments. First, we use simulated toy data both to estimate the required training size and to confirm that the switching model is able to learn latent states. Second, we establish a correspondence between the belief states of a partially observable Markov decision process and the SMJP latent states, using a theoretical optimal agent model. Third, we apply our method to a real agent in a free foraging task. The number of states was selected as the value at which the log-likelihood on the validation set stops improving.
4.0.1 Simulated toy data
To create toy test data generated by the assumed model, we set up two transition matrices and one emission matrix with 5 states, 2 emissions, and observation-dependent actions. The expected size of the output sequence is set to 5000. The initial action is selected randomly, and a transition matrix is selected based on the action index. The selected transition matrix and the emission matrix are then used to sample a state transition and generate an emission. The observations, times and actions are appended to the output sequence, and the observation-dependent action is updated before the next observation is generated. The simulated toy data sequence serves as a basic check of whether the SMJP model can learn and explain the observations. We fit the switching SMJP model to the simulated data and observe that the log-likelihood starts stabilizing as the number of latent states reaches the true number of states. This means that the model is able to explain the test data with an equivalent number of latent states. We therefore follow a similar procedure to estimate the required number of latent states for both the optimal agent and the real agent.
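The generation procedure can be sketched as follows. This is only an illustration: the rule that the next action equals the last observation, the unit-rate exponential inter-event times, and the randomly drawn matrices are all assumptions, since the text does not specify them.

```python
import random

def random_stochastic(n_rows, n_cols, rng):
    """Random matrix whose rows are probability distributions."""
    rows = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(n_rows)]
    return [[x / sum(r) for x in r] for r in rows]

def simulate_toy(trans, emit, n_steps, rng):
    """Generate (time, action, observation) triples from a switching model."""
    n_states = len(emit)
    state = rng.randrange(n_states)
    action = rng.randrange(len(trans))          # initial action chosen at random
    t, out = 0.0, []
    for _ in range(n_steps):
        row = trans[action][state]              # the action selects the transition matrix
        state = rng.choices(range(n_states), weights=row)[0]
        obs = rng.choices(range(len(emit[state])), weights=emit[state])[0]
        t += rng.expovariate(1.0)               # assumed unit-rate inter-event times
        out.append((t, action, obs))
        action = obs                            # assumed observation-dependent action rule
    return out

rng = random.Random(1)
TRANS = [random_stochastic(5, 5, rng) for _ in range(2)]   # two transition matrices
EMIT = random_stochastic(5, 2, rng)                        # one emission matrix
data = simulate_toy(TRANS, EMIT, 5000, rng)
```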
4.0.2 Optimal agent
To test our SMJP model we fit it to an optimal agent performing a foraging task. We model the beliefs of an ideal observer in this task using a POMDP. There is a one-to-one correspondence between a POMDP over partially observable world states and a fully observed Belief MDP in which the ‘state’ is the ‘belief’, or posterior distribution over the world state.
We solve this optimal actor problem using a Belief MDP on a discretized belief space. The agent keeps track of its belief state about the world following transition dynamics that map the current state and an action to a new belief state. The agent’s sensory information depends probabilistically on the world state. Upon taking an action, the agent receives an immediate reward. The goal of the agent is to maximize the long-term expected reward. Our model agent achieves this goal using a policy obtained by value iteration on the discretized belief states.
The beliefs serve as latent states which control the agent’s behaviors, and give its actions a non-exponential interval distribution, which is recapitulated by the fitted switching SMJP. We find that the likelihood of the observed data is maximal for a number of states that is smaller than the true size of the underlying POMDP belief space, indicating that the semi-Markov process is able to compress the agent’s dynamics into a smaller effective number of latent states. To validate the semi-Markov model in our foraging task, we discover the latent states of the artificial agent, for which we know the ground truth. We model this agent as a near-optimal actor that maximizes reward given partial observations of the true process. This agent maintains beliefs about the availability of food at different locations. Our agent is suboptimal because we do not store the beliefs with arbitrary precision, but rather discretize the beliefs to a finite resolution, and allow some diffusion between those belief states.
4.0.3 Application to the freeforaging task
We apply the SMJP model to infer latent states of agents performing a simple foraging task. We applied the model both to theoretical agents with near-optimal behavior and to real agents (macaques) whose behavior we measured experimentally. In this task, two boxes contained rewards that became available after random, exponentially distributed time intervals. If an agent presses a lever on one box when the food is available, that reward is released and that box’s timer is reset. The benefit of the reward is offset by two action costs: pressing the lever, and switching boxes. The state of the box is not observable, so the agent must choose its action based on an internal belief about the box, with the presumed goal of maximizing total reward minus costs. This internal belief constitutes a latent state that we infer using the semi-Markov process, both from the artificial agent and from behaving monkeys.
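A worked example may help show why the belief about a box is a natural latent state. Under the assumption that a box fills after an exponentially distributed interval and that a lever press reliably reveals (and empties) the box, the belief that food is available grows toward one with elapsed time and resets after every press. The function names and the numerical values below are illustrative assumptions, not the paper's agent.

```python
import math

def update_belief(belief, dt, mean_interval):
    """Belief that food is available after waiting `dt` more seconds."""
    # Probability that an empty box fills within dt under an exponential timer.
    p_arrive = 1.0 - math.exp(-dt / mean_interval)
    return belief + (1.0 - belief) * p_arrive

def after_press():
    """A press reveals the box state; rewarded or not, the box is now empty."""
    return 0.0

# Illustrative use: belief after waiting 10 s at a box with a 10 s mean interval.
b = update_belief(after_press(), 10.0, 10.0)
```

Because the exponential distribution is memoryless, updating twice over 5 s gives the same belief as updating once over 10 s, so the belief depends only on the time elapsed since the last press at that box.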
We applied the SMJP model to infer latent states of macaques performing a simple two-box foraging task. The animal moved freely between two feeding boxes with levers that released food after an exponentially distributed random time interval (mean of 10 or 30 sec) had passed. The model observations were lever pressing, reward delivery, and location within the box (Fig 6a). Actions were: stay, move, or press either lever. The monkey’s movements were tracked using overhead video, and quantized by k-means into discrete locations. The number of latent states is estimated by log-likelihood maximization (Fig 6b). The resulting process constructs the monkey’s latent states to explain the non-exponentially distributed intervals between lever presses (Fig 6).
5 Results and Discussion
5.0.1 Optimal agent
We trained the SMJP on an observation sequence generated by the optimal agent, and optimized the number of SMJP latent states by maximizing the log-likelihood of held-out data. While the Belief MDP agent’s relevant states (including location, reward, and beliefs) should be implicitly embedded in the SMJP latent states, these two state representations are not immediately comparable.
To establish a correspondence, we compute the joint distribution over the SMJP latent states and the POMDP states at any one time point, using the shared time series of observations. This joint distribution shows which SMJP and POMDP states tend to occur at the same time. It therefore provides a dictionary for translating the interpretable POMDP states into our learned and unlabeled SMJP states.
To increase interpretability, we cluster the states using information-theoretic co-clustering (Dhillon et al., 2003), which provides a principled coarse-graining of the states with improved semantic interpretability. We determine the required numbers of SMJP and POMDP co-clusters by minimizing the information loss in information-theoretic co-clustering. Fig 5b shows that latent SMJP states are associated with different belief states. Co-clustering also reveals that the SMJP latent states have dynamics that match the belief dynamics (not shown). These results demonstrate that the switching SMJP model can capture latent belief states and dynamics for behavioral data.
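The state dictionary described above can be illustrated with a small sketch that tabulates an empirical joint distribution from two time-aligned state sequences and measures the mutual information that a co-clustering would try to preserve. It assumes hard state assignments at each time point (for example, most likely decoded states); the function names and toy sequences are hypothetical, and this is not the co-clustering algorithm of Dhillon et al. itself.

```python
import math
from collections import Counter

def joint_distribution(smjp_states, pomdp_states):
    """Empirical joint P(z, s) from two time-aligned state sequences."""
    counts = Counter(zip(smjp_states, pomdp_states))
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

def mutual_information(joint):
    """I(Z; S) in bits: how much one state labelling tells us about the other."""
    pz, ps = Counter(), Counter()
    for (z, s), p in joint.items():
        pz[z] += p                      # marginal over SMJP states
        ps[s] += p                      # marginal over POMDP states
    return sum(p * math.log2(p / (pz[z] * ps[s])) for (z, s), p in joint.items())

# Toy sequences: SMJP state 0 always co-occurs with "low", state 1 with "high".
joint = joint_distribution([0, 1, 0, 1], ["low", "high", "low", "high"])
info = mutual_information(joint)
```

Information-theoretic co-clustering then groups rows and columns of this joint table so that the mutual information of the coarse-grained table stays as close as possible to that of the original.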
5.0.2 Real agent
The SMJP model constructs latent states and dynamics using the real agent’s observations to predict choices and timing, including the non-exponentially distributed intervals between lever presses. Fig 6c shows states extracted for the action ‘stay’. Beliefs precede an action, and the extracted states reflect beliefs for the next action. For example, being in certain states is rewarding to the monkey. States that can be interpreted as ‘expectant waiting for reward’ are highlighted (Fig 6c): these states form a self-exciting delay network that is activated from other rewarded belief states. Moreover, the lower entropy of latent states associated with box 1 revealed guarding behavior that we identified from video. Overall, the model network encodes a set of complex but interpretable dynamics of the animal’s beliefs and reward expectations, which emphasize the complex computations underlying the decision-making process.
Each transition matrix acts like an action operator, and the real agent performs operations in sequences. We therefore examine joint operators formed from pairs of operators for individual actions. We use an off-the-shelf package based on modularity clustering (Brandes et al., 2008) to extract subgraphs, and then persistent subspaces, from all six joint operators corresponding to different action pairs. Fig 6d shows subgraphs for two joint operators of interest (involving the actions lever press and stay). The latent states within the persistent subspaces that appear in the same subgraphs of the joint operators illustrate the real agent’s persistent reward belief states. The states outside these subspaces correspond to other beliefs, for example, switching. These results demonstrate that the presented model is able to extract subtle, albeit complex, structure in the belief states and their dynamics. The extracted latent states and dynamics will be useful regressors for finding neural correlates of the computations underlying the monkey’s behavioral dynamics.
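One plausible reading of a joint operator is the product of two single-action transition matrices, so that applying it corresponds to taking one action and then the other. The sketch below assumes exactly that, with row-stochastic matrices acting on row-vector state distributions; the action names, matrix values, and function names are illustrative, and the subgraph extraction step is not shown.

```python
def matmul(p, q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(p[i][x] * q[x][j] for x in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def joint_operators(trans):
    """Products T_a T_b for every ordered pair of distinct actions (a first, then b)."""
    return {(a, b): matmul(trans[a], trans[b])
            for a in trans for b in trans if a != b}

# Illustrative two-state transition matrices for three actions (values made up).
TRANS = {
    "stay": [[0.9, 0.1], [0.2, 0.8]],
    "move": [[0.4, 0.6], [0.6, 0.4]],
    "press": [[0.5, 0.5], [0.3, 0.7]],
}
JOINT = joint_operators(TRANS)
```

With three actions this construction yields six ordered pairs. Each product is again row-stochastic, so it can be treated as a weighted directed graph over latent states when looking for densely connected subgraphs.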
6 Conclusion
We presented a continuous-time switching semi-Markov model that learns latent state dynamics in conformance with the belief structure of a partially observable Markov decision process. The revealed latent states can capture complex animal behavior and its belief dynamics in naturalistic tasks like foraging. Several aspects of the inferred behaviors and belief dynamics were examined, revealing that the internal latent structural representation indeed matches the agent’s belief structure. The data-driven switching semi-Markov model provides useful estimates of the structure of the internal latent states for hard tasks. The latent states from this behavioral model could potentially be used to understand correspondences between neural activity and the latent belief dynamics that govern how an animal selects actions.
Acknowledgments
The authors thank Dora Angelaki, Valentin Dragoi, Neda Sahidi and Russell Milton for useful discussions. AK, ZW, XP and PS were supported by BRAIN Initiative grant NIH 5U01NS094368.
References
Anderson, D.J. and Perona, P. (2014). Toward a science of computational ethology. Neuron, 84(1), 18–31.
Brandes, U., Delling, D., Gaertler, M., Gorke, R., Hoefer, M., Nikoloski, Z. and Wagner, D. (2008). On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2), 172–188.
Charnov, E. and Orians, G.H. (2006). Optimal foraging: some theoretical explorations.
Dhillon, I.S., Mallela, S. and Modha, D.S. (2003). Information-theoretic co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 89–98).
Kaelbling, L.P., Littman, M.L. and Cassandra, A.R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1), 99–134.
Langrock, R., King, R., Matthiopoulos, J., Thomas, L., Fortin, D. and Morales, J.M. (2012). Flexible and practical modeling of animal telemetry data: hidden Markov models and extensions. Ecology, 93(11), 2336–2342.
Li, M. and Bolker, B.M. (2017). Incorporating periodic variability in hidden Markov models for animal movement. Movement Ecology, 5(1), 1.
Macdonald, I.L. and Raubenheimer, D. (1995). Hidden Markov models and animal behaviour. Biometrical Journal, 37(6), 701–712.
Nathan, R., Getz, W.M., Revilla, E., Holyoak, M., Kadmon, R., Saltz, D. and Smouse, P.E. (2008). A movement ecology paradigm for unifying organismal movement research. Proceedings of the National Academy of Sciences, 105(49), 19052–19059.
Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Rao, V. and Teh, Y.W. (2013). Fast MCMC sampling for Markov jump processes and extensions. Journal of Machine Learning Research, 14(1), 3295–3320.
Zucchini, W., Raubenheimer, D. and MacDonald, I.L. (2008). Modeling time series of animal behavior by means of a latent-state model with feedback. Biometrics, 64(3), 807–815.