I Introduction
We are interested in an active perception problem, where an autonomous agent interacts with a second agent (which we refer to as the opponent hereafter) whose intention is unknown. The objective is to accumulate evidence that helps identify the intent of the opponent. To achieve this goal while ensuring safety, the autonomous agent has to reason about the possible reactions of the opponent in response to its exploring actions and maximize the information gained from the interaction. This type of problem finds applications in domains such as urban security and humanitarian assistance.
One field of research closely related to our work is Threat Assessment (TA). One approach to TA is Adversarial Intention Recognition (AIR), i.e., recognizing the intentions of a potential adversary based on observations of its actions and states. Early works in AIR rely on a library of adversarial plans [1], [2], explicitly encoding a set of expected adversarial behaviors; this approach can suffer from incompleteness of the behavior library. More recent advancements in AIR use a two-phase approach combining generative plan recognition and game-theoretic planning [3], [4], [5]. In the first phase, a probability distribution over the potential intentions of the adversary is inferred by solving an inverse planning problem, given an agent action model and a set of possible intentions. In the second phase, a set of stochastic games, each corresponding to one specific hostile intention, is solved, and the Nash Equilibrium policy is obtained as the best response to the adversary. This existing framework, however, does not account for adversarial strategies that actively work against intention recognition, e.g., deceptive behaviors.
Our active perception problem differs from the AIR problem in that: 1) the opponent is not necessarily adversarial, but could be self-interested; and 2) the primary objective is discriminating potential threats through evidence accumulation rather than defending against adversarial attacks. Despite these subtle differences, the two problems share the same challenge: reasoning about the hidden intention (partial observability) of the potential adversary subject to modeling uncertainty in the opponent behavior, as illustrated in Fig. 1. Therefore, we expect that the techniques in our approach are also applicable to AIR problems.
Belief space planning provides a principled framework for acting optimally in a partially observable world. Generative adversary modeling provides a variety of adversarial behaviors against which the autonomous agent can optimize its policy, so that robustness is gained; it also avoids the need for domain experts and the difficulty of handcrafting a library of adversary behaviors. Reinforcement learning makes it possible to learn a policy from a black-box simulator representing a complicated mixture of behaviors, which would be difficult for planning-based approaches. The maximum entropy framework minimizes the exploitability of the resulting policy by minimizing the predictability of its actions, thus making it more robust to unmodeled adversarial strategies.
To summarize, our contribution is a scalable, robust active perception method for scenarios where a potentially adversarial opponent may be actively hostile to the intent recognition activity; it extends and outperforms existing POMDP methods.
II Related works
We review three related fields of research:

- POMDP: finds a deterministic optimal policy in a partially observable scenario, given a reasonably accurate agent model;
- Game theory: finds robust Nash Equilibrium policies given a payoff (reward) profile encoding the preferences (dictated by the intents) of the agents, without requiring an agent model;
- Deep reinforcement learning: enables learning an optimal policy from a simulator; this is the building block on which our work is based.
II-A POMDP frameworks
The POMDP framework provides a principled approach to behaving optimally in a partially observable environment.
One restriction of this framework is that it assumes a fixed, reactive, probabilistic model of the opponent, implying stationary behavior without rationality. To mitigate performance degradation due to modeling uncertainty, existing approaches include Bayesian-Adaptive POMDP (BA-POMDP) [6], [7], robust POMDP [8], Chance-constrained POMDP (CC-POMDP) [9], and Interactive POMDP (I-POMDP) [10].
BA-POMDP augments the state space with state-transition and state-observation count variables as additional hidden states [6], [7]. It maintains a belief over the augmented state space, resulting in an optimal trade-off between model learning and reward collection. An implicit assumption of this approach is that the unknown POMDP model is either fixed or varying more slowly than the model learning process, which is unlikely to hold in active perception, where an adversarial opponent might be learning and adapting too.
An alternative approach is to find a robust policy, which does not assume a fixed model. Robust POMDP assumes the true transition and observation probabilities belong to a bounded uncertainty set, and optimizes the policy for the worst case [8].
CC-POMDP finds an optimal deterministic policy that satisfies chance constraints. This formulation yields a better trade-off between robustness and nominal performance than robust POMDP. While it shows promising results in challenging risk-sensitive applications [9], we argue that the class of uncertainty considered in the CC-POMDP formulation may be too restrictive for our application; a framework that utilizes a wide class of adversary models to find a stochastic optimal policy should exhibit better robustness against an adversary.
I-POMDP extends the POMDP framework by augmenting the hidden space of each agent with a type attribute, which includes the agent's preference and belief. The agents reason about the types of the other agents, resulting in nested beliefs. The issue with this approach is that the type space is so large that inference becomes computationally intractable [10]. In practice, a finite set of possible models for each agent is assumed, which reintroduces the problem of model incompleteness.
II-B Game-theoretic frameworks
The POMDP framework simplifies the active perception problem into a single-agent planning problem. An alternative, multi-agent view of this problem is a Bayesian game, in which the autonomous agent is unsure about the identity of the opponent. In such a problem, Bayesian Nash Equilibrium strategies are the optimal solutions. Regret Minimization (RM) [11] and Fictitious Play (FP) [12] are two algorithmic frameworks for finding Nash Equilibria; recent extensions of these frameworks with deep neural networks have achieved success in imperfect-information zero-sum games [13], [14]. One significant advantage of this framework over POMDP is that no agent model is explicitly required. Instead, the opponent behavior is implicitly specified by the joint payoff (reward), assuming rationality. The Nash Equilibrium policy is robust in the sense that it is the best response to a perfect opponent. In reality, however, the opponent may have bounded rationality and reasoning capability, which is non-trivial to model in this game-theoretic framework. Moreover, we often have a strong prior over the probable opponent behaviors, which the game-theoretic approach completely neglects. We expect that algorithms exploiting a reasonable opponent model can outperform the Nash Equilibrium in both the nominal and mildly off-nominal cases. Finally, convergence of RM and FP in general-sum two-player games is not established, making it difficult to apply these techniques to our active perception problem, since the opponent is not necessarily adversarial, but could be indifferent (e.g., a civilian).

Based on the above discussion, we anticipate a correlation between model uncertainty and the performance of different classes of algorithms, as illustrated in Fig. 2. A desired solution should be near-optimal given a good model, and should degrade gracefully with increasing model uncertainty.
II-C Deep reinforcement learning
Deep reinforcement learning has led to several recent breakthroughs in solving difficult problems in both MDP [15], [16] and POMDP [17], [18] domains, in both single-agent and multi-agent [19], [20], [21] settings. The difficulty of deep reinforcement learning in multi-agent domains stems from the non-stationarity of the perceived environment dynamics caused by the learning processes of the other agents. This non-stationarity destabilizes value-function-based methods and causes high variance for policy-based methods [22]. Many recent works focus on developing learning algorithms that converge in this multi-agent setting [19], [23], [24], while relatively little work addresses agent modeling [25]. We argue that modeling the opponent is crucial in our active perception problem, because otherwise it becomes challenging, if not impossible, to define exploring behavior. Another benefit of maintaining a model is that the action-observation history can be compactly summarized into a belief state, which retains the Markov property even in partially observable settings. This property is crucial to the convergence of many reinforcement learning algorithms.

III Approach
In this section, we describe our algorithm for addressing the identified challenges, i.e., active perception robust to unmodeled adversarial strategies. We first formalize the problem and list our assumptions. We then give a general description of the algorithm, followed by implementation details.
We model the active perception problem as a planning problem defined by the tuple ⟨S, A, A^o, T, O, R, p_a, γ⟩, where S is the state of the world, consisting of the set of observable states and the set of partially observable states; A is the set of actions of the autonomous agent; A^o is the set of actions of the opponent (we further assume that, regardless of its intention, the opponent has the same set of observable actions; otherwise, an intention would be trivially identifiable once an action unique to that intention is observed); T: S × A × A^o → Δ(S) is the transition probability, where Δ(X) denotes the space of probability distributions over a space X; O is the observation probability; R is the reward function; p_a is the prior probability of the opponent being an adversary; and γ is the discount factor.

We make the following modeling assumptions:

- The opponent is either a civilian or an adversary with hostile intent.
- A civilian opponent is self-interested, and its behavior can be modeled by a reactive policy.
- A hostile opponent is primarily goal-directed, as defined by a known MDP.
- A hostile opponent has bounded rationality, meaning it may not always take the optimal action; moreover, it is likely to behave deceptively in order to achieve its goal.
We also assume that a reasonably accurate civilian behavior model is available. We then generate a parametric set of hostile models with two parameters representing the level of rationality and the level of deception, respectively. We use a feedforward neural network (NN) to represent the policy of the autonomous agent. This NN takes as input a binary belief state, obtained by Bayesian filtering of the hidden intention based on an average model, and outputs a stochastic policy. The reward function is composed of a belief-dependent reward, which encourages exploring behavior, and a state-dependent reward, which ensures safety. In order to minimize exploitability, we apply the soft-Q learning algorithm [26], which learns a maximum entropy policy.

We present the details of agent modeling, the belief space reward, and policy learning in the following subsections.
III-A Opponent modeling
We use a binary variable z to denote whether the opponent is a civilian (z = 0) or an adversary with hostile intent (z = 1). Depending on z, the opponent is expected to exhibit different behaviors, fully described by an opponent policy π(a^o | s; z). This model is restrictive, since the action probability depends only on the current state. Nonetheless, we use this model only for policy learning, and use a general history-dependent opponent policy when evaluating the learned autonomous-agent policy. Another implicit assumption of this model is that the opponent has full observability over the states; this assumption could be relaxed by modeling the opponent as a POMDP agent.

Civilian model: If the opponent is a civilian, i.e., z = 0, we assume a simple reactive policy π_c is available to model the opponent:

(1) π(a^o | s; z = 0) = π_c(a^o | s)
Adversary model: We model an adversarial agent's policy π(a^o | s; z = 1) as

(2) π(· | s; z = 1) = argmin_π [ (1 − β) D_KL(π ‖ π_g(· | s)) + β D_KL(π ‖ π_c(· | s)) ]

(3) π_g(a^o | s) = exp( Q*(s, a^o) / τ ) / Z(s)
where D_KL denotes the Kullback–Leibler divergence between two distributions. The goal-achieving policy π_g is associated with the optimal Q function Q* of a goal-achieving adversary MDP, defined below. The temperature parameter τ in (3) represents the level of rationality of the adversary; the other parameter, β, indicates the level of deception. Z(s) is the partition function that normalizes π_g.

We assume that the goal-directed behavior can be modeled by a known adversary MDP ⟨S_a, A^o, T_a, R_a, γ_a⟩, where S_a = S × A is the state space of the active perception problem augmented by the action space of the autonomous agent. This implies that if the autonomous agent takes different actions, the opponent will be in different MDP states even when the world state is the same, allowing the opponent to respond differently. The action space of the adversary MDP is the same as that of the active perception problem, and so is the transition probability T_a. The reward function R_a specifies the reward for the adversary MDP, which differs from that of the active perception problem. The discount factor γ_a may differ from γ.

The interpretation of (2) and (3) is that the adversary policy balances the goal-achieving actions of the adversary MDP (first term in (2)) against deceptive actions that imitate the civilian policy (second term in (2)). By varying the two hyperparameters τ and β, we obtain a set of policies describing a variety of adversary behaviors, which is expected to make the optimized active perception policy more robust. We therefore assume a uniform hyperprior over these two hyperparameters:

(4) p(τ, β) ∝ 1 (uniform over bounded ranges of τ and β)
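As a concrete illustration, the goal-achieving policy and one plausible instantiation of the deceptive combination can be sketched as follows. The normalized geometric mixture below is the minimizer of a β-weighted sum of KL divergences to the goal and civilian policies; the exact combination rule, the sampling ranges, and all function names here are illustrative assumptions, not the verbatim definitions of (2) and (3).

```python
import numpy as np

def goal_policy(q_values, tau):
    """Soft-rational goal-achieving policy in the spirit of Eq. (3):
    a Boltzmann distribution over the adversary MDP's optimal Q-values,
    with temperature tau controlling the level of rationality."""
    logits = np.asarray(q_values, dtype=float) / tau
    logits -= logits.max()                 # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def adversary_policy(q_values, pi_civ, tau, beta):
    """Hypothetical instantiation of the deceptive adversary of Eq. (2):
    a normalized geometric mixture of the goal policy and the civilian
    policy, with beta the level of deception (beta = 0: purely
    goal-directed, beta = 1: indistinguishable from a civilian)."""
    pi_g = goal_policy(q_values, tau)
    mix = pi_g ** (1.0 - beta) * np.asarray(pi_civ, dtype=float) ** beta
    return mix / mix.sum()
```

Note that with beta = 0 the mixture collapses to the goal policy, and with beta = 1 to the civilian policy, matching the intended interpretation of the deception parameter.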
III-B Belief space reward
We maintain a belief over the hidden variable z by Bayesian filtering, which requires models for both the civilian and the adversary. The civilian model is given by Eq. (1). As the adversary model includes two continuous parameters, inference over the joint space of (z, τ, β) is expensive and might not be helpful, because this handcrafted model may not match the real adversary behavior. We instead use an average model, obtained by marginalizing out the hyperparameters:

(5) π̄(a^o | s; z = 1) = ∫ π(a^o | s; z = 1, τ, β) p(τ, β) dτ dβ
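The marginalization and the resulting binary intent filter can be sketched as follows. The Monte Carlo averaging, the hyperparameter sampling ranges, and the function names are illustrative assumptions for a small discrete action space.

```python
import numpy as np

def average_adversary_policy(policy_fn, n_samples=200, seed=0):
    """Monte Carlo approximation of the marginalized model of Eq. (5):
    average the parametric adversary policy over samples from the
    uniform hyperprior.  policy_fn(tau, beta) returns an
    action-probability vector; the sampling ranges are assumptions."""
    rng = np.random.default_rng(seed)
    taus = rng.uniform(0.1, 2.0, n_samples)    # assumed range of rationality tau
    betas = rng.uniform(0.0, 1.0, n_samples)   # assumed range of deception beta
    return np.mean([policy_fn(t, b) for t, b in zip(taus, betas)], axis=0)

def belief_update(b_adv, likelihood_adv, likelihood_civ):
    """One Bayesian filtering step over the binary intent variable:
    likelihood_adv / likelihood_civ are the probabilities that the
    averaged adversary model / civilian model assign to the observed
    opponent action in the current state."""
    p_adv = b_adv * likelihood_adv
    p_civ = (1.0 - b_adv) * likelihood_civ
    return p_adv / (p_adv + p_civ)
```

For example, starting from a belief of 0.5, observing an action that the averaged adversary model considers unlikely drives the belief down, and vice versa.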
With (1) and (5), the Bayesian update rule for the belief is well defined. We define a hybrid belief-state-dependent reward to balance exploration and safety:

(6) r(b, s, a) = −H_b(b) + r_s(s, a)

where we use the shorthand b to denote the belief that the opponent is an adversary, H_b is the entropy of this binary belief, and r_s is the state-dependent reward. The reward (6) balances exploring behavior and safety: the negative-entropy term can be interpreted as maximizing the expected logarithm of the true positive rate (TPR) and the true negative rate (TNR), while the state-dependent reward ensures safety. For instance, actions that could be dangerous to civilians are discouraged by a large negative reward.
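This hybrid reward can be sketched as follows; the trade-off weight `w` and the exact weighting of the two terms are assumptions for illustration.

```python
import numpy as np

def belief_reward(b_adv, r_state, w=1.0):
    """Hybrid belief-space reward in the spirit of Eq. (6): the negative
    entropy of the binary intent belief, which rewards confident beliefs
    and thus encourages evidence accumulation, plus the state-dependent
    safety reward r_state.  The trade-off weight w is an assumption."""
    b = float(np.clip(b_adv, 1e-12, 1.0 - 1e-12))   # clip to avoid log(0)
    neg_entropy = b * np.log(b) + (1.0 - b) * np.log(1.0 - b)
    return w * neg_entropy + r_state
```

The reward is lowest at a maximally uncertain belief (b = 0.5) and rises toward zero as the belief becomes confident in either direction, which is exactly the exploration pressure described above.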
III-C Policy learning
We use soft-Q learning [26] to learn a stochastic belief space policy. The soft-Q learning objective is to maximize the expected reward regularized by the entropy of the policy:

(7) π* = argmax_π Σ_t E_{(s_t, a_t) ∼ π} [ r(s_t, a_t) + α H(π(· | s_t)) ]

The parameter α controls the 'softness' of the policy. This objective has a nice interpretation: maximize the accumulated reward while behaving as unpredictably as possible, which is a desirable property against an adversary.
This maximum entropy problem is solved using soft-Q iteration. For a discrete action space, the fixed-point iteration

(8) Q_soft(s, a) ← r(s, a) + γ E_{s'} [ V_soft(s') ]

(9) V_soft(s) ← α log Σ_a exp( Q_soft(s, a) / α )

converges to the optimal soft value functions Q*_soft and V*_soft [26], and the optimal policy can be obtained from

(10) π*(a | s) = exp( ( Q*_soft(s, a) − V*_soft(s) ) / α )
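For a small tabular problem, this fixed-point iteration can be sketched directly; the MDP shapes and iteration count below are illustrative.

```python
import numpy as np

def soft_q_iteration(R, P, gamma=0.95, alpha=1.0, n_iters=500):
    """Tabular soft-Q fixed-point iteration (Eqs. 8-10) for a small
    discrete MDP.  R[s, a] is the reward, P[s, a, s'] the transition
    probabilities, and alpha the entropy temperature."""
    n_s, n_a = R.shape
    Q = np.zeros((n_s, n_a))
    for _ in range(n_iters):
        # Soft value: V(s) = alpha * log sum_a exp(Q(s, a) / alpha)
        V = alpha * np.log(np.exp(Q / alpha).sum(axis=1))
        # Soft Bellman backup: Q(s, a) = R(s, a) + gamma * E[V(s')]
        Q = R + gamma * (P @ V)
    # Optimal maximum-entropy policy: pi(a|s) = exp((Q - V) / alpha)
    V = alpha * np.log(np.exp(Q / alpha).sum(axis=1))
    pi = np.exp((Q - V[:, None]) / alpha)
    return Q, pi
```

Each row of the returned policy is a proper distribution, and lowering alpha makes the policy sharper (closer to greedy), while raising it makes the policy more uniform and thus less predictable.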
III-D Implementation details
In order to stabilize the training of the soft value functions, a separate target value network is used, whose parameters are an exponential moving average of the value network's parameters, with a fixed averaging coefficient. During training, the value on the right-hand side of (8) is computed with this target value network.
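The target-network update amounts to a simple exponential moving average; the coefficient value and the flat parameter representation below are illustrative.

```python
def ema_update(target, online, coeff=0.995):
    """Exponential-moving-average target update used to stabilize soft
    value training: target <- coeff * target + (1 - coeff) * online.
    Parameters are plain lists of floats here for illustration; in
    practice they are the network's weight tensors."""
    return [coeff * t + (1.0 - coeff) * o for t, o in zip(target, online)]
```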
We use two feedforward neural networks to parametrize the soft Q function and the soft value function. Each network has three fully connected hidden layers with 64, 128, and 64 hidden units, respectively, each followed by a ReLU activation. We use the L1 loss and the Adam stochastic optimizer for value function training, with a batch size of 50 and an experience replay buffer. We tested different values of the entropy parameter α and selected one that leads to both stable training and decent performance. The pipeline of our algorithm is summarized by the pseudocode in Algorithm 1.

IV Case study: threat discrimination at a checkpoint
IV-A Problem description
We evaluate our algorithm via a simple threat discrimination scenario at a checkpoint. In this scenario, an autonomous agent wants to identify if an oncoming opponent is a civilian or an adversary.
States: The state consists of the fully observable physical state, namely the distance d of the opponent from the checkpoint, and the binary hidden state z of the opponent indicating civilian or adversary.
Actions and observations: At each time instance, the autonomous agent takes one of three possible actions: (1) send a hand signal, (2) use a loudspeaker, or (3) use a flare bang. The opponent has two possible reactions at each time instance: (1) stay at the same place, or (2) continue proceeding toward the checkpoint.
State transition: None of the three actions of the autonomous agent has a direct effect on the opponent state, but the opponent's probabilistic responses to these actions differ. If the opponent takes the first action (stay), its distance from the checkpoint does not change; if it takes the second action (proceed), the distance decreases by one unit:
(11) d_{t+1} = d_t if a^o_t = stay; d_{t+1} = d_t − 1 if a^o_t = proceed
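This transition can be sketched in one function; clamping the distance at zero is an added assumption to keep it non-negative.

```python
def step_distance(d, a_opp):
    """Distance update of Eq. (11): the opponent either stays (distance
    unchanged) or proceeds one unit toward the checkpoint.  Clamping at
    zero is an assumption not stated in the text."""
    if a_opp == "stay":
        return d
    return max(d - 1, 0)
```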
State-dependent reward: The state-dependent reward for the autonomous agent (the state-dependent term in (6)) is shown in Table II. Any action taken upon a civilian is penalized, and aggressive actions (such as the flare bang) are penalized more heavily than conservative actions (such as the hand signal).
Initial and terminal conditions: Initially, the opponent is 12 unit distances away from the checkpoint. The autonomous agent and the opponent take turns acting for 10 rounds, with the autonomous agent acting first. The interaction terminates after the 10th round.
Opponent agent model:
Civilian: A civilian behaves reactively to the actions taken by the autonomous agent, according to the probabilities shown in Table II.
Adversary: The primary goal of an adversarial opponent is to get close to the checkpoint as quickly as possible in order to conduct a malicious attack. We use the following dense reward for the adversary, which increases as the distance from the checkpoint decreases:
(12) 
The adversary policy is then determined by (2), (3), and (12) together.
IV-B Baseline
We compare our algorithm with a planning-based baseline, CC-POMDP. In this framework, the observation probabilities of the POMDP model are assumed to be drawn from a probability distribution. The optimal policy is computed such that the corresponding value function is guaranteed to exceed a maximized threshold with a specified (high) probability. Formally, the CC-POMDP optimal value function can be found by the iteration
(13) 
where the threshold is obtained from the chance-constrained optimization problem
(14) 
where k is the iteration index; the modeled observation probability is drawn from its assumed distribution; the updated belief is obtained from the current belief state after taking an action and observing the opponent's reaction; and the confidence bound typically takes a small value. We choose the confidence bound that results in the best performance.
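The chance constraint can be approximated by sampling the uncertain observation model and taking a quantile of the resulting value estimates; this Monte Carlo sketch is an illustrative approximation, not the paper's exact solver, and the parameter name `delta` is an assumption.

```python
import numpy as np

def chance_constrained_threshold(sampled_values, delta=0.05):
    """Monte Carlo sketch of the chance constraint in (14): given value
    estimates sampled under the uncertain observation model, return the
    largest threshold that the value exceeds with probability at least
    1 - delta, i.e. the delta-quantile of the samples."""
    return float(np.quantile(np.asarray(sampled_values, dtype=float), delta))
```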
Action/Reaction  Stay  Proceed
Hand             0.60  0.40
Loudspeaker      0.75  0.25
Flare            0.90  0.10

Action/State  Civilian  Adversary
Hand          0.1       0
Loudspeaker   0.3       0
Flare         0.7       0
In order to solve the chance-constrained optimization, the probability distribution over the observation probability parameters needs to be specified. We model these probability parameters as uncorrelated Gaussian random variables, with the nominal probability parameters determined by Eqs. (1) (with Table II), (2), (3), and (12) as their mean values, and a fixed variance.

IV-C Evaluation criteria
In order to evaluate the robustness of the active perception policy against unmodeled adversarial behaviors, we test its performance against two adversary models that were not present in the training phase:
Learning adversary: For the first testing adversary model, we apply the soft-Q learning algorithm to train an adversary policy with the following adversary reward:

(15)

where the self-interest parameter controls the degree of self-interest, and its inverse can be interpreted as the degree of adversariality. We evaluate a range of values of this parameter with a fixed step size. We favor the adversary by granting it access to the internal belief state of the autonomous agent. The adversary policy network takes as input this belief state, the immediate action taken by the autonomous agent, and the observable distance from the checkpoint, and outputs a probability distribution over its two available actions. We define the following two criteria to evaluate the performance of the active perception algorithms:

- Accumulated active perception loss;
- True positive rate (TPR).
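The TPR criterion can be computed over a batch of evaluation episodes as follows; the final-belief thresholding rule is an illustrative assumption about how an episode-level classification is extracted.

```python
def true_positive_rate(final_beliefs, labels, threshold=0.5):
    """TPR over a set of evaluation episodes: the fraction of true
    adversaries (label == 1) whose final belief exceeds the decision
    threshold.  The thresholding rule is an assumption."""
    tp = sum(1 for b, y in zip(final_beliefs, labels)
             if y == 1 and b > threshold)
    positives = sum(1 for y in labels if y == 1)
    return tp / positives if positives else float("nan")
```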
Deceptive adversary: For the second testing adversary model, we handcraft a deceptive strategy. A deceptive adversary uses the following policy when the autonomous agent takes the hand or loudspeaker action:

(16)

Only when the autonomous agent takes the flare action does the adversary take the proceed action. A free parameter controls how deceptive the adversary is; we evaluate a range of its values with a fixed step size.
Since these two adversary models are not present in the learning phase, the active perception agent has to identify the hidden intent while remaining robust to this model uncertainty, which is challenging.
IV-D Results
Against a learning adversary: We present the performance criteria against a learning adversary in Figs. 2(a) and 2(b); the results are averaged over 10 learning simulations. Our algorithm outperforms the CC-POMDP baseline over the whole range of the self-interest parameter, though the margin is small. We attribute this to the ensemble modeling used in our algorithm, which yields better robustness against model uncertainty; planning against an ensemble model is difficult to formulate in the POMDP framework.
Against a deceptive adversary: We present the TPR against a deceptive adversary in Fig. 2(c). The deterministic CC-POMDP policy achieves a significantly lower TPR than the stochastic soft-Q learning policy. Because the CC-POMDP policy deterministically selects the hand action when the belief is below some threshold, selects the loudspeaker otherwise, and never selects the flare, the deceptive adversary can manipulate the belief state by exploiting this deterministic structure. In contrast, since the soft-Q learning agent uses a maximum entropy stochastic policy, the adversary's manipulation strategy becomes far less effective.
We anticipate that the difference in performance between our algorithm and CC-POMDP is correlated with the adversary model uncertainty, as illustrated in Fig. 2. We provide some evidence for this speculation in Fig. 4, using a policy uncertainty index as a metric to quantify the deviation of the actual adversary policy from the assumed one:

(17)

where an empirical estimate of the adversary policy, computed from the data tuples collected in simulations, is compared against the nominal adversary policy defined in Eq. (2); the estimate is built from the number of times each tuple appears in the data, normalized by the total number of data tuples appearing in the simulations.

Fig. 4 shows that, generally speaking, a large model uncertainty corresponds to a large difference in performance. This observation explains the small difference between our algorithm and CC-POMDP against the learning adversary in Figs. 2(a) and 2(b), and the large difference against the deceptive adversary in Fig. 2(c). It also suggests that our algorithm has the desired property illustrated in Fig. 2, offering a better trade-off between nominal and off-nominal performance than both POMDP and Nash Equilibrium approaches.
V Conclusion
In this work, we posed an active perception problem against an opponent with uncertain intent and potentially adversarial behaviors. We reviewed related fields of research and pointed out the gap between existing approaches and the desired solution properties. We then presented a novel solution combining generative adversary modeling, belief space planning, and maximum entropy deep reinforcement learning. Compared with a CC-POMDP baseline, the proposed algorithm is more robust to unmodeled adversarial strategies. One limitation of this work is that we still need to specify an opponent behavior model, which can be non-trivial in complicated applications. To address this limitation, we are developing algorithms that learn a reasonable opponent model through self-play.
Acknowledgment
The authors want to thank Dr. Kasra Khosoussi for his insightful discussions.
References
 [1] D. Avrahami-Zilberbrand and G. A. Kaminka, “Keyhole adversarial plan recognition for recognition of suspicious and anomalous behavior,” Plan, Activity, and Intent Recognition, pp. 87–121, 2014.
 [2] L. Li, D.-Y. Wang, Y. Wang, and W.-X. Gu, “An approach to the misleading action solving in plan recognition,” in Machine Learning and Cybernetics (ICMLC), 2012 International Conference on, vol. 4. IEEE, 2012, pp. 1285–1289.
 [3] N. Le Guillarme, A.-I. Mouaddib, X. Lerouvreur, and S. Gatepaille, “A generative game-theoretic framework for adversarial plan recognition,” in 10es Journées Francophones sur la Planification, la Décision et l’Apprentissage (JFPDA 2015), 2015.

 [4] N. Le Guillarme, A.-I. Mouaddib, S. Gatepaille, and A. Bellenger, “Adversarial intention recognition as inverse game-theoretic planning for threat assessment,” in Tools with Artificial Intelligence (ICTAI), 2016 IEEE 28th International Conference on. IEEE, 2016, pp. 698–705.
 [5] S. Ang, H. Chan, A. X. Jiang, and W. Yeoh, “Game-theoretic goal recognition models with applications to security domains,” in International Conference on Decision and Game Theory for Security. Springer, 2017, pp. 256–272.
 [6] S. Ross, J. Pineau, B. Chaib-draa, and P. Kreitmann, “A Bayesian approach for learning and planning in partially observable Markov decision processes,” Journal of Machine Learning Research, vol. 12, no. May, pp. 1729–1770, 2011.
 [7] S. Katt, F. A. Oliehoek, and C. Amato, “Learning in POMDPs with Monte Carlo tree search,” arXiv preprint arXiv:1806.05631, 2018.
 [8] T. Osogami, “Robust partially observable Markov decision process,” in International Conference on Machine Learning, 2015, pp. 106–115.
 [9] P. Santana, S. Thiébaux, and B. Williams, “RAO*: An algorithm for chance-constrained POMDPs,” in Proc. AAAI Conference on Artificial Intelligence, 2016.
 [10] P. J. Gmytrasiewicz and P. Doshi, “Interactive POMDPs: Properties and preliminary results,” in Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3. IEEE Computer Society, 2004, pp. 1374–1375.
 [11] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione, “Regret minimization in games with incomplete information,” in Advances in neural information processing systems, 2008, pp. 1729–1736.
 [12] J. Heinrich, M. Lanctot, and D. Silver, “Fictitious self-play in extensive-form games,” in International Conference on Machine Learning, 2015, pp. 805–813.
 [13] P. H. Jin, S. Levine, and K. Keutzer, “Regret minimization for partially observable deep reinforcement learning,” arXiv preprint arXiv:1710.11424, 2017.
 [14] J. Heinrich and D. Silver, “Deep reinforcement learning from self-play in imperfect-information games,” arXiv preprint arXiv:1603.01121, 2016.
 [15] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
 [16] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
 [17] M. Hausknecht and P. Stone, “Deep recurrent Qlearning for partially observable MDPs,” CoRR, abs/1507.06527, vol. 7, no. 1, 2015.

 [18] P. Karkus, D. Hsu, and W. S. Lee, “QMDP-net: Deep learning for planning under partial observability,” in Advances in Neural Information Processing Systems, 2017, pp. 4694–4704.
 [19] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
 [20] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent RL under partial observability,” arXiv preprint arXiv:1703.06182, 2017.
 [21] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang, “Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games,” arXiv preprint arXiv:1703.10069, 2017.
 [22] L. Busoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” Innovations in Multi-Agent Systems and Applications-1, vol. 310, pp. 183–221, 2010.
 [23] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, “Stabilising experience replay for deep multi-agent reinforcement learning,” arXiv preprint arXiv:1702.08887, 2017.
 [24] A. Marinescu, I. Dusparic, A. Taylor, V. Cahill, and S. Clarke, “Decentralised multi-agent reinforcement learning for dynamic and uncertain environments,” arXiv preprint arXiv:1409.4561, 2014.
 [25] H. He, J. BoydGraber, K. Kwok, and H. Daumé III, “Opponent modeling in deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1804–1813.
 [26] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energybased policies,” arXiv preprint arXiv:1702.08165, 2017.