Active Perception in Adversarial Scenarios using Maximum Entropy Deep Reinforcement Learning

02/14/2019 ∙ by Macheng Shen, et al. ∙ MIT

We pose an active perception problem where an autonomous agent actively interacts with a second agent with potentially adversarial behaviors. Given the uncertainty in the intent of the other agent, the objective is to collect further evidence to help discriminate potential threats. The main technical challenges are the partial observability of the agent intent, the adversary modeling, and the corresponding uncertainty modeling. Note that an adversary agent may act to mislead the autonomous agent by using a deceptive strategy that is learned from past experiences. We propose an approach that combines belief space planning, generative adversary modeling, and maximum entropy reinforcement learning to obtain a stochastic belief space policy. By accounting for various adversarial behaviors in the simulation framework and minimizing the predictability of the autonomous agent's action, the resulting policy is more robust to unmodeled adversarial strategies. This improved robustness is empirically shown against an adversary that adapts to and exploits the autonomous agent's policy when compared with a standard Chance-Constraint Partially Observable Markov Decision Process robust approach.




I Introduction

We are interested in an active perception problem, where an autonomous agent interacts with a second agent (which we refer to as the opponent in the following context), whose intention is unknown. The objective is evidence accumulation to help identify the intent of the opponent. In order to achieve this goal while ensuring safety, the autonomous agent has to reason about the possible reactions of the opponent in response to its exploring actions and maximize the information gained from this interaction. This type of problem can find applications in various domains such as urban security and humanitarian assistance.

One field of research closely related to our work is Threat Assessment (TA). One approach to TA is Adversarial Intention Recognition (AIR), i.e., recognizing the intentions of a potential adversary based on the observations of its actions and states. Early works in AIR rely on a library of adversarial plans [1], [2], explicitly encoding a set of expected adversarial behaviors. This approach could suffer from incompleteness of the behavior library. More recent advancements in AIR use a two-phase approach combining generative plan recognition and game theoretic planning [3], [4], [5]. In the first phase, a probability distribution over the potential intentions of the adversary is inferred by solving an inverse planning problem given an agent action model and a set of possible intentions. In the second phase, a set of stochastic games, each corresponding to one specific hostile intention, are solved, and the Nash Equilibrium policy is obtained as the best response to the adversary. This existing framework, however, does not account for adversarial strategies that actively work against intention recognition, e.g., deceptive behaviors.

Fig. 1: Illustration of the active perception challenge: The autonomous agent wants to infer the intention of the opponent, while accounting for its possible deceptive behaviors.

Our active perception problem differs from the AIR problem in that: 1) the opponent is not necessarily adversarial, but could be self-interested; and 2) the primary objective is discriminating potential threats through evidence accumulation rather than defending against adversarial attacks. Despite these differences, the two problems share the same challenge of reasoning about the hidden intention (partial observability) of the potential adversary subject to modeling uncertainty in the opponent behavior, as illustrated in Fig. 1. Therefore, we expect that the techniques in our approach are also applicable to AIR problems.

Belief space planning provides a principled framework to perform optimally in a partially observable world. Generative adversary modeling provides a variety of adversarial behaviors against which the autonomous agent can optimize its policy so that robustness is gained. This also avoids the need for domain experts and the difficulty of handcrafting a library of adversary behaviors. Reinforcement learning makes it possible to learn a policy from a black-box simulator that represents a complicated mixture of behaviors, which could be difficult in planning-based approaches. The maximum entropy framework minimizes the exploitability of the resulting policy by minimizing the predictability of the actions, thus making it more robust to un-modeled adversarial strategies.

To summarize, the contribution here is the development of a scalable robust active perception method in scenarios where a potential adversary opponent could be actively hostile to the intent recognition activity, which extends and outperforms the POMDP methods.

II Related works

We review three related fields of research:

  1. POMDP: finds a deterministic optimal policy in a partially observable scenario given a reasonably accurate agent model;

  2. Game theory: finds robust Nash Equilibrium policies given a payoff (reward) profile encoding the preference (dictated by the intent) of the agents, without the requirement of an agent model;

  3. Deep reinforcement learning: enables learning an optimal policy from a simulator, the building block that our work is based on.

II-A POMDP frameworks

The POMDP framework provides a principled approach to behave optimally in a partially observable environment. One restriction of this framework is that it assumes a fixed reactive probabilistic model of the opponent, implying stationary behavior without rationality. To mitigate performance degradation due to modeling uncertainty, existing approaches include Bayesian-Adaptive POMDP (BA-POMDP) [6], [7], robust POMDP [8], Chance-constrained POMDP (CC-POMDP) [9], and Interactive-POMDP (I-POMDP) [10].

BA-POMDP augments the state space with state transition count and state observation count variables as additional hidden states [6], [7]. It maintains a belief over the augmented state space, resulting in an optimal trade-off between model learning and reward collecting. An implicit assumption of this approach is that the unknown POMDP model is either fixed or varying more slowly than the model learning process, which is unlikely to hold in active perception, where an adversarial opponent might be learning and adapting too.

An alternative approach is to find a robust policy, which does not assume a fixed model. Robust POMDP assumes the true transition and observation probability belongs to a bounded uncertainty set, and optimizes the policy for the worst case [8].

CC-POMDP finds an optimal deterministic policy that satisfies chance constraints. This formulation results in a better trade-off between robustness and nominal performance than Robust POMDP. While this formulation shows promising results in challenging risk-sensitive applications [9], we argue that the class of uncertainty considered in the CC-POMDP formulation might be too restrictive for our application. We expect that a framework that uses a wider class of adversary models to find a stochastic optimal policy would exhibit better robustness against an adversary.

I-POMDP extends the POMDP framework by augmenting the hidden space of each agent with a type attribute. The type attribute of an agent includes its preference and belief. The agents reason about the types of the other agents, resulting in nested beliefs. The issue with this approach is that the type space is so large that inference becomes computationally intractable [10]. In practice, a finite set of possible models for each agent is assumed, which makes it suffer from the difficulty of model incompleteness.

II-B Game theoretic frameworks

The POMDP framework simplifies the active perception problem into a single-agent planning problem. An alternative multi-agent view of this problem is a Bayesian game, where the autonomous agent is unsure about the identity of the opponent. In such a problem, Bayesian Nash Equilibrium strategies are the optimal solutions. Regret Minimization (RM) [11] and Fictitious Play (FP) [12] are two algorithmic frameworks for finding Nash Equilibria. Recent extensions of these two frameworks with deep neural networks achieved success in imperfect information zero-sum games [13], [14]. One significant advantage of this framework over POMDP is that no agent model is explicitly required. Instead, the opponent behavior is implicitly specified by the joint payoff (reward), assuming rationality. The Nash Equilibrium policy is a robust solution in the sense that it is the best response to a perfect opponent. In reality, however, the opponent could be of bounded rationality and reasoning capability, which is non-trivial to model in this game theoretic framework. Moreover, we often have a strong prior over the probable opponent behaviors, while the game theoretic approach completely neglects this knowledge. We expect that algorithms exploiting a reasonable opponent model can outperform the Nash Equilibrium in both the nominal and mildly off-nominal cases. Finally, convergence of RM and FP in general-sum two-player games is not established, making it difficult to apply these techniques to our active perception problem, since the opponent is not necessarily adversarial, but could be indifferent (e.g., civilian).

Based on the above discussions, we anticipate a correlation between model uncertainty and the performance of different classes of algorithms, as illustrated in Fig. 2. A desired solution should be near optimal given a good model, and degrade gracefully with increasing model uncertainty.

Fig. 2: Anticipated correlation between model uncertainty and performances of different active perception algorithms (some evidence shown in Fig. 4). While POMDP is optimal given an accurate model, it might be sensitive to model uncertainty. Nash Equilibrium should be robust to model uncertainty, but it might sacrifice too much nominal performance when a good model is available.

II-C Deep reinforcement learning

Deep reinforcement learning has led to several recent breakthroughs in solving difficult problems in both MDP [15], [16] and POMDP [17], [18] domains, and in both single and multi-agent [19], [20], [21] domains. The difficulty of deep reinforcement learning in multi-agent domains stems from the non-stationarity of the perceived environment dynamics due to the learning process of other agents. This non-stationarity destabilizes value function based methods and causes high variance for policy based methods [22]. Many recent works focus on developing learning algorithms that converge in this multi-agent setting [19], [23], [24], while little work has been done on agent modeling [25]. We argue that it is crucial to model the opponent in our active perception problem, because otherwise it becomes challenging, if not impossible, to define exploring behavior. Another benefit of maintaining a model is that the action observation history can be compactly summarized into a belief state, which retains the Markovian property even in partially observable settings. This Markovian property is crucial to the convergence of many reinforcement learning algorithms.

III Approach

In this section, we describe our algorithm to address the perceived challenges, i.e., active perception robust to un-modeled adversarial strategies. We first formalize the problem description and list the assumptions we make. We then give a general description of our algorithm followed by the implementation detail.

We model the active perception problem as a planning problem defined by a tuple with the following elements: the state of the world, consisting of the set of observable states and the set of partially observable states; the set of actions of the autonomous agent; the set of actions of the opponent; the transition probability, which maps into the space of probability distributions over the state space; the observation probability; the reward function; the prior probability of the opponent being an adversary; and the discount factor. We further assume that, regardless of its intention, the opponent has the same set of observable actions. Otherwise, an intention would be easily identifiable once an action unique to that intention is observed.

We make the following modeling assumptions:

  1. The opponent is either a civilian or an adversary with hostile intents.

  2. A civilian opponent is self-interested, and its behavior can be modeled by a reactive policy.

  3. A hostile opponent is primarily goal-directed, which is defined by a known MDP.

  4. A hostile opponent is of bounded rationality, implying that it might not be able to always take the optimal action; moreover, it is likely to behave deceptively in order to achieve its goal.

We also assume that a reasonably accurate civilian behavior model is available. We then generate a parametric set of hostile models with two parameters representing the level of rationality and the level of deception, respectively. We use a feed-forward neural network (NN) to represent the policy of the autonomous agent. This NN takes a binary belief state as its input, which is obtained by Bayesian filtering of the hidden intention based on an average model. The output is a stochastic policy. The reward function is composed of a belief dependent reward, which encourages exploring behavior, and a state dependent reward, which ensures safety. In order to minimize the exploitability, we apply the soft-Q learning algorithm [26], which learns a maximum entropy policy.

We present the details of agent modeling, belief space reward, and policy learning in the following subsections.

III-A Opponent modeling

We use a binary variable to denote whether the opponent is a civilian or an adversary with hostile intents. Depending on this variable, the opponent is expected to exhibit different behaviors, which are fully described by an opponent policy. This model is restrictive since the action probability depends only on the current state. Nonetheless, we use this model only for policy learning, and use a general history dependent opponent policy for the evaluation of the learned autonomous agent policy. Another implicit assumption of this model is that the opponent has full observability over the states. This assumption could be relaxed by modeling the opponent as a POMDP agent.

Civilian model: If the opponent is a civilian, we assume that a simple reactive policy is available to model the opponent; this civilian policy is referred to as (1).


Adversary model: We model an adversarial agent's policy by (2), which involves the Kullback–Leibler divergence between two distributions. The goal-achieving policy is associated with the optimal Q function of a goal-achieving adversary MDP defined later. The temperature parameter in (3) represents the level of rationality of the adversary. The other parameter indicates the level of deception. A partition function normalizes the resulting distribution.
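As a rough illustration of how a temperature parameter can encode the level of rationality, the sketch below computes a Boltzmann (softmax) policy from Q values. The function name `boltzmann_policy`, the sample Q values, and the convention that the temperature divides the Q values are assumptions made for illustration; the exact form of (3) may differ.

```python
import math

def boltzmann_policy(q_values, temperature):
    """Return action probabilities proportional to exp(Q / temperature).

    Assumption: the temperature divides the Q values, so a small temperature
    gives a nearly greedy (highly rational) policy and a large temperature
    gives a nearly uniform (weakly rational) policy.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    partition = sum(weights)  # normalizing constant
    return [w / partition for w in weights]

# Hypothetical Q values for the two opponent actions (stay, proceed).
q_example = [0.2, 1.0]
nearly_greedy = boltzmann_policy(q_example, temperature=0.05)
nearly_uniform = boltzmann_policy(q_example, temperature=50.0)
```

Under this assumed convention, sweeping the temperature produces a family of adversaries ranging from almost deterministic goal pursuit to almost random behavior.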

We assume that the goal-directed behavior can be modeled by a known adversary MDP, whose state space is the state space of the active perception problem augmented by the action space of the autonomous agent. This implies that if the autonomous agent takes different actions, the opponent will be in different MDP states even though the world state is the same, which allows the opponent to respond differently. The action space of the adversary MDP is the same as the opponent's action space in the active perception problem, and so is the transition probability. The reward function of the adversary MDP is different from that of the active perception problem, and its discount factor could also be different.

The interpretation of (2) and (3) is as follows: the adversary policy balances the goal-achieving actions corresponding to the adversary MDP (first term in (2)) against deceptive actions that imitate the civilian policy (second term in (2)). By varying the two hyper-parameters, the level of rationality and the level of deception, we obtain a set of policies describing a variety of adversary behaviors, which is expected to make the optimized active perception policy more robust. We assume a uniform hyper-prior (4) over these two hyper-parameters.


III-B Belief space reward

We maintain a belief over the hidden variable by Bayesian filtering. This requires models for both the civilian and the adversary. The civilian model is given by Eq. 1. As the adversary model includes two continuous parameters, inference over the joint space of the hidden variable and these parameters is expensive and might not be helpful, because this handcrafted model may not match the real adversary behavior. Instead, we use an average model (5), obtained by marginalizing out the hyper-parameters.

With (1) and (5), the Bayesian update rule for the belief is well defined. We define a hybrid belief-state dependent reward (6) to balance exploration and safety.


In (6), the belief is the probability that the opponent is an adversary, and the reward combines a negative entropy term on this belief with a state dependent reward. This reward balances exploration behavior and safety. The negative entropy reward can be interpreted as maximizing the expected logarithm of the true positive rate (TPR) and true negative rate (TNR). The state dependent reward is used to ensure safety. For instance, some actions could be dangerous to civilians and are discouraged by a large negative reward.
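The belief update and the hybrid reward described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: the civilian likelihoods are the Table I values from the case study, while the averaged adversary likelihoods in `ADVERSARY_AVG`, the prior of 0.5, and the purely additive combination in `hybrid_reward` are hypothetical choices, not values or forms confirmed by (5) and (6).

```python
import math

# Civilian reaction probabilities P(reaction | action), taken from Table I.
CIVILIAN = {
    "hand": {"stay": 0.60, "proceed": 0.40},
    "loudspeaker": {"stay": 0.75, "proceed": 0.25},
    "flare": {"stay": 0.90, "proceed": 0.10},
}

# Hypothetical averaged adversary model; these numbers are NOT from the paper.
ADVERSARY_AVG = {
    "hand": {"stay": 0.30, "proceed": 0.70},
    "loudspeaker": {"stay": 0.40, "proceed": 0.60},
    "flare": {"stay": 0.50, "proceed": 0.50},
}

def update_belief(belief_adversary, action, reaction):
    """One Bayesian filtering step for the binary hidden intent."""
    like_adv = ADVERSARY_AVG[action][reaction]
    like_civ = CIVILIAN[action][reaction]
    numerator = like_adv * belief_adversary
    evidence = numerator + like_civ * (1.0 - belief_adversary)
    return numerator / evidence

def binary_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli(p) belief."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def hybrid_reward(belief_adversary, state_reward):
    """Negative belief entropy plus the state dependent reward (assumed additive)."""
    return -binary_entropy(belief_adversary) + state_reward

prior = 0.5  # assumed prior probability of an adversary
belief_after_proceed = update_belief(prior, "hand", "proceed")
belief_after_stay = update_belief(prior, "hand", "stay")
reward_after_proceed = hybrid_reward(belief_after_proceed, state_reward=0.0)
```

With these assumed likelihoods, observing proceed after a hand signal raises the belief that the opponent is an adversary, and the entropy term rewards the agent for the resulting reduction in uncertainty.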

III-C Policy learning

We use soft-Q learning [26] to learn a stochastic belief space policy. The soft-Q learning objective (7) is to maximize the expected reward regularized by the entropy of the policy.

The entropy parameter in (7) controls the 'softness' of the policy. This objective can be interpreted as maximizing the accumulated reward while behaving as unpredictably as possible, which is a desirable property against an adversary.

This maximum entropy problem is solved using soft-Q iteration. For a discrete action space, the fixed point iteration (8) converges to the optimal soft Q function and soft value function [26], and the optimal policy (9) can be obtained from them.
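For reference, a tabular version of soft-Q iteration in the standard form used by soft-Q learning [26] is sketched below: Q(s, a) is set to r(s, a) + gamma * E[V(s')], V(s) to alpha * log sum_a exp(Q(s, a) / alpha), and the policy to exp((Q(s, a) - V(s)) / alpha). The sketch works on a small made-up MDP over plain states rather than belief states, and the names `soft_value`, `soft_q_iteration`, `R`, and `P` are hypothetical.

```python
import math

def soft_value(q_row, alpha):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))

def soft_q_iteration(rewards, transitions, gamma, alpha, sweeps=500):
    """Tabular soft-Q iteration on a small discrete MDP.

    rewards[s][a] is the immediate reward and transitions[s][a] is a list of
    next-state probabilities. Both are hypothetical inputs for illustration.
    """
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        v = [soft_value(row, alpha) for row in q]
        q = [[rewards[s][a]
              + gamma * sum(p * v[s2] for s2, p in enumerate(transitions[s][a]))
              for a in range(n_actions)]
             for s in range(n_states)]
    v = [soft_value(row, alpha) for row in q]
    # Maximum entropy policy: pi(a | s) = exp((Q(s, a) - V(s)) / alpha).
    policy = [[math.exp((q[s][a] - v[s]) / alpha) for a in range(n_actions)]
              for s in range(n_states)]
    return q, v, policy

# A made-up two-state, two-action problem: action 0 leads to state 0,
# action 1 leads to state 1.
R = [[1.0, 0.0], [0.0, 0.5]]
P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
Q, V, PI = soft_q_iteration(R, P, gamma=0.9, alpha=0.5)
```

Every action keeps a nonzero probability under the resulting policy, which is the property that limits how predictable the agent is to an adversary.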


III-D Implementation detail

In order to stabilize the training of the soft value functions, a separate target value network is used, whose parameters are an exponential moving average of those of the value network. During training, the value on the right hand side of (8) is replaced by the value of the target value function.
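The exponential moving average used for the target network can be sketched in a few lines. The function name `ema_update`, the flat parameter lists, and the coefficient `tau=0.5` are assumptions chosen to keep the example readable; they are not the settings used in the experiments.

```python
def ema_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

target = [0.0, 0.0]
online = [1.0, -2.0]  # stand-ins for the value network parameters
for _ in range(3):
    target = ema_update(target, online, tau=0.5)
```

A smaller coefficient makes the target network change more slowly, which is what stabilizes the bootstrapped value on the right hand side of (8).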

We use two feedforward neural networks to parametrize the soft Q function and the soft value function. Each neural network has three fully connected hidden layers with 64, 128, and 64 hidden units, respectively, followed by the ReLU nonlinear activation function. We use the L1 loss and the Adam stochastic optimizer for the value function training, with a batch size of 50 and an experience replay buffer. We tested different values of the entropy parameter and selected the one that leads to both stable training and decent performance. The pipeline of our algorithm is summarized by the pseudo code in Algorithm 1.

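  initialize the experience replay buffer as an empty list
  initialize the soft Q networks
  for each training episode do
     if the opponent is a civilian then
        use the civilian policy (1)
     else
        sample the adversary hyper-parameters from the hyper-prior (4)
        use the adversary policy given by the RHS of (2)
     end if
  end for
Algorithm 1 Soft-Q learning for active perception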

IV Case study: threat discrimination at a checkpoint

IV-A Problem description

We evaluate our algorithm via a simple threat discrimination scenario at a checkpoint. In this scenario, an autonomous agent wants to identify if an oncoming opponent is a civilian or an adversary.

States: The states consist of the fully observable physical state, i.e., the distance of the opponent from the checkpoint, and the binary hidden state of the opponent indicating civilian or adversary.

Actions and observations: At each time instance, the autonomous agent takes one of three possible actions: (1) send a hand signal, (2) use a loudspeaker, (3) use a flare bang. The opponent has two possible reactions at each time instance: (1) stay at the same place, (2) continue proceeding towards the checkpoint.

State transition: None of the three actions of the autonomous agent has a direct effect on the opponent state, but the opponent's responses to these actions differ in a probabilistic sense. If the opponent takes the first action (stay), its distance from the checkpoint does not change; if the opponent takes the second action (proceed), the distance decreases by one unit.


State dependent reward: The state dependent reward for the autonomous agent (the state dependent term in (6)) is shown in Table II. Any action taken upon a civilian is penalized. Aggressive actions (such as flare bang) are penalized more heavily than conservative actions (such as hand signal).

Initial and terminal conditions: Initially, the opponent is 12 unit distances away from the checkpoint. The autonomous agent and the opponent take turns making their actions for 10 rounds, with the autonomous agent taking its action first. The interaction terminates at the 10th round.
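The transition rule and the initial and terminal conditions above can be sketched as a small helper. The names `next_distance` and `rollout` and the string labels for the reactions are hypothetical; only the starting distance of 12, the unit decrement, and the 10 rounds come from the description.

```python
INITIAL_DISTANCE = 12  # unit distances from the checkpoint
NUM_ROUNDS = 10

def next_distance(distance, reaction):
    """'stay' keeps the distance; 'proceed' reduces it by one unit."""
    if reaction == "stay":
        return distance
    if reaction == "proceed":
        return distance - 1
    raise ValueError(f"unknown reaction: {reaction}")

def rollout(reactions):
    """Apply one opponent reaction per round and return the visited distances."""
    if len(reactions) != NUM_ROUNDS:
        raise ValueError("expected one reaction per round")
    distances = [INITIAL_DISTANCE]
    for reaction in reactions:
        distances.append(next_distance(distances[-1], reaction))
    return distances

always_proceed = rollout(["proceed"] * NUM_ROUNDS)
mixed = rollout(["stay", "proceed"] * 5)
```

Because the horizon is 10 rounds and the opponent starts 12 units away, even an opponent that always proceeds ends the interaction 2 units short of the checkpoint.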

Opponent agent model:
A civilian behaves reactively to the actions taken by the autonomous agent, according to the probabilities shown in Table I.

Adversary: The primary goal of an adversarial opponent is to get close to the checkpoint as quickly as possible to conduct a malicious attack. We use a dense reward (12) for the adversary, which increases as the distance from the checkpoint decreases.


The adversary policy is determined by (2), (3), and (12) together.

IV-B Baseline

We compare our algorithm with a planning based baseline algorithm, CC-POMDP. In this framework, the observation probabilities of the POMDP model are assumed to be drawn from a probability distribution. The optimal policy is computed such that the corresponding value function is guaranteed to be higher than a maximized threshold with a certain (high) probability. Formally, the CC-POMDP optimal value function is found by an iteration in which a chance constraint optimization problem is solved at each step. The quantities involved are the iteration index, the modeled observation probability, the probability distribution over the observation probability, the updated belief obtained from the current belief state after taking an action and observing the reaction, and a confidence bound, which typically takes a small value. We choose the value of the confidence bound that results in the best performance.

TABLE I: Observation probability of a civilian’s reaction
Action/Reaction   Stay   Proceed
Hand              0.60   0.40
Loudspeaker       0.75   0.25
Flare             0.90   0.10

TABLE II: State dependent reward for active perception agent
Action/State   Civilian   Adversary
Hand           -0.1       0
Loudspeaker    -0.3       0
Flare          -0.7       0

In order to solve the chance constraint optimization, the probability distribution over the observation probability parameters needs to be specified. We model these probability parameters as uncorrelated Gaussian random variables with a common variance, using the nominal probability parameters determined by Eqs. 1 (with Table I), 2, 3, and 12 as their mean values.

IV-C Evaluation criterion

Fig. 3: Comparisons of two performance criteria between our proposed algorithm and CC-POMDP: (a) accumulated active perception loss against a learning adversary; (b) true positive rate (TPR) against a learning adversary; (c) TPR against a deceptive adversary. The learning adversary in (a) and (b) is parametrized by the self-interest parameter, and the deceptive adversary in (c) by a deception parameter: a small self-interest parameter indicates a high degree of adversarial behavior, and a small deception parameter indicates a high degree of deception. Low loss and high TPR are desired.

In order to evaluate the robustness of the active perception policy against un-modeled adversarial behaviors, we test the performance on two adversary models that were not present in the training phase:

Learning adversary: In the first testing adversary model, we apply the soft-Q learning algorithm to train an adversary policy with an adversary reward that includes a self-interest parameter. This parameter controls the degree of self-interest, and its inverse can be interpreted as the degree of adversarial behavior. We evaluate a range of values of this parameter with a fixed step size. We favor the adversary by giving it access to the internal belief state of the autonomous agent. The adversary policy network takes this belief state, the immediate action taken by the autonomous agent, and the observable distance from the checkpoint as inputs, and outputs a probability distribution over its two available actions. We define the following two criteria to evaluate the performance of the active perception algorithms:

  1. Accumulated active perception loss

  2. True positive rate (TPR)
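As an illustration of the second criterion, the sketch below computes a true positive rate from a batch of evaluation episodes. It assumes that the agent declares an opponent to be an adversary when its final belief exceeds a threshold; the threshold of 0.5, the function name `true_positive_rate`, and the sample data are hypothetical, and the criterion used in the experiments may be defined differently.

```python
def true_positive_rate(final_beliefs, is_adversary, threshold=0.5):
    """Fraction of adversary episodes whose final belief exceeds the threshold."""
    positives = [b for b, adv in zip(final_beliefs, is_adversary) if adv]
    if not positives:
        raise ValueError("no adversary episodes to evaluate")
    detected = sum(1 for b in positives if b > threshold)
    return detected / len(positives)

# Made-up final beliefs and ground-truth labels for five episodes.
beliefs = [0.9, 0.4, 0.8, 0.2, 0.7]
labels = [True, True, True, False, False]
tpr = true_positive_rate(beliefs, labels)
```

Only the episodes with a true adversary enter the computation, so civilian episodes with a high final belief affect the false positive rate rather than the TPR.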

Deceptive adversary: In the second testing adversary model, we handcraft a deceptive strategy. When the autonomous agent takes the action hand or loudspeaker, the deceptive adversary follows a handcrafted policy with a free parameter that controls how deceptive the adversary is, whereas it takes the action proceed only when the autonomous agent takes the action flare. We evaluate a range of values of this parameter with a fixed step size.

These two adversary models are not present in the learning phase. The active perception agent has to identify the hidden intent while being robust to this model uncertainty, which is challenging.

IV-D Results

Against a learning adversary: We present the performance criteria against a learning adversary in Fig. 3(a) and 3(b). These results are averaged over 10 learning simulations. Our algorithm shows better performance than the CC-POMDP baseline over the whole range of the self-interest parameter, although the difference is small. This is likely attributable to the ensemble modeling used in our algorithm, which results in better robustness against model uncertainty. Planning against an ensemble model, by contrast, is difficult to formulate in the POMDP framework.

Against a deceptive adversary: We present the TPR against a deceptive adversary in Fig. 3(c). The deterministic CC-POMDP policy achieves a significantly lower TPR than the stochastic soft-Q learning policy. As the CC-POMDP policy deterministically selects the action hand when the belief is lower than some threshold, selects loudspeaker otherwise, and never selects flare, the deceptive adversary can manipulate the belief state by exploiting this deterministic structure. In contrast, since the soft-Q learning agent uses a maximum entropy stochastic policy, the adversary’s manipulation strategy becomes less effective.

Fig. 4: Correlation between the adversary model uncertainty and the difference in performance between our algorithm and CC-POMDP: In general, a large policy uncertainty index corresponds to a large difference.

We anticipate that the difference in performance between our algorithm and CC-POMDP is correlated with the adversary model uncertainty, as illustrated in Fig. 2. We provide some evidence to justify this speculation in Fig. 4. We define a policy uncertainty index as one metric to quantify the deviation of the actual adversary policy from the assumed one. The index compares an empirical estimate of the adversary policy, computed from data tuples collected in simulations, with the nominal adversary policy defined in Eq. 2. The empirical estimate is based on the number of times each tuple appears in the data, and the index also depends on the total number of data tuples and the total number of distinct data tuples appearing in simulations.
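The counting step behind an empirical policy estimate can be sketched as follows. This covers only the estimation of conditional reaction frequencies from observed tuples, not the uncertainty index itself; the name `empirical_policy`, the use of the agent action as the conditioning context, and the sample data are assumptions made for illustration.

```python
from collections import Counter

def empirical_policy(samples):
    """Estimate P(reaction | context) from (context, reaction) samples by counting."""
    pair_counts = Counter(samples)
    context_counts = Counter(context for context, _ in samples)
    return {(context, reaction): n / context_counts[context]
            for (context, reaction), n in pair_counts.items()}

# Hypothetical data: the context is the agent action, the reaction is the
# opponent's observed response.
data = [("hand", "stay"), ("hand", "proceed"), ("hand", "proceed"), ("flare", "proceed")]
estimate = empirical_policy(data)
```

In practice the context would include whatever the adversary policy conditions on, and contexts that appear rarely give noisy frequency estimates.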

Fig. 4 shows that, generally speaking, a large model uncertainty corresponds to a large difference in performance. This observation explains the small difference between our algorithm and CC-POMDP against the learning adversary shown in Fig. 3(a) and 3(b), and the large difference against the deceptive adversary shown in Fig. 3(c). It also suggests that our algorithm has the desired property illustrated in Fig. 2, and might offer a better trade-off between nominal and off-nominal performance than both POMDP and Nash Equilibrium.

V Conclusion

In this work, we pose an active perception problem against an opponent with uncertain intent and potential adversarial behaviors. We reviewed related fields of research and pointed out the gap between the existing approaches and the desired solution properties. We then presented a novel solution combining generative adversary modeling, belief space planning, and maximum entropy deep reinforcement learning. Compared with a CC-POMDP baseline, the proposed algorithm is more robust to un-modeled adversarial strategies. One limitation of this work is that we still need to specify an opponent behavior model, which could be non-trivial in complicated applications. We are developing learning algorithms that learn a reasonable opponent model through self-play, to address this limitation.


The authors want to thank Dr. Kasra Khosoussi for his insightful discussions.