I Introduction
In adversarial environments, such as sporting events, gaming, or even defending against a cyber attack, one side could gain an advantage by identifying the opponent’s strategy. The value of strategy identification lies in the ability it provides to foresee, and thereby counter, the opponent’s future actions. In combat games, for example, if an opponent’s strategy is identified as overly aggressive, one could lay a trap that exploits the opponent’s aggressive nature. However, an opponent’s strategy is not always apparent and may need to be estimated from observations of their actions. This paper proposes to use machine learning, specifically inverse reinforcement learning, to identify strategies in adversarial environments.
Strategic planning is often defined by four concepts. Goals are high-level concepts that define what needs to be accomplished. Objectives are quantitative measures that determine if a goal has been achieved. Strategies are plans for achieving the defined objectives. Tactics are the low-level actions for carrying out a strategy. In the combat gaming example, the goal is to win the engagement. There are numerous objectives that could be associated with achieving this goal, including securing a particular target or minimizing casualties. Similarly, there are many strategies for achieving objectives; for example, being aggressive or defensive. Tactics are the specific sequence of actions carried out by each side in the engagement. The high-level goal (to win) is known to both sides, and the low-level tactics are observable during the engagement. The objectives and the strategies of the opponent are unknown; however, given the hierarchy defined above, they should be correlated with the tactics. Therefore, we hypothesize that the objectives and the strategies can be inferred from the observed tactics of the opponent.
Markov decision processes (MDPs) [13, 18] are often used in artificial systems to model sequential decision making. An MDP is primarily defined by states, actions, rewards, and a state transition function. A policy maps states to actions, and an optimal policy maximizes expected future reward. The components of an MDP map to the four strategic planning concepts previously defined. The goal is to maximize expected future reward. The tactics are the state-action pairs that result from an implemented policy. The objectives and the strategy are represented by the reward function, i.e., the reward function is measurable and the parameters of the reward function define an agent’s behavior.
Reinforcement learning (RL) [20] is a common machine learning technique for learning optimal policies through interaction with an environment. The standard RL paradigm involves an agent selecting an action in a given state and then observing a state transition defined by the transition function and a reward defined by the reward function. By repeating this process and exploring using a probabilistic exploration policy that includes random actions, the agent can eventually learn an optimal policy. On the other hand, inverse reinforcement learning (IRL) [11] estimates a reward function from observations of an expert’s actions in an environment.
Inverse reinforcement learning has developed largely within the field of robotics. A few of the notable advances in robotics include training autonomous acrobatic helicopters [1], robot navigation through crowded rooms [6], and autonomous vehicles [25]. More recently there has been growing interest in applying IRL to human behavior problems such as inferring cultural values [12], modeling human routines [3], and even neuroscience models of human reasoning [4]. Unfortunately, theoretical solutions to the IRL problem remain challenging, requiring either strict mathematical assumptions such as linearity (e.g., [2, 26]) or large resources in terms of sample or computational complexity (e.g., [14, 21]). As such, IRL remains a very active field with considerable potential as the theory continues to advance.
Multiagent IRL estimates reward functions in multiagent settings, and most techniques rely heavily on game-theoretic principles [9, 8]. However, this reliance makes their use impractical when notions of equilibria cannot be defined. At a high level, multiagent IRL seems like a suitable construct for strategy identification in combat scenarios (e.g., each side could be considered an agent, and each side may be composed of multiple agents), but the multiagent IRL construct and algorithms need significant advancement before they can become feasible.
This paper proposes that IRL can be used to recover unique reward functions that are correlated with strategies in adversarial environments such as combat games. Specifically, the contributions of this work are 1) the demonstration of this concept on gaming combat data using three predefined strategies and 2) the framework for using IRL to achieve strategy identification. The IRL algorithm used in this study utilizes kernel methods to realize expressive functions of the state space [17, 5]. IRL has been used to create agent-based models [7], to identify decision agents [15], and to estimate strategy in other domains, including animal behavior [22], table tennis [10], and financial trading [23, 24]. However, this is the first study to apply IRL to estimating strategies for combat games.
Further, IRL has two traits that make the technique well suited for strategy identification. First, the reward function is independent of the environment dynamics, but the policy is dependent upon both the reward function and the environment dynamics. This is analogous to the concepts in strategic planning where the strategy and objectives are most likely independent of low-level concepts like setting but the tactics are not. Second, the estimated reward function can be used for prediction if the dynamics change. Simply estimating the policy in this situation is not sufficient as it may no longer be optimal and therefore may not be directing the agent’s actions. The reward function could be used in conjunction with a model-free learner to estimate a new policy and thus predict future actions of the adversary.
This paper is organized as follows. Section II provides background information on MDPs, RL, and IRL. Section III outlines the IRL algorithms used for strategy identification. Section IV describes the numerical experiments performed to validate the IRL methods. Section V provides our conclusions from the numerical experiments and possible areas of future work.
II Background
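General RL theory rests on Markov decision process (MDP) models. In this paper an MDP is defined as any tuple $(S, A, P, R)$ where $S$ is a set of states, $A$ is a set of actions, $P : S \times A \times S \to [0, 1]$ is a transition probability function satisfying $\sum_{s' \in S} P(s' \mid s, a) = 1$ for all $s \in S$ and $a \in A$, and $R : S \to \mathbb{R}$ is a reward function.
Given an MDP, any function $\pi : S \times A \to [0, 1]$ satisfying $\sum_{a \in A} \pi(a \mid s) = 1$ for all $s \in S$ will be called a policy function. Every $\pi$ possesses a unique value function and action-value function defined as
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t) \,\middle|\, S_0 = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t) \,\middle|\, S_0 = s, A_0 = a\right]$$
where $\gamma \in [0, 1)$ is called a discount factor and $(S_0, A_0, S_1, A_1, \ldots)$ is a random variable defined on $(S \times A)^{\infty}$ and is distributed according to
$$\Pr(S_{t+1} = s', A_{t+1} = a' \mid S_t = s, A_t = a) = P(s' \mid s, a)\, \pi(a' \mid s') \tag{1}$$
The goal of RL algorithms is to learn an optimal policy, $\pi^{*}$, for an MDP where optimality is defined as $V^{\pi^{*}}(s) \geq V^{\pi}(s)$ for all $\pi$ and all $s \in S$. The goal of IRL algorithms is to learn a reward $R$, assuming an MDP where $R$ is not known, for which a given policy $\pi_E$ is optimal. Because the traditional IRL problem statement has known degenerate solutions, additional requirements are often added to yield useful solutions.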
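As a minimal illustration of the value and action-value functions, the sketch below evaluates a fixed policy on a hypothetical two-state, two-action MDP by solving the linear system $V = R + \gamma P_\pi V$. The transition probabilities, rewards, policy, and the choice of a state-based reward are all assumptions made for illustration and are not taken from the combat data.

```python
import numpy as np

# Hypothetical toy MDP (illustrative values only, not from the paper).
# P[a, s, s2] is the probability of moving from state s to s2 under action a.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([0.0, 1.0])        # state-based reward (an assumption)
gamma = 0.9                     # discount factor

# A stochastic policy: pi[s, a] is the probability of action a in state s.
pi = np.array([[0.5, 0.5], [1.0, 0.0]])


def evaluate_policy(P, R, pi, gamma):
    """Return V and Q for a fixed policy by solving V = R + gamma * P_pi V."""
    # State-to-state transition matrix induced by the policy.
    P_pi = np.einsum("sa,ast->st", pi, P)
    V = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
    # Q[s, a] = R[s] + gamma * sum_s2 P[a, s, s2] * V[s2]
    Q = R[:, None] + gamma * np.einsum("ast,t->sa", P, V)
    return V, Q


V, Q = evaluate_policy(P, R, pi, gamma)
```

An optimal policy would additionally be greedy with respect to its own action-value function; the sketch only evaluates a given policy and does not search for one.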
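TABLE I: Summary statistics for the simulated matches, grouped by blue force strategy.

| Strategy | Match Count | Red Count | Blue Count | Match Time (Min / Avg / Max) | Red Deaths (Min / Avg / Max) | Blue Deaths (Min / Avg / Max) |
|---|---|---|---|---|---|---|
| Fallback | 11 | 12 | 11 | 54s / 202s / 468s | 1 / 7.6 / 11 | 1 / 7.1 / 11 |
| Assault | 12 | 12 | 11 | 57s / 192s / 402s | 4 / 9.3 / 11 | 3 / 8.6 / 11 |
| Flank | 13 | 12 | 11 | 54s / 162s / 279s | 1 / 6.8 / 11 | 3 / 8.1 / 11 |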
TABLE II: Definition of the simplified MDP.

| Item | Definition |
|---|---|
| States | A state is defined as the current match time, the (x, y) position and health of all red and blue forces (sans one blue force player), and the position and health of our agent (who replaces the missing blue force player). |
| Actions | There are 9 actions: stay in place or move one step in one of the 8 cardinal/ordinal directions. |
| Transition function | All transitions are deterministic, with our controlled agent moving according to the action selection and all other agents moving to their positions according to the prerecorded data. |
| Reward function | Unknown |
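The action set and deterministic transition described in Table II can be sketched as follows. The dictionary keys (`"t"`, `"agent"`, `"others"`), the unit step size, and the layout of a recorded frame are hypothetical choices for illustration; the agent's health is omitted for brevity.

```python
# The 9 actions: stay in place or one step in a cardinal/ordinal direction.
ACTIONS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]


def step(state, action, recording):
    """Deterministic transition: move our agent, replay everyone else.

    `state` is a dict with hypothetical keys "t" (frame index) and
    "agent" ((x, y) position); `recording` is a list of recorded frames
    holding the other players' positions and health.
    """
    dx, dy = ACTIONS[action]
    x, y = state["agent"]
    # Stop advancing once the end of the recorded match is reached.
    t = min(state["t"] + 1, len(recording) - 1)
    return {"t": t, "agent": (x + dx, y + dy), "others": recording[t]}
```

Because the other players are replayed from the recording, they do not react to the controlled agent, which matches the simplification noted for this MDP.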
III Algorithms
III-A Reward Learning
III-A1 IRL Algorithm
To learn rewards the projection variant of [2] was utilized and extended via kernel-based methods as described in [17]. For our purposes we define a kernel as any function $k : S \times S \to \mathbb{R}$ that induces a positive semidefinite Gram matrix $K$ where $K_{ij} = k(s_i, s_j)$.
Intuitively, kernel methods can be thought of as either a nonlinear extension to linear function approximation techniques or as functions in a space $\mathcal{H}$. A particularly common interpretation for kernels is as similarity measures. This interpretation is most often taken when $k$’s value is in $[0, 1]$.
Mathematically, kernels can be interpreted as an inner product between vectors in $\mathcal{H}$. That is, $k(s, s') = \langle k(s, \cdot), k(s', \cdot) \rangle_{\mathcal{H}}$. This definition also allows for the norm $\|f\|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}$, which is useful for cluster analysis and visualizations.
In the original projection IRL algorithm, which is extended below, the goal is to match the expert's feature expectation given some feature map $\phi$. In the kernel-based extension feature expectation becomes the kernel expectation, defined as
$$\mu_{\pi} = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} k(S_t, \cdot) \,\middle|\, \pi\right] \tag{2}$$
The complete kernel-based algorithm is provided in Algorithm 1. It should be noted that the algorithm as stated creates a set of reward functions whose optimal policies have a convex combination that approximates the expert. For our analysis we required a single reward function and simply chose the $R$ from this set whose optimal policy $\pi$ had the smallest $\|\mu_E - \mu_{\pi}\|_{\mathcal{H}}$.
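For intuition, one update of the projection method from [2] can be sketched with explicit, finite-dimensional feature expectations. This is a simplification: the kernel-based extension would evaluate the same inner products through the kernel rather than through explicit vectors, and the function and variable names here are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def projection_step(mu_expert, mu_bar, mu_new):
    """One update of the projection method with explicit feature vectors.

    mu_bar is the current projected point and mu_new is the feature
    expectation of the newest policy. Returns the updated mu_bar, the
    reward weights w = mu_expert - mu_bar, and the margin t = ||w||.
    """
    d = mu_new - mu_bar
    denom = d @ d
    if denom > 0.0:
        # Orthogonally project mu_expert onto the line through mu_bar and mu_new.
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / denom * d
    w = mu_expert - mu_bar
    return mu_bar, w, float(np.linalg.norm(w))


# Tiny worked example with made-up feature expectations.
mu_bar, w, t = projection_step(
    mu_expert=np.array([1.0, 1.0]),
    mu_bar=np.array([0.0, 0.0]),
    mu_new=np.array([2.0, 0.0]),
)
```

In the full loop, the weights `w` define the next reward, an RL solver returns a policy for that reward, and the step repeats until the margin `t` falls below a tolerance.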
III-A2 IRL Implementation
Our implementation of the above algorithm uses an empirical estimate of the expert's kernel expectation calculated from a sample of observed expert trajectories drawn with probability as defined in Equation 1. Several estimation techniques were tested, and the one that seemed to produce the best rewards (determined via human inspection) was
(3) 
where the look-ahead horizon was chosen arbitrarily to be 20 and the counting function returns the total number of states in a trajectory. Such a formulation means that longer episodes have more weight, since they have more states, and the horizon determines how far into the future to consider when estimating the expert’s reward. Equation 3 was also used when estimating kernel expectations on lines 4 and 10 in Algorithm 1.
The final component that needs to be implemented for a specific algorithm is the kernel $k$. For our kernel we calculated 6 features to describe the present location of our agent: (1) minimum distance to living red, (2) maximum distance to living red, (3) minimum distance to living blue, (4) maximum distance to living blue, (5) minimum cosine similarity between living red and blue, and (6) maximum cosine similarity between living red and blue. These features were scaled so that each feature was in $[0, 1]$ and then a Gaussian kernel was applied.
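A kernel of this form can be sketched as below. The min-max scaling, the bandwidth value, and the random stand-in feature matrix are assumptions for illustration; the text does not state how the features were scaled or which bandwidth was used.

```python
import numpy as np


def scale_features(X):
    """Min-max scale each feature column into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (X - lo) / span


def gaussian_gram(X, bandwidth=0.5):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth**2))


rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 100.0, size=(5, 6))   # 5 hypothetical states, 6 features
K = gaussian_gram(scale_features(raw))
```

Because the Gaussian kernel takes values in (0, 1] and equals 1 only for identical feature vectors, it fits the similarity interpretation of kernels.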
III-B Policy Learning
III-B1 RL Algorithm
The utilized IRL algorithm requires solving for optimal policies given a reward function on each iteration. To satisfy this requirement, an empirical estimate policy iteration method was used. Our approach is an n-step, model-free, Monte Carlo, on-policy Q-function approximation (see Algorithm 2) and was originally described in [17].
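The n-step Monte Carlo targets that such a method regresses onto can be sketched as follows. This is an illustrative helper under the assumption that targets are truncated discounted returns with no bootstrapped value term; it is not a reproduction of Algorithm 2.

```python
def n_step_returns(rewards, n, gamma):
    """Discounted sum of up to n rewards starting at each time step.

    No bootstrapped value estimate is added at the end of the window,
    so each target is a truncated Monte Carlo return.
    """
    targets = []
    for t in range(len(rewards)):
        window = rewards[t:t + n]
        targets.append(sum(gamma**i * r for i, r in enumerate(window)))
    return targets
```

Each (state, action) pair visited at step `t` would be paired with `targets[t]` as a training example for the Q-function regressor.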
Experiments show that our direct iteration method outperforms other well-known RL algorithms (see Figure 1) given relatively small training samples from our MDP. All comparison RL algorithms come from the Stable-Baselines3 project [16] and have been moderately tuned with respect to hyperparameters. Every algorithm was given a budget of 10,000 interactions with the environment before comparing the results.
We believe our improved performance is largely due to the simplicity of our MDP and the small sample sizes used in training rather than an algorithmic advancement, though more analysis is necessary to confirm this. For reference purposes Figure 1 also includes value iteration, which represents the optimal solution. During final analysis value iteration was not used because its memory requirements were intractable on the full problem.
III-B2 RL Implementation
When implementing Algorithm 2, hyperparameter tuning was used to select the values of the algorithm's four parameters that gave the best expected value for random rewards. Of the four parameters, only one resulted in counterintuitive behavior, demonstrating performance degradation if it was too small or too large. For the other three, performance generally increased monotonically with diminishing returns.
Additionally, separate experiments were conducted to evaluate various regression learners for line 13 in Algorithm 2. We tested an SVM kernel regressor, a linear regressor, an AdaBoost regressor, and a decision tree regressor. Of these regressors the decision tree performed best when using the same features as those used to calculate the IRL reward kernel.
Line 7 in Algorithm 2 allowed for exploring starts. Four exploration heuristics were evaluated: (1) random selection, (2) greedy selection, (3) epsilon-greedy selection, and (4) softmax selection. Of these four methods for exploring starts, softmax selection performed best.
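Softmax (Boltzmann) selection can be sketched as below. The temperature parameter and the use of estimated action values as the preference scores are assumptions; the text does not specify how the heuristic was parameterized.

```python
import math
import random


def softmax_probabilities(q_values, temperature=1.0):
    """Boltzmann distribution over actions from estimated action values."""
    m = max(q_values)                      # subtract max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_select(q_values, temperature=1.0, rng=random):
    """Sample an action index with probability proportional to exp(Q / T)."""
    probs = softmax_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A high temperature approaches random selection and a low temperature approaches greedy selection, so softmax interpolates between the first two heuristics in the list.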
Finally, three other variations on the algorithm were tested. First, bootstrapping the update target resulted in decreased performance. Second, using an every-visit MC target (cf. [19]) for updates instead of an n-step target gave a small increase in performance for small state spaces but a large decrease in performance in large state spaces. Third, modifying the algorithm to only use value observations from the most recent policy iteration (i.e., clearing the stored observations between lines 8 and 9 in Algorithm 2) resulted in decreased performance across the board.
IV Numerical Experiments
Using the above algorithms three experiments were performed to explore the strengths and weaknesses of using IRL to analyze strategic behavior. To drive these experiments 36 matches involving combat engagements between two opposing forces (referred to as red and blue) were simulated using a gaming combat simulator. In each match both forces comprised three fireteams of three to four AI-controlled players each. Matches always occurred in an open field, and the amount of initial space separating the forces was varied slightly (see Figure 2 for an example starting position in a match). At a low level players were controlled by the native AI within the combat game. At a high level forces were nudged to follow one of three strategies: assault, flank, or fallback. The red force always assaulted while the blue force assaulted 12 times, flanked 13 times, and fell back 11 times. Summary statistics for the data can be found in Table I.
During the 36 simulated matches the location of every AI player, which side they were on, and their health were recorded every 3 seconds. These observations were stored and used during IRL analysis to construct expert state trajectories. Because the IRL algorithm used kernel methods we needed to define a kernel $k$. In this analysis, $k$ was a similarity kernel that mapped pairs of states to real numbers between 0 and 1, with pairs of similar states mapping closer to 1 and pairs of dissimilar states mapping closer to 0. When determining the similarity of two states, $k$ considered the minimum distances to blue and red forces, the maximum distances to blue and red forces, and the angles of fire between blue and red forces.
In addition, to learn rewards from the above data using the algorithms in this paper the forward RL problem needed to be solved. To facilitate this a simplified MDP simulation was developed where we could take over individual agents and move them at will through any of the 36 prerecorded matches. Our MDP was far simpler than the original simulation in that it only generated data at the same fidelity as the recorded observations and the environment did not respond differently to deviations in our controlled agent’s behavior.
Despite these large simplifications IRL was still able to learn meaningful rewards. The full description of the states, actions, and transition function in the simulated MDP environment is provided in Table II.
Using the 36 recorded data sets from the combat game, the kernel $k$, and the simplified MDP it was then possible both to empirically estimate the kernel expectation for the expert policy (see Equation 2) in each data set and to learn a reward function which generated a policy similar to the expert. Using these components, a labeled data set was constructed for strategy identification experiments. The label was the blue force’s high-level strategy directive, behavior was represented as the empirical kernel expectation for the blue force, and reward was represented as the kernel-based reward function learned via IRL for the blue force.
Three experiments were conducted using the strategy-labeled data set: (1) see if a t-SNE plot would show visible separation of strategies with either the expert kernel expectations or reward functions, (2) see if unsupervised learning techniques would be able to cluster strategies with either expert kernel expectations or reward functions, and (3) see if a classifier could be trained to identify strategy given either an expert kernel expectation or reward function.
To conduct the t-SNE experiment it was necessary to calculate a distance matrix for both the kernel expectations and the reward functions. This was done via the norm induced by $k$. For example, if $f = \sum_i \alpha_i k(s_i, \cdot)$ and $g = \sum_j \beta_j k(s'_j, \cdot)$ then the distance between $f$ and $g$ was calculated as
$$\|f - g\|_{\mathcal{H}} = \sqrt{\sum_{i, i'} \alpha_i \alpha_{i'} k(s_i, s_{i'}) - 2 \sum_{i, j} \alpha_i \beta_j k(s_i, s'_j) + \sum_{j, j'} \beta_j \beta_{j'} k(s'_j, s'_{j'})}$$
The t-SNE plot generated from the distance matrices seemed to show slight visual improvements in strategy separation for reward functions over their corresponding expert kernel expectations (see Figure 3).
While care should be taken when interpreting t-SNE plots, it was interesting to see that when visualizing reward functions flanking strategies were placed in between assault and fallback strategies. This aligned with the data recordings where assaults tended to be direct charges, fallbacks tended to be direct retreats, and flanking strategies tended to be a mix of strategic approaches and fallbacks.
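The kernel-induced distance behind these distance matrices can be sketched as follows for two functions written as weighted sums of kernel evaluations. The Gaussian kernel, its bandwidth, and the function names are assumptions for illustration; any positive semidefinite kernel could be substituted.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=0.5):
    """Pairwise Gaussian kernel between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * bandwidth**2))


def rkhs_distance(alpha, X, beta, Y, kernel=gaussian_kernel):
    """Distance between f = sum_i alpha_i k(x_i, .) and g = sum_j beta_j k(y_j, .)."""
    sq = (alpha @ kernel(X, X) @ alpha
          - 2.0 * alpha @ kernel(X, Y) @ beta
          + beta @ kernel(Y, Y) @ beta)
    return float(np.sqrt(max(sq, 0.0)))   # clamp tiny negative round-off
```

Evaluating `rkhs_distance` for every pair of matches yields the square, symmetric matrix that a t-SNE or clustering routine accepting precomputed distances can consume.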
To conduct the cluster experiments square distance matrices were needed once again. As before, these were calculated using the norm induced by $k$. Using the distance matrices, a hierarchical agglomerative clustering algorithm, and a complete linkage function, two dendrograms were generated (see Figure 4). Clear patterns can be seen in the reward function clusters, where 82% of fallback strategies belong to the second cluster and 80% of flank/assault strategies belong to the third cluster.
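Agglomerative clustering with complete linkage on a precomputed distance matrix can be sketched with the naive implementation below. It is written for clarity rather than speed and returns flat clusters instead of a dendrogram; in practice a library routine such as SciPy's hierarchical clustering would normally be used. The function name and stopping rule are illustrative assumptions.

```python
import numpy as np


def complete_linkage(D, n_clusters):
    """Naive agglomerative clustering on a square distance matrix D.

    Repeatedly merges the two clusters whose farthest pair of members
    is closest (complete linkage) until n_clusters remain.
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                d = D[np.ix_(clusters[a], clusters[b])].max()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Recording the merge distance at each iteration, instead of stopping at a fixed cluster count, would give the heights needed to draw a dendrogram.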
An interesting pattern that emerged in the cluster analysis was the first cluster in the kernel-based reward functions. At first glance there is no apparent pattern to this cluster, which contains 3 flank, 2 fallback, and 2 assault data sets. However, upon further analysis it was found that these 7 data sets represented all the matches where red and blue forces were placed in very close proximity before starting the match. Matches that began with forces in close proximity were considerably more chaotic than mid-range and far matches. In close matches units often died quickly and the underlying combat simulator AI had little time to plan for the strategy nudges that were provided.
For the final experiment a kernel-based SVM classifier was trained using a one-vs-one approach to handle three classes. The classifier was evaluated on overall prediction accuracy using leave-one-out cross-validation on all training points. Using expert kernel expectations gave an overall accuracy of 61% while reward functions gave an overall accuracy of 72%. The class-specific breakdowns can be seen in the confusion matrix in Figure 5.
The results here followed a similar pattern to the two previous experiments. The classifier from the rewards outperformed the classifier from the kernel expectations, and assault and fallback strategies were easiest to distinguish while flanking and assaulting were commonly mislabeled. The four reward classifier mistakes that confused flanking and fallback data also belonged to the first cluster in the reward dendrogram. Visual inspection found that these four data sets had many early deaths, making it difficult for the low-level AI to form coherent strategies.
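The evaluation protocol can be sketched with scikit-learn as below, assuming a precomputed Gram matrix between the learned functions. The regularization constant and the function name are hypothetical choices, and this is a sketch of the protocol rather than the authors' code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC


def leave_one_out_accuracy(K, labels, C=1.0):
    """Leave-one-out accuracy of an SVM on a precomputed Gram matrix K."""
    labels = np.asarray(labels)
    correct = 0
    for train, test in LeaveOneOut().split(K):
        # scikit-learn's SVC handles multiple classes with one-vs-one voting.
        clf = SVC(kernel="precomputed", C=C)
        clf.fit(K[np.ix_(train, train)], labels[train])
        pred = clf.predict(K[np.ix_(test, train)])
        correct += int(pred[0] == labels[test][0])
    return correct / len(labels)
```

With a precomputed kernel, the classifier is fit on the train-by-train block of `K` and predicts from the test-by-train block, so the held-out match never influences training.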
Finally, one additional experiment was conducted for which kernel feature expectations offer no counterpart, so no comparison is made. Using our learned reward function we replaced one of the AI players in a match to see how closely a reward-generated policy would match the replaced player’s trajectory. The result can be seen in Figure 6.
V Conclusion
In this paper we developed a data-driven technique to define an MDP and learn a reward function which explains observed behavior. To validate this technique we generated 36 simulated engagements within a high-fidelity combat simulator and examined reward functions learned from observations of the engagements. We were able to show that by examining the reward functions learned through our technique more accurate predictions could be made about the strategy that generated the observed behavior.
In the future we plan on extending this method to a more general multiagent formulation for IRL. Such an extension seems important for problems where many agents interact with varying goals and skill levels. Finally, we hope to provide a more sophisticated explanation for why our direct estimate method seems to beat out other RL algorithms in the limited resource use case.
Acknowledgment
This work was supported by Systems Engineering, Inc and the Office of Naval Research under grant number N0001420C2011.
References
 [1] (2007) An application of reinforcement learning to aerobatic helicopter flight. Advances in neural information processing systems 19, pp. 1. Cited by: §I.
 [2] (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on machine learning, pp. 1 (en). Cited by: §I, §III-A1.
 [3] (2016) Modeling and understanding human routine behavior. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 248–260. Cited by: §I.
 [4] (2019-10) Theory of mind as inverse reinforcement learning. Current Opinion in Behavioral Sciences 29, pp. 105–110. Cited by: §I.

 [5] (2018) Imitation learning via kernel mean embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §I.
 [6] (2016) Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research 35 (11), pp. 1289–1307. Cited by: §I.
 [7] (2017) Agent-based model construction using inverse reinforcement learning. In 2017 Winter Simulation Conference (WSC), pp. 1264–1275. Cited by: §I.
 [8] (2019) Multiagent inverse reinforcement learning for certain general-sum stochastic games. Journal of Artificial Intelligence Research 66, pp. 473–502. Cited by: §I.
 [9] (2017) Multiagent inverse reinforcement learning for two-person zero-sum games. IEEE Transactions on Games 10 (1), pp. 56–68. Cited by: §I.
 [10] (2013) Inverse reinforcement learning for strategy extraction. In MLSA13–Proceedings of ‘Machine Learning and Data Mining for Sports Analytics’, workshop@ ECML/PKDD 2013, pp. 16–26. Cited by: §I.
 [11] (2000) Algorithms for inverse reinforcement learning. In ICML, Vol. 1, pp. 2. Cited by: §I.
 [12] (2012) A cultural decision-making model for negotiation based on inverse reinforcement learning. In Proceedings of the 34th annual conference of the cognitive science society, pp. 2097–2102. Cited by: §I.
 [13] (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §I.
 [14] (2011) Inverse reinforcement learning with Gaussian process. In Proceedings of the 2011 American control conference, pp. 113–118. Cited by: §I.
 [15] (2013) Recognition of agents based on observation of their sequential behavior. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 33–48. Cited by: §I.
 [16] (2019) Stable-Baselines3. GitHub. Note: https://github.com/DLR-RM/stable-baselines3 Cited by: §III-B1.
 [17] (2020) Human apprenticeship learning via kernel-based inverse reinforcement learning. External Links: 2002.10904 Cited by: §I, §III-A1, §III-B1.
 [18] (2018) On the practical art of state definitions for Markov decision process construction. IEEE Access 6, pp. 21115–21128. Cited by: §I.
 [19] (1996-01) Reinforcement Learning with Replacing Eligibility Traces. Machine Learning 22 (1), pp. 123–158. External Links: ISSN 1573-0565. Cited by: §III-B2.
 [20] (2018) Reinforcement learning: an introduction. MIT press. Cited by: §I.
 [21] (2017) Large-scale cost function learning for path planning using deep inverse reinforcement learning. The International Journal of Robotics Research 36 (10), pp. 1073–1087. Cited by: §I.
 [22] (2018) Identification of animal behavioral strategies by inverse reinforcement learning. PLoS computational biology 14 (5), pp. e1006122. Cited by: §I.
 [23] (2015) Gaussian process-based algorithmic trading strategy identification. Quantitative Finance 15 (10), pp. 1683–1703. Cited by: §I.

 [24] (2014) Algorithmic trading behavior identification using reward learning method. In 2014 International Joint Conference on Neural Networks (IJCNN), pp. 3807–3414. Cited by: §I.
 [25] (2019) Advanced planning for autonomous vehicles using reinforcement learning and deep inverse reinforcement learning. Robotics and Autonomous Systems 114, pp. 1–18. Cited by: §I.
 [26] (2008) Maximum entropy inverse reinforcement learning. In AAAI, Vol. 8, pp. 1433–1438. Cited by: §I.