1 Introduction
Data-driven learning is state-of-the-art in many domains of artificial intelligence (AI), but raw statistical performance is secondary to the trust, understanding and safety of humans. For autonomous agents to be deployed at scale in the real world, people with a range of backgrounds and remits must be equipped with robust mental models of their learning and reasoning. However, modern learning algorithms involve complex feedback loops and lack semantic grounding, rendering them black boxes from a human perspective. The field of explainable artificial intelligence (XAI) [5] has emerged in response to this challenge.
Most work in XAI focusses on developing insight into classification and regression systems trained on static datasets. In this work we consider dynamic problems comprising agents interacting with their environments. We present our approach to interpretable imitation learning (I2L), which aims to model the policy of a black box agent from analysis of its input-output statistics. We call the policy model interpretable because it takes the form of a binary decision tree, which is easily decomposed and visualised, and can be used for both factual and counterfactual explanation [3]. We move beyond most current work in the imitation learning literature by explicitly learning a latent state representation used by the agent as the basis for its decision making. After formalising our approach, we report the initial results of an implementation in a traffic simulator.
2 I2L Framework
2.0.1 Preliminaries
Our I2L approach is applied to agents that operate in Markov Decision Process (MDP) environments. At each time step, an agent perceives the current Markov state. We then assume that it maps this state into an intermediate representation, then selects an action from a discrete action space via a deterministic policy function. The next Markov state is a function of both the current state and the selected action. Modelling a state-dependent reward function is not necessary for the present work.
2.0.2 Generic Problem Formulation
We adopt the perspective of a passive spectator of the MDP with access to its Markov state and the action taken by the agent at each time step. This allows us to observe a history of states and actions to use as the basis of learning. Our objective is to imitate the agent’s behaviour, that is, to reverse-engineer the mapping from states to actions. This effectively requires approximations of both the representation function and the policy function. The need to infer both the policy and the representation on which this policy is based makes this problem a hybrid of imitation learning and state representation learning [4].
It is essential to constrain the search spaces for the approximate representation and policy functions so that they only contain functions that are human-interpretable, meaning that their operation can be understood and predicted via visualisations, natural-language explanations or brief statements of formal logic. This property must be achieved while minimally sacrificing imitation quality or tractability of the I2L problem. Given the history of state-action pairs, this problem can be formulated as an optimisation over the approximate representation and policy functions:
(1)
Here the loss term is a pairwise loss function over the discrete action space. The schematic in figure 1 outlines the task at hand (icons from users Freepik and Pixel Perfect at www.flaticon.com).
2.0.3 Space of Representation Functions
As highlighted above, it is important that the space of representation functions permits human-interpretable state representations. We achieve this by limiting its codomain to vectors of real-valued features, generated from the Markov state through recursive application of elementary operations from a finite set. To limit the complexity of this space, we specify a limit on the recursion depth. Prior domain knowledge is used to design the operation set, which in turn has a shaping effect on the representation space without requiring the stronger assumption of which precise features to use. Each feature also has a clear interpretation by virtue of its derivation via canonical operations. In our traffic simulator implementation, the operation set contains operations to extract a vehicle’s speed/position, find the nearest vehicle/junction ahead of/behind a position, and subtract one speed/position from another.
2.0.4 Space of Policy Functions
The space of policy functions must be similarly limited to functions that are comprehensible to humans while retaining the representational capacity for high-quality imitation. Following prior work [2, 6], we achieve this by adopting a decision tree structure. The pairwise loss function over the action space is used to define an impurity measure for greedy tree induction. For a given decision node, consider the proportion of data instances at that node taking each action value. We use the following measure of node impurity:
(2)
The popular Gini impurity is a special case of this measure, recovered by defining the loss to be zero if the two actions are equal, and one otherwise.
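As a minimal sketch of how such a loss-based impurity could be computed, the Python below assumes the measure is the expected pairwise loss between two actions drawn independently from the empirical action distribution at a node. This form is an assumption chosen because it reduces to Gini impurity under a zero-one loss; the function names and the absolute-difference loss are hypothetical illustrations, not taken from the paper.

```python
from collections import Counter
from itertools import product


def node_impurity(actions, loss):
    """Expected pairwise loss between two actions drawn independently
    from the empirical action distribution at a node (assumed form)."""
    n = len(actions)
    if n == 0:
        return 0.0
    proportions = {a: c / n for a, c in Counter(actions).items()}
    return sum(
        proportions[a] * proportions[b] * loss(a, b)
        for a, b in product(proportions, repeat=2)
    )


def zero_one_loss(a, b):
    """Loss of 0 for matching actions and 1 otherwise."""
    return 0.0 if a == b else 1.0


def absolute_loss(a, b):
    """Hypothetical ordinal loss: distance between action indices."""
    return float(abs(a - b))


node_actions = [0, 0, 1, 2]
counts = Counter(node_actions)
gini = 1.0 - sum((c / len(node_actions)) ** 2 for c in counts.values())

# Under the zero-one loss the measure coincides with Gini impurity.
print(node_impurity(node_actions, zero_one_loss), gini)  # both 0.625
# An ordinal loss penalises confusion between distant actions more heavily.
print(node_impurity(node_actions, absolute_loss))
```

With an ordinal loss of this kind, a node mixing adjacent acceleration levels would count as purer than one mixing the two extremes, which is one reason a pairwise loss may be preferable to plain Gini for ordered action spaces.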
2.0.5 Learning Procedure
In general, the joint inference of two unknown functions (in this case the representation and the policy) can be very challenging, but our constraints on their search spaces allow us to approximately solve equation 1 through a sequential procedure:
1. Apply domain knowledge to specify the feature-generating operations and the recursion depth. Together these define a representation comprising all valid features.
2. Iterating through each state in the observed state-action history, apply this representation to generate a vector of numerical feature values. Store this alongside the corresponding action in a training dataset.
3. Define a pairwise action loss function and deploy a slightly modified version of the CART tree induction algorithm [1] to greedily minimise the associated impurity measure (equation 2) on the training set. Let the induction process continue until every leaf node is pure, yielding a large, overfitted tree.
4. Prune the tree back using minimal cost complexity pruning (MCCP) [1], whose output is a sequence of trees, each a subtree of the last, representing a progressive reduction of the full tree down to its root.
5. Pick a tree from this sequence to use as the policy model, and define the learned representation as the subset of features from the full set used at least once in that tree.
Having a sequence of options for the tree model and associated representation allows us to manage a trade-off between accuracy on one end, and interpretability (through simplicity) on the other. In the following implementation, we explore the trade-off by selecting several trees from across the pruning sequence.
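To illustrate how a pruning sequence of this kind arises, the sketch below applies weakest-link pruning in the spirit of MCCP [1] to a small hand-built tree. It is a simplified illustration under stated assumptions: the `Node` class, the error counts and the example tree are hypothetical, and only one node is collapsed per step, whereas the full algorithm also handles ties.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    errors: int  # training errors if this node were turned into a leaf
    children: List["Node"] = field(default_factory=list)


def leaves(node):
    """Number of leaf nodes in the subtree rooted at node."""
    return 1 if not node.children else sum(leaves(c) for c in node.children)


def subtree_errors(node):
    """Total training errors made by the leaves beneath node."""
    if not node.children:
        return node.errors
    return sum(subtree_errors(c) for c in node.children)


def internal_nodes(node):
    """Yield every non-leaf node in the subtree."""
    if node.children:
        yield node
        for child in node.children:
            yield from internal_nodes(child)


def weakest_link(root):
    """Internal node whose collapse adds the fewest errors per leaf removed."""
    def cost(node):
        return (node.errors - subtree_errors(node)) / (leaves(node) - 1)
    return min(internal_nodes(root), key=cost)


def pruning_sequence(root):
    """Collapse the weakest link repeatedly (in place), recording the
    (leaf count, training errors) of each tree in the sequence."""
    sequence = [(leaves(root), subtree_errors(root))]
    while root.children:
        weakest_link(root).children = []
        sequence.append((leaves(root), subtree_errors(root)))
    return sequence


def build_example_tree():
    """Hypothetical fully grown tree: every leaf is pure (zero errors)."""
    return Node(12, [
        Node(4, [Node(0), Node(0)]),
        Node(5, [Node(0), Node(2, [Node(0), Node(0)])]),
    ])


print(pruning_sequence(build_example_tree()))
# [(5, 0), (4, 2), (3, 5), (1, 12)]
```

Each entry in the output is one candidate policy model: moving along the sequence trades training accuracy for a smaller, more readable tree, and in practice the choice would be made using validation accuracy rather than training errors.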
3 Implementation with a Traffic Simulator
We implement I2L in a traffic simulator, in which multiple vehicles follow a common policy to navigate a track while avoiding collisions. Five track topologies are shown in figure 3 (left). Coloured rectangles are vehicles and red circles are junctions. Since the policy is homogeneous across the population, we can analyse the behaviour of all vehicles equally to learn the representation and policy models.
3.0.1 Target Policies
Instead of learned policies, we use two hand-coded controllers as the targets of I2L. From the perspective of learning, these policies remain opaque. For both policies, the action space contains five discrete acceleration levels, which determine the vehicle’s speed for the next timestep within set limits.

Fully-imitable policy: a rule set based on six features including the vehicle’s speed, distance to the next junction, and separations and speeds relative to other agents approaching that junction. This policy is itself written as a decision tree, hence can be modelled exactly in principle.

Partially-imitable policy: retains some logic from the fully-imitable policy, but considers more nearby vehicles and incorporates a proportional feedback controller. These changes cannot be modelled exactly by a finite decision tree.
3.0.2 Representation
We use a set of eight operations which allows the six-feature representation used by the fully-imitable policy to be generated. These are:

Get the position of a vehicle or junction.

Get the speed of a vehicle.

Find the next junction in front of a vehicle.

Find the next vehicle in front of a vehicle or junction.

Find the next vehicle behind a vehicle or junction.

Flip between the two track positions comprising a junction.

Compute the separation between two positions.

Subtract two speeds or separations.
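With the chosen recursion depth, the generated representation has a large number of features, including all six used by the fully-imitable policy. The vast majority are irrelevant for imitating the target policies, so we face a feature selection problem.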
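The growth in the number of candidate features can be illustrated with a small symbolic enumeration. The sketch below is a simplified assumption rather than the authors' implementation: the operation names, the type labels, the starting symbol `ego` and the rule excluding repeated arguments are all hypothetical, and only a subset of the operations listed above is included.

```python
from itertools import product

# (name, argument types, return type). All names and types are hypothetical.
OPERATIONS = [
    ("pos", ("vehicle",), "position"),
    ("pos", ("junction",), "position"),
    ("speed", ("vehicle",), "speed"),
    ("junction_ahead", ("vehicle",), "junction"),
    ("vehicle_ahead", ("vehicle",), "vehicle"),
    ("vehicle_behind", ("vehicle",), "vehicle"),
    ("separation", ("position", "position"), "separation"),
    ("subtract", ("speed", "speed"), "speed"),
]
NUMERIC_TYPES = {"speed", "separation"}  # types usable as tree features


def generate_expressions(max_depth):
    """Enumerate typed expressions reachable from the ego vehicle within
    max_depth nested operations, as {expression: (type, depth)}."""
    found = {"ego": ("vehicle", 0)}
    for depth in range(1, max_depth + 1):
        snapshot = list(found.items())  # expressions of depth < current depth
        for name, arg_types, return_type in OPERATIONS:
            candidates = [
                [expr for expr, (typ, _) in snapshot if typ == arg_type]
                for arg_type in arg_types
            ]
            for args in product(*candidates):
                if len(set(args)) < len(args):
                    continue  # skip degenerate cases such as x minus x
                expression = f"{name}({', '.join(args)})"
                found.setdefault(expression, (return_type, depth))
    return found


def numeric_features(max_depth):
    """Sorted list of expressions that evaluate to a real number."""
    expressions = generate_expressions(max_depth)
    return sorted(e for e, (typ, _) in expressions.items() if typ in NUMERIC_TYPES)


for depth in (1, 2, 3):
    print(depth, len(numeric_features(depth)))
```

Even with this reduced operation set the feature count grows quickly with depth, and many generated expressions are redundant (for example, the vehicle ahead of the vehicle behind is often the ego vehicle itself), which is why the tree induction step must also act as a feature selector.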
3.0.3 Training
We run simulations with multiple vehicles on all five topologies in figure 3 (left). The recorded history is converted into a training dataset by applying the generated feature representation for each vehicle. After rebalancing action classes, we obtain our final dataset. For tree induction, we use a simple pairwise loss function. For each tree in the sequence output by MCCP, we measure the predictive accuracy on a validation dataset, and consider the number of leaf nodes and features used as heuristics for interpretability. Figure 2 plots these values across the sequence for both targets. We select five pruning levels (in addition to the unpruned tree) for evaluation.
3.0.4 Baselines
We also train one tree using only the six features used by the fully-imitable policy, and another with an alternative representation, indicative of what might be deemed useful without domain knowledge (radius, angle, normal velocity and heading of nearby agents in egocentric coordinates). Two further baselines use training data from only one track topology (either the smallest topology A or the largest D). This tests the generalisability of single-topology learning. For all baselines, we use the pruned tree with the highest validation accuracy.
4 Results and Discussion
The heat maps in figure 3 (right) contain results for two metrics of imitation quality for both target policies. Each row corresponds to a pruned tree or baseline, and each column indicates the track topology used for testing.
4.0.1 Accuracy
This metric is predictive accuracy on a held-out test set. For both target policies, mean accuracy remains high even for the most heavily pruned trees. The fact that accuracy for the partially-imitable policy is bounded below a perfect score reflects the fact that this policy is not a decision tree, so cannot be perfectly imitated by one. As expected, providing the true six-feature representation upfront (thereby removing the representation learning requirement) yields somewhat better accuracy, but it is promising to see that the difference in performance remains small over much of the pruning sequence for both targets. Lacking information about junctions, the tree using the alternative representation is unable to obtain the same levels of accuracy, demonstrating the importance of choosing the correct representation for imitation. The single-topology training results show that trees generalise well from the large topology D, but poorly from the small topology A. This suggests the latter contains insufficient variety of vehicle arrangements to capture all important aspects of the target policies.
4.0.2 Failure
Here we deploy the models as control policies in the environment and measure the mean time between failures (either collisions or ‘stalls’, when traffic flow grinds to a halt) over episodes of bounded length. One value is reserved to indicate that zero failures occurred. While results broadly correlate with accuracy, this metric shows a more marked aggregate distinction between the two target policies, and between different test topologies. Nonetheless, there is minimal degradation with pruning down to an intermediate level, with a consistently low average number of failures for both targets. In fact, it appears that intermediate pruning levels fail less often than the largest trees. While the reason for this is not immediately clear, it may be that having fewer leaves yields less frequent changes of acceleration and smoother motion. Providing the true six-feature representation confers no significant benefit over the full generated representation, while the alternative representation and single-topology training on A are unable to perform or generalise well.
5 Conclusion
We have introduced our approach to interpretable imitation learning for black box agent control policies that use intermediate low-dimensional state representations. Our models take the form of decision trees, which select from large vectors of candidate features generated from the Markov state using a set of basic operations. The accuracy-interpretability trade-off is managed by post-pruning.
Our initial implementation has shown that trees trained by I2L exhibit high predictive accuracy with respect to two hand-coded control policies, and are able to avoid failure conditions for extended periods, even when heavily pruned. It has also highlighted that using a plausible-but-incorrect state representation places a severe limitation on imitation quality, and that learning from data that do not capture the full variation of the environment leads to poor generalisation.
In ongoing follow-up work, we are exploring how our decision tree models can be used to interpret and explain their target policies, and are also implementing I2L with a truly black box policy trained by reinforcement learning.
References

[1] Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and regression trees. Wadsworth & Brooks/Cole Statistics/Probability Series (1984)
[2] Coppens, Y., Efthymiadis, K., Lenaerts, T., Nowé, A., Miller, T., Weber, R., Magazzeni, D.: Distilling deep reinforcement learning policies in soft decision trees. In: Proceedings of the IJCAI 2019 Workshop on Explainable AI. pp. 1–6 (2019)
[3] Guidotti, R., Monreale, A., Giannotti, F., Pedreschi, D., Ruggieri, S., Turini, F.: Factual and counterfactual explanations for black box decision making. IEEE Intelligent Systems (2019)
[4] Lesort, T., Díaz-Rodríguez, N., Goudou, J.-F., Filliat, D.: State representation learning for control: An overview. Neural Networks 108, 379–392 (2018)
[5] Samek, W., Wiegand, T., Müller, K.-R.: Explainable AI: Understanding, visualizing and interpreting deep learning models. arXiv:1708.08296 (2017)
[6] Turnbull, O., Lawry, J., Lowenberg, M., Richards, A.: A cloned linguistic decision tree controller for real-time path planning in hostile environments. Fuzzy Sets and Systems 293, 1–29 (2016)