Imitation learning typically performs training and testing in the same environment. This is by necessity as the Markov Decision Process (MDP) formalism defines a policy on a particular state space. However, real world environments are rarely so cleanly defined and benign changes to the environment can induce a completely new state space. Although deep imitation learning (Ho and Ermon, 2016) still defines a policy on unseen states, it remains extremely difficult to effectively generalize (Duan et al., 2017).
Domain adaptation addresses how to generalize a policy defined in a source domain to perform the same task in a target domain (Higgins et al., 2017). Unfortunately, this objective is inherently ill-defined. One wouldn’t expect to successfully transfer from a 2D gridworld to a self-driving car, but there is ambiguity in how to define a similarity measure on MDPs.
Third-person imitation (Stadie et al., 2017) resolves this ambiguity by considering transfer between isomorphic MDPs (formally defined in Section 2), where the objective is to observe a policy in the source domain, and imitate that policy in the target domain. In contrast to domain adaptation between unaligned distributions, the dynamics structure constrains the space of possible isomorphisms, and in some cases the source and target may be related by a unique isomorphism.
We consider an idealized setting for third-person imitation with complete information about the source domain, where we perfectly understand the dynamics and the policy to be imitated. This work offers a theoretical analysis, in particular demonstrating that restricting to isomorphic MDPs with complete knowledge does not trivialize the problem. Specifically, regarding how the agent may observe the target domain, we consider two regimes, summarized in Figure 1:
In the offline regime (Section 4), an oracle perfectly transfers the source policy into the target domain, and the agent observes trajectories from the oracle policy (without seeing the oracle’s actions). In this regime, we provide positive results establishing that with limited, state-only observations in the target domain, we can still efficiently imitate a policy defined in the source domain (Theorem 4.10).
In the online regime (Section 5), the agent interacts with the target domain directly, choosing its own policy at each timestep and observing the resulting transitions. In this regime, we provide a negative result: an algorithm-agnostic lower bound showing that near-symmetry in the dynamics can force any algorithm to pay a large sample complexity.
A Motivating Example:
To clarify the setup and distinguish the two observation regimes, we elaborate upon an example. Suppose our source domain is a video game, where the state space corresponds to the monitor screen and the action space corresponds to key presses, and we wish to imitate an expert player of the game. The target domain is the same game played on a new monitor with higher screen brightness. Clearly the underlying game hasn't changed, and there is a natural bijection from screen states of the target monitor to those of the source monitor, namely "dimming the screen".
On the one hand, in the offline setting, we're forbidden from playing on the new monitor ourselves. Instead we observe recordings of the expert, played on the brighter monitor. Again, as these are recordings, we see the states the expert visits but not their actions. On the other hand, in the online setting, we simply run transitions on the brighter monitor ourselves. Note that if the screen includes benign features which minimally impact the game (say the player's chosen name appears onscreen), it may be very difficult to learn the bijection between the target and source monitors. Either way, through observations we guess a new policy to be played on the bright monitor, which hopefully mimics the expert's behavior.
Summary of Contributions:
Our primary contribution in this work is a provably efficient algorithm for offline third-person imitation, with a polynomial upper bound on the sample complexity necessary to control the imitation loss. Our main technical novelty is a means of clipping the states of a Markov chain according to their stationary distribution, while preserving properties of a bijection between isomorphic chains. We also prove an algorithm-agnostic lower bound for online third-person imitation, through reduction to bandit lower bounds.
We consider a source MDP without reward $M = (\mathcal{S}, \mathcal{A}, T, \mu_0)$, and target MDP $\widetilde{M} = (\widetilde{\mathcal{S}}, \mathcal{A}, \widetilde{T}, \widetilde{\mu}_0)$. To characterize an isomorphism between $M$ and $\widetilde{M}$, we assume the existence of a bijective mapping $\pi : \mathcal{S} \to \widetilde{\mathcal{S}}$, such that $\widetilde{T}(\pi(s') \mid \pi(s), a) = T(s' \mid s, a)$ and $\widetilde{\mu}_0(\pi(s)) = \mu_0(s)$. Note that in this notation, $\pi$ is not a policy.
We also fix an ordering of the states so that $\pi$ may be written in matrix form as a permutation matrix. In particular, we will overload notation to use $\pi$ as a permutation matrix, such that $\pi_{ij} = 1$ denotes that $\pi(s_j) = \tilde{s}_i$. Let $\mathcal{P}$ denote the space of permutation matrices.
A policy $\psi$ maps states to distributions on actions, but for our purposes it will be convenient to consider the policy as a matrix $\Psi \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}||\mathcal{A}|}$. To relate the two notions, $\Psi = [\Psi^{(a_1)} \cdots \Psi^{(a_{|\mathcal{A}|})}]$ is a block of diagonal matrices, one for each action, where $\Psi^{(a)} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ is diagonal, and $\Psi^{(a)}_{ss} = \psi(a \mid s)$.
The dynamics matrix is denoted $T \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|}$. It can also be decomposed into blocks $T = [T^{(a_1)}; \cdots; T^{(a_{|\mathcal{A}|})}]$, where $T^{(a)} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$, and $T^{(a)}_{ss'} = T(s' \mid s, a)$.
Using this notation, $\Psi T$ forms the Markov chain on $\mathcal{S}$ induced by following policy $\psi$. Explicitly, $(\Psi T)_{ss'} = \sum_{a} \psi(a \mid s)\, T(s' \mid s, a)$.
Note that under this notation, the dynamics and initial distribution in $\widetilde{M}$ can be written blockwise as $\pi T^{(a)} \pi^\top$ and $\pi \mu_0$ respectively. The occupancy measure is defined with regard to a policy, as well as the underlying dynamics and initial distribution. Specifically, $\rho_\psi(s, a) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \Pr(s_t = s, a_t = a)$, where the dependence on the dynamics is through the sampling of a trajectory $(s_1, a_1, s_2, a_2, \ldots)$.
Similarly, we introduce the state-only occupancy measure $\rho_\psi(s) = \sum_a \rho_\psi(s, a)$. We will make use of the identity $\rho_\psi(s, a) = \rho_\psi(s)\, \psi(a \mid s)$, as well as the fact that $\rho_\psi(s)$ is the stationary distribution of the Markov chain $\Psi T$, which both follow from the constraint-based characterization of occupancy (Puterman, 1994).
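These identities are easy to verify numerically. The sketch below builds a small hypothetical MDP (the arrays are stand-ins for the notation above, not anything from the paper), forms the induced state-only chain, and checks that its stationary distribution factors the occupancy measure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state, 2-action MDP: psi[s, a] is the action distribution
# at state s, and P[a, s, t] the probability of moving from s to t under a.
psi = np.array([[0.7, 0.3],
                [0.5, 0.5],
                [0.2, 0.8]])
P = rng.dirichlet(np.ones(3), size=(2, 3))   # each P[a, s, :] sums to 1

# Induced state-only Markov chain: C[s, t] = sum_a psi[s, a] * P[a, s, t].
C = np.einsum('sa,ast->st', psi, P)

# Stationary distribution = left Perron eigenvector of C, normalized.
w, v = np.linalg.eig(C.T)
rho = np.real(v[:, np.argmax(np.real(w))])
rho = rho / rho.sum()

# State-action occupancy factors as rho(s, a) = rho(s) * psi(a | s).
occupancy = rho[:, None] * psi
```

Since every entry of this random chain is positive, the chain is ergodic and the stationary distribution is unique, so `rho @ C` reproduces `rho` and `occupancy` sums to one.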
The value function for a given policy $\psi$ and reward function $r$ is defined as $V_r(\psi) = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}\big[ \sum_{t=1}^{N} r(s_t, a_t) \big]$.
We note the very useful identity $V_r(\psi) = \langle \rho_\psi, r \rangle$.
Lastly, we use the notation $\sigma_i(A)$ to denote the $i$th largest singular value of $A$.
2.2 Observation Settings
To begin, we’re given full knowledge of the source domain , as well as and , the policy and corresponding occupancy measure we want to imitate. We consider two settings through which we can interact with the target domain, in order to learn how to adapt into this new domain.
In the offline setting, we only observe the policy $\psi \circ \pi^{-1}$ being played in $\widetilde{M}$. We can consider $\psi \circ \pi^{-1}$ as an oracle for third-person imitation, as this policy exactly maps from $\widetilde{\mathcal{S}}$ to $\mathcal{S}$, calls $\psi$, and maps back. To guarantee the trajectories don't get trapped in a terminal state, we assume this agent has a reset probability. Through these observations, we must output a policy $\widehat{\psi}$ to be played in $\widetilde{M}$. We provide upper bounds for this setting in Section 4.
Crucially, in this setting we assume access to the states but not the actions from observed trajectories, as in the imitation-from-observation setting (Sun et al., 2019). This assumption is well-motivated. In practice, observed trajectories from an expert often come from video, where actions are difficult to infer (Liu et al., 2018). Additionally, the problem becomes trivial with observed actions, as one may mimic the oracle's actions at each state in $\widetilde{\mathcal{S}}$ without trying to understand $\pi$ at all.
In the online setting, we define our own policy $\psi_t$ to play in $\widetilde{M}$ at each timestep $t$, with full observation of the trajectories. After $n$ total transitions we output our final policy $\widehat{\psi}$. Intuitively, this setting allows for more varied observations in the target domain. But without an expert oracle to demonstrate the correct state distribution, an agent in this setting may be deceived by near-symmetry in the dynamics and predict the wrong alignment. We further highlight this difficulty in Section 5.
2.3 Imitation Objective
In either setting, through observations from the target domain we output a policy $\widehat{\psi}$. The corresponding occupancy measure we denote as $\widetilde{\rho}_{\widehat{\psi}}$, where the notation reflects the dependence on the dynamics and initial distribution in $\widetilde{M}$, namely $\pi T^{(a)} \pi^\top$ and $\pi \mu_0$.
We measure imitation by comparing the correctly transferred policy $\psi \circ \pi^{-1}$ against the guessed policy $\widehat{\psi}$. Explicitly, our objective is
$$\mathrm{TV}\big(\widetilde{\rho}_{\widehat{\psi}},\ \widetilde{\rho}_{\psi \circ \pi^{-1}}\big).$$
As a sanity check, we confirm that if we play $\widehat{\psi} = \psi \circ \pi^{-1}$, then indeed the two policies induce the same chain in $\widetilde{M}$ and the occupancies are equal.
A form of this objective with a general IPM as the distributional distance was introduced in Ho and Ermon (2016). To justify using this loss, note the objective can be equivalently written as $\frac{1}{2} \sup_{\|r\|_\infty \leq 1} \big| V_r(\widehat{\psi}) - V_r(\psi \circ \pi^{-1}) \big|$. In other words, minimizing the imitation objective guarantees $\widehat{\psi}$ and $\psi \circ \pi^{-1}$ perform nearly as well for any reward function with a bound on its maximum magnitude.
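This equivalence is just the variational form of total variation, and it can be checked directly. In the sketch below the two distributions are arbitrary stand-ins for the two occupancy measures:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))   # stand-in for one occupancy measure
q = rng.dirichlet(np.ones(6))   # stand-in for the other

tv = 0.5 * np.abs(p - q).sum()

# The reward r = sign(p - q) attains the supremum over ||r||_inf <= 1,
# so sup_r |<p - q, r>| = 2 * TV(p, q).
r_star = np.sign(p - q)
best_gap = (p - q) @ r_star

# Any other bounded reward induces a smaller value gap.
r_other = rng.uniform(-1.0, 1.0, size=6)
other_gap = abs((p - q) @ r_other)
```

So controlling the TV objective uniformly controls the value gap $|V_r(\widehat{\psi}) - V_r(\psi \circ \pi^{-1})| = |\langle \widetilde{\rho}_{\widehat{\psi}} - \widetilde{\rho}_{\psi \circ \pi^{-1}}, r \rangle|$ over all bounded rewards.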
3 Related Work
The theory of imitation learning depends crucially on what interaction is available to the agent. Behavior cloning (Bain, 1995) learns a policy offline from supervised expert data. With online data, imitation learning can be cast as a measure matching problem on occupancy measures (Ho and Ermon, 2016). With an expert oracle, imitation learning has no-regret guarantees (Ross et al., 2011). Many of these algorithms can be adapted to the observation setting (Torabi et al., 2018a, b; Yang et al., 2019).
General domain adaptation for imitation learning has a rich applied literature (Pastor et al., 2009; Tobin et al., 2017; Ammar et al., 2015). Third-person imitation specifically was formalized in Stadie et al. (2017), extending the method of Ho and Ermon (2016) by learning domain-agnostic features. Other deep algorithms explicitly learn an alignment between the state spaces, based on multiple tasks in the same environments (Kim et al., 2019) or unsupervised image alignment (Gamrian and Goldberg, 2019).
The closest work to ours is Sun et al. (2019), which shares the focus on imitation learning without access to actions, but differs in studying the first-person setting primarily with online feedback. This work also takes inspiration from literature on friendly graphs (Aflalo et al., 2015), which characterize robustly asymmetric structure.
4 Offline Imitation
4.1 Markov Chain Alignment
Because the offline setting only runs the policy $\psi \circ \pi^{-1}$, and reveals no actions, it is equivalent to observing a trajectory of the state-only Markov chain induced by $\psi \circ \pi^{-1}$ in $\widetilde{M}$. Let us elaborate on this fact.
Define the Markov chain $C = \Psi T$, which is ergodic when restricted to the strongly connected components that intersect the initial distribution. In $\widetilde{M}$, the dynamics are $\pi T^{(a)} \pi^\top$, the oracle policy is $\psi \circ \pi^{-1}$, and the initial distribution is $\pi \mu_0$. We also assume the oracle agent following $\psi \circ \pi^{-1}$ has a reset probability.
All together, this implies our observations in the offline setting are drawn from a trajectory of the permuted chain $\pi C \pi^\top$. In summary, given full knowledge of $C$ and a trajectory sampled from $\pi C \pi^\top$, our algorithm will seek to learn the alignment $\pi$ in order to approximate $\psi \circ \pi^{-1}$, hopefully leading to low imitation loss.
4.2 Symmetry Without Approximation
As a warmup, we consider the setting with no approximation, where we observe the permuted chain $\pi C \pi^\top$ exactly. To relate this chain to $C$, we can try to find symmetries, i.e. the minimizers of
$$\min_{\Pi \in \mathcal{P}} \big\| \Pi C \Pi^\top - \pi C \pi^\top \big\|_F. \tag{4}$$
We can equivalently consider finding automorphisms of $C$, which may be posed as a minimization over permutation matrices $\Pi$:
$$\min_{\Pi \in \mathcal{P}} \big\| \Pi C \Pi^\top - C \big\|_F. \tag{5}$$
Clearly both these objectives are minimized at $0$. Intuitively, to recover $\pi$ we'd like it to be the unique minimizer of (4), or equivalently for the identity to be the unique minimizer of (5). Hence, in order to make third-person imitation tractable, we will seek to bound (5) away from $0$ when $\Pi \neq I$, or in other words focus on Markov chains which are robustly asymmetric.
We introduce notation:
Definition 4.1 (Rescaled transition matrix).
For an ergodic Markov chain $C$ with stationary distribution $\rho$, let $D = \mathrm{diag}(\rho)$ and define $A = D^{1/2} C D^{-1/2}$ as the rescaled transition matrix of $C$.
Definition 4.2 (Friendly matrix).
A matrix $A$ is friendly if, given the singular value decomposition $A = U \Sigma V^\top$, $\Sigma$ has distinct diagonal elements and $U^\top \mathbb{1}$ has all non-zero elements. Similarly, a matrix is $\alpha$-friendly if $\sigma_i(A) - \sigma_{i+1}(A) \geq \alpha$ for all $i$ and $|U^\top \mathbb{1}| \geq \alpha$ elementwise. An ergodic Markov chain is friendly if its rescaled transition matrix is friendly.
The significance of friendliness in graphs was studied in Aflalo et al. (2015), to characterize relaxations of the graph isomorphism problem. We first confirm that several friendliness properties carry over to Markov chains.
Proposition 4.3. For a permutation matrix $\Pi$, $\Pi C \Pi^\top = C$ if and only if $\Pi A \Pi^\top = A$ and $\Pi \rho = \rho$.
Proof. Suppose $\Pi C \Pi^\top = C$. If $\rho$ is the stationary distribution of $C$, then $\Pi \rho$ is the stationary distribution of $\Pi C \Pi^\top$. So by uniqueness of the stationary distribution in an ergodic chain, $\Pi \rho = \rho$ and therefore $\Pi D \Pi^\top = D$. Then clearly $\Pi D^{\pm 1/2} \Pi^\top = D^{\pm 1/2}$, and therefore $\Pi A \Pi^\top = \Pi D^{1/2} C D^{-1/2} \Pi^\top = D^{1/2} \big( \Pi C \Pi^\top \big) D^{-1/2} = A$.
For the reverse implication, $\Pi C \Pi^\top = \Pi D^{-1/2} A D^{1/2} \Pi^\top = D^{-1/2} \big( \Pi A \Pi^\top \big) D^{1/2} = C$, again using $\Pi \rho = \rho$ to commute $\Pi$ past $D^{\pm 1/2}$. ∎
Corollary 4.4. If a Markov chain $C$ is friendly, then it has a trivial automorphism group.
In what follows, for any SVD $A = U \Sigma V^\top$, we will always choose to orient the columns of $U$ such that $U^\top \mathbb{1} > 0$ elementwise.
4.3 Exact Symmetry Algorithm
By Proposition 4.3, the automorphism group of $C$ is contained in the automorphism group of the rescaled transition matrix $A$. Interpreting $A$ as a weighted graph, determining its automorphisms is at least as computationally hard as the graph isomorphism problem (Aflalo et al., 2015).
In general, algorithms for graph isomorphisms optimize time complexity, whereas we are more interested in controlling sample complexity. Nevertheless, we have the following result:
Theorem 4.5. Given $C$ and $\pi C \pi^\top$, if $C$ is a friendly Markov chain, there is an algorithm to exactly recover $\pi$ in $O(|\mathcal{S}|^3)$ time.
This result is a simple extension of the main result in Umeyama (1988), applying the friendliness property to Markov chains rather than adjacency matrices. But the characterization of automorphisms will be used again later to control sample complexity, when we only observe through sampled trajectories.
We begin with the following:
Proposition 4.6. Given two friendly matrices decomposed as $A = U \Sigma V^\top$ and $B = \bar{U} \bar{\Sigma} \bar{V}^\top$, suppose $B = \pi A \pi^\top$. Then $\pi$ is the unique permutation which satisfies $\pi U = \bar{U}$.
Proof. Clearly $\Sigma = \bar{\Sigma}$. Rewriting $B = \pi A \pi^\top$ with the SVD gives $\bar{U} \Sigma \bar{V}^\top = (\pi U) \Sigma (\pi V)^\top$.
Rearranging, this implies $\bar{U}^\top \pi U$ commutes with $\Sigma \Sigma^\top$. Commuting with a diagonal matrix with distinct elements implies $\bar{U}^\top \pi U$ is diagonal. As this product is also unitary and real, it must be that $\bar{U}^\top \pi U = E$ where $E$ is diagonal and $E^2 = I$.
Again rearranging, this implies $\pi U = \bar{U} E$. By the assumption on the SVD orientation, $E$ must preserve the signs of $\bar{U}^\top \mathbb{1}$, therefore $E = I$, and $\pi U = \bar{U}$.
Now, suppose $\pi' U = \bar{U}$ for some other permutation $\pi'$. Then $\pi' U = \pi U$, and since $U$ is invertible, $\pi' = \pi$. ∎
Proof of Theorem 4.5.
Let $A$ and $B$ be the rescaled transition matrices of $C$ and $\pi C \pi^\top$ respectively. Reusing the same SVD notation, by Propositions 4.3 and 4.6, $\pi U = \bar{U}$. Consider the linear assignment problem $\min_{\Pi \in \mathcal{P}} \|\Pi U - \bar{U}\|_F^2$, which may be solved in $O(|\mathcal{S}|^3)$ time using the Hungarian algorithm (Kuhn, 1955). Again by Proposition 4.6, this linear program is minimized at $0$ and recovers $\pi$ as the unique minimizer. ∎
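The recovery procedure is essentially Umeyama-style spectral matching: orient the singular vectors, then solve a linear assignment. A minimal sketch (function names are our own; a generic random matrix stands in for a friendly rescaled transition matrix):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def oriented_svd_U(A):
    # Left singular vectors with column signs fixed so that U^T 1 > 0
    # elementwise, matching the orientation convention above.
    U, _, _ = np.linalg.svd(A)
    return U * np.sign(U.T @ np.ones(A.shape[0]))[None, :]

def recover_permutation(A, B):
    # For friendly A and B = pi A pi^T, uniqueness gives pi U = U_bar,
    # so U_bar @ U.T equals pi itself; a linear assignment extracts it.
    U, U_bar = oriented_svd_U(A), oriented_svd_U(B)
    M = U_bar @ U.T
    r, c = linear_sum_assignment(-M)      # maximize the matched entries
    Pi = np.zeros_like(M)
    Pi[r, c] = 1.0
    return Pi

# Usage: a generic matrix is friendly almost surely.
rng = np.random.default_rng(3)
A = rng.normal(size=(6, 6))
perm = rng.permutation(6)
Pi_true = np.zeros((6, 6))
Pi_true[np.arange(6), perm] = 1.0
B = Pi_true @ A @ Pi_true.T
Pi_hat = recover_permutation(A, B)
```

The assignment step is what makes the exact method tolerant of small perturbations, which is why the same characterization reappears in the finite-sample analysis below.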
4.4 Symmetry With Approximation
With finite sample complexity, we still know the base chain $C$ exactly, but we get empirical estimates of the permuted chain $\pi C \pi^\top$ by running trajectories. Specifically, $m$ samples are drawn from a single trajectory of $\pi C \pi^\top$, with the first state drawn from $\pi \mu_0$.
Call the empirical estimate $\widehat{B}$, i.e. $\widehat{B}_{ij} = N_{ij} / N_i$, where $N_{ij}$ counts the number of observed transitions from state $i$ to state $j$ and $N_i = \sum_j N_{ij}$. And the empirical stationary distribution is $\widehat{\nu}$, where $\widehat{\nu}_i = N_i / m$. We can characterize the approximation error of the chain and stationary distribution as $\|\pi^\top \widehat{B} \pi - C\|$ and $\|\pi^\top \widehat{\nu} - \rho\|$ respectively. Note these error terms are defined in the original state space $\mathcal{S}$.
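These estimators are simple counting. A sketch (the uniform fallback for unvisited rows is our own tie-break, not the paper's):

```python
import numpy as np

def empirical_chain(traj, n_states):
    # B_hat from transition counts, nu_hat from state visit frequencies.
    counts = np.zeros((n_states, n_states))
    for s, t in zip(traj[:-1], traj[1:]):
        counts[s, t] += 1.0
    row = counts.sum(axis=1, keepdims=True)
    B_hat = np.divide(counts, row,
                      out=np.full_like(counts, 1.0 / n_states),
                      where=row > 0)
    nu_hat = np.bincount(traj, minlength=n_states) / len(traj)
    return B_hat, nu_hat

# Simulate a trajectory from a known 3-state ergodic chain, then estimate.
rng = np.random.default_rng(0)
B = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.2, 0.0, 0.8]])
traj = [0]
for _ in range(20000):
    traj.append(rng.choice(3, p=B[traj[-1]]))
B_hat, nu_hat = empirical_chain(np.array(traj), 3)
```

With a long enough trajectory of an ergodic chain, both estimates concentrate; the rates quoted below depend on the mixing properties of the chain rather than on the trajectory being i.i.d.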
Our goal is to use these estimates to produce a good policy in the target space. Say we predict the bijection is $\widehat{\pi}$, and play the policy $\psi \circ \widehat{\pi}^{-1}$, whereas the correct policy in the target space is $\psi \circ \pi^{-1}$. We'd like to be able to control the imitation distance between these two policies when $\widehat{\pi}$ is close to $\pi$.
For that purpose, define the high-occupancy states $S_\delta = \{ s \in \mathcal{S} : \rho(s) > \delta \}$, where $\rho$ is the stationary distribution of $C$, so these states will be visited "sufficiently" often. We first show correctness of the bijection on these states suffices for good imitation.
Lemma 4.7 (Policy Difference Lemma (Kakade and Langford, 2002)).
For two policies $\psi_1, \psi_2$ in the MDP defined by $(\mathcal{S}, \mathcal{A}, T, \mu_0)$,
$$V_r(\psi_1) - V_r(\psi_2) = \big\langle \rho_{\psi_1},\, A^{\psi_2}_r \big\rangle,$$
where $A^{\psi_2}_r$ is the average advantage function.
Theorem 4.8. Suppose $\widehat{\pi}(s) = \pi(s)$ for all $s \in S_\delta$. Then $\mathrm{TV}\big(\widetilde{\rho}_{\psi \circ \widehat{\pi}^{-1}},\ \widetilde{\rho}_{\psi \circ \pi^{-1}}\big) = O(\delta |\mathcal{S}|)$.
Proof. First we decompose the objective over the high- and low-occupancy states.
From the assumption and the definitions of $S_\delta$ and $\widehat{\pi}$, the policies $\psi \circ \widehat{\pi}^{-1}$ and $\psi \circ \pi^{-1}$ agree at $\tilde{s}$ whenever $\tilde{s} \in \pi(S_\delta)$. Equivalently, since $\rho$ is the stationary distribution of $C$ in the original space, and $\pi \rho$ is the stationary distribution of $\pi C \pi^\top$, the policies agree at $\tilde{s}$ whenever $(\pi \rho)(\tilde{s}) > \delta$.
Note that $\rho(s) \leq \delta$ implies $\rho(s)\, \psi(a \mid s) \leq \delta$ for any $a$. Hence each state outside $S_\delta$ contributes at most $O(\delta)$ to the objective,
and the claim follows from the simple bound $|S_\delta^c| \leq |\mathcal{S}|$. ∎
The bound in Theorem 4.8 depends on $\widehat{\pi}$ in a very discrete sense, controlled by the states where $\widehat{\pi}$ and $\pi$ agree. Say $\widehat{\pi}$ contains a single error, $\widehat{\pi}(s') = \pi(s)$ for some $s \neq s'$. Then at $\pi(s)$ we mistakenly play the action distribution $\psi(\cdot \mid s')$, rather than the correct distribution $\psi(\cdot \mid s)$. Because we never observe actions from the oracle, $\psi$ could be arbitrarily different at $s$ and $s'$, yielding a very suboptimal occupancy measure.
4.5 Approximate Symmetry Algorithm
In light of Theorem 4.8, an algorithm could either seek to recover $\pi$ exactly, or find a $\widehat{\pi}$ which agrees with $\pi$ on high-occupancy states. We consider a learning algorithm for both objectives, and bound its sample complexity. The trick will be carefully setting the threshold $\delta$ that defines what constitutes high occupancy.
To state the theorem, we introduce the subscript notation $A_{S_\delta}$ to denote the principal submatrix of $A$ defined by the indices of $S_\delta$, and let $\Delta = \min_s |\rho(s) - \delta|$ denote the gap between the threshold and the stationary values. Lastly, we define:
Definition 4.9 (Pseudospectral gap).
The pseudospectral gap of an ergodic Markov chain $C$ with time reversal $C^*$ is $\gamma_{\mathrm{ps}} = \max_{k \geq 1} \frac{1 - \lambda_2\big((C^*)^k C^k\big)}{k}$, where $\lambda_2$ denotes the second largest eigenvalue.
If $C$ is not ergodic, we will take $\gamma_{\mathrm{ps}}$ to mean the pseudospectral gap of $C$ restricted to the strongly connected components that intersect the initial distribution.
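For small chains the pseudospectral gap can be computed directly from the definition; the sketch below assumes the standard multiplicative-reversibilization form, and truncating the maximum at `k_max` is a practical concession of this sketch:

```python
import numpy as np

def pseudo_spectral_gap(C, rho, k_max=10):
    # gamma_ps = max_k (1 - lambda_2((C*)^k C^k)) / k, where C* is the
    # time reversal C*[i, j] = rho[j] * C[j, i] / rho[i].
    C_star = (rho[None, :] * C.T) / rho[:, None]
    best = 0.0
    for k in range(1, k_max + 1):
        M = np.linalg.matrix_power(C_star, k) @ np.linalg.matrix_power(C, k)
        lam = np.sort(np.real(np.linalg.eigvals(M)))[::-1]
        best = max(best, (1.0 - lam[1]) / k)
    return best

# A small ergodic birth-death chain and its stationary distribution.
C = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
w, v = np.linalg.eig(C.T)
rho = np.real(v[:, np.argmax(np.real(w))])
rho = rho / rho.sum()
gap = pseudo_spectral_gap(C, rho)
```

Each matrix $(C^*)^k C^k$ is a reversible chain with real eigenvalues in $[0, 1]$, so the gap always lies in $(0, 1]$ for an ergodic chain.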
Theorem 4.10. The policy learning algorithm in Algorithm 1 satisfies the following: for threshold $\delta$ and failure probability $\eta$, if $A_{S_\delta}$ is $\alpha$-friendly and the number of samples $m$ is polynomial in $1/\alpha$, $1/\Delta$, $1/\gamma_{\mathrm{ps}}$, and $1/\delta$, and logarithmic in $|\mathcal{S}|$ and $1/\eta$, then with probability at least $1 - \eta$, the output policy $\widehat{\psi}$ agrees with $\psi \circ \pi^{-1}$ on all of $\pi(S_\delta)$. In particular, $\mathrm{TV}\big(\widetilde{\rho}_{\widehat{\psi}},\ \widetilde{\rho}_{\psi \circ \pi^{-1}}\big) = O(\delta |\mathcal{S}|)$.
The most important feature of this bound is the dependence on $|\mathcal{S}|$. In the sample complexity it only appears through a log term, and all other terms can be independent of $|\mathcal{S}|$ depending on the choice of $\delta$ and the structural properties of $C$. The error is still linear in $\delta |\mathcal{S}|$, but this term appears necessary. If some occupancy mass leaves the well-supported states $S_\delta$, it could cover all the negligible states, and either incur error linear in $\delta |\mathcal{S}|$, or require exploration of every state and therefore sample complexity linear in $|\mathcal{S}|$.
Here we give the main ideas of the proof; full details are provided in the Appendix.
Recall that $\widehat{B}$ and $\widehat{\nu}$ denote the empirical transition matrix and stationary distribution. We also define $\widehat{C} = \pi^\top \widehat{B} \pi$ as the empirical chain permuted back into the original MDP. Likewise define $\widehat{\rho} = \pi^\top \widehat{\nu}$, and $\widehat{D}$ to be the diagonal matrix of $\widehat{\rho}$.
Given $\widehat{C}$ and $\widehat{\rho}$, the immediate choice for an estimator of the rescaled transition matrix would be $\widehat{A} = \widehat{D}^{1/2} \widehat{C} \widehat{D}^{-1/2}$. However, this will not be well-defined if our samples don't visit every state of $\mathcal{S}$. Furthermore, if $C$ is only ergodic when restricted to a subset of $\mathcal{S}$, then $\widehat{A}$ won't be defined even with infinite sample complexity. Similarly, if $\min_s \rho(s)$ is vanishingly small, the sample complexity will become prohibitively large in order to guarantee that $\widehat{A}$ is well-defined.
Our primary technical novelty addresses both these issues by setting a threshold on stationary mass, and discarding states below the threshold. Define $S_\delta = \{ s : \rho(s) > \delta \}$ and $\widehat{S}_\delta = \{ s : \widehat{\rho}(s) > \delta \}$. We restate the notation that a subscript denotes taking the principal submatrix corresponding to $S_\delta$ or $\pi(S_\delta)$ depending on the matrix's domain. So for example, $\widehat{C}_{S_\delta}$ is $\widehat{C}$ restricted to rows and columns given by $S_\delta$, and likewise $\widehat{B}_{S_\delta}$ is restricted to $\pi(S_\delta)$.
Several concentration results for empirical Markov chain transitions and stationary distributions control the convergence of our estimators (Wolfer and Kontorovich, 2019b, a). Our main assumption is that the gap $\Delta$ is non-negligible. Then with high probability, and sample complexity depending on $\Delta$ but not on $\min_s \rho(s)$, we have $\widehat{\rho}(s) > \delta$ iff $\rho(s) > \delta$. In other words, no empirical stationary estimates will "cross" the threshold, or put another way $\widehat{S}_\delta = S_\delta$. We can then restrict our attention to the states above the threshold, such that the sample complexity necessary for concentration depends on $\delta$ but not on $\min_s \rho(s)$ (and only logarithmically on $|\mathcal{S}|$).
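The clipping step itself is a one-liner given the empirical stationary estimate. A sketch (restricting to the principal submatrix without renormalizing is our reading of the construction):

```python
import numpy as np

def clip_chain(C_hat, rho_hat, delta):
    # Discard states with empirical stationary mass at most delta, keeping
    # the principal submatrix on the surviving indices.
    keep = np.flatnonzero(rho_hat > delta)
    return keep, C_hat[np.ix_(keep, keep)]

# Hypothetical empirical estimates: one state falls below the threshold.
rho_hat = np.array([0.50, 0.30, 0.15, 0.05])
rng = np.random.default_rng(0)
C_hat = rng.dirichlet(np.ones(4), size=4)
keep, C_sub = clip_chain(C_hat, rho_hat, delta=0.1)
```

When the threshold events agree on the true and empirical stationary distributions, the same index set survives on both sides, so the clipped empirical matrix is a direct estimate of the clipped population matrix.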
For states in $S_\delta$, the restricted rescaled transition matrix $A_{S_\delta}$ is well-defined. And with high probability we can define our estimator $\widehat{A}_{S_\delta} = \widehat{D}_{S_\delta}^{1/2} \widehat{C}_{S_\delta} \widehat{D}_{S_\delta}^{-1/2}$. Appealing to a strong friendliness assumption on $A_{S_\delta}$, singular value perturbation inequalities imply that $\widehat{A}_{S_\delta}$ is also friendly.
Finally, the asymmetric properties of friendly matrices given in Proposition 4.6 enable exact recovery of the submatrix of $\pi$ restricted to the indices $S_\delta$ and $\pi(S_\delta)$. And by Theorem 4.8, determining the alignment on all high-occupancy states still yields a bound on the imitation loss.
5 Online Imitation
5.1 MDP Alignment
In the online setting, we’re still seeking to imitate , or equivalently . However, we no longer observe trajectories of the correct policy played in .
Instead, we are in a setting similar to a bandit, but without reward. At time $t$, we play a policy $\psi_t$ defined on $\widetilde{\mathcal{S}}$ and observe a transition. We allow resets to the initial distribution. After $n$ plays, where $n$ may be a random variable, we choose a final policy $\widehat{\psi}$ and receive instantaneous regret given by the imitation loss $\mathrm{TV}\big(\widetilde{\rho}_{\widehat{\psi}},\ \widetilde{\rho}_{\psi \circ \pi^{-1}}\big)$.
One simple algorithm might treat each possible bijection $\pi'$ as an arm, where pulling is akin to running a trajectory using the policy $\psi \circ \pi'^{-1}$, and then infer which alignment best matches the behavior policy. Or one could consider algorithms which don't play policies of the form $\psi \circ \pi'^{-1}$ but simply explore the target space in a principled way.
Nevertheless, we derive a lower bound on the imitation loss of any algorithm in the online setting, demonstrating even complete knowledge of the source domain doesn’t trivialize third-person imitation.
5.2 Lower Bound Counterexample
Consider a small bandit-like MDP (Figure 1(a)). Red corresponds to action $a$, blue corresponds to action $b$, and purple corresponds to both. The numbers on the edges give transition probabilities when taking the associated action. Let the initial distribution be uniform on the two initial states $s_1$ and $s_2$.
In other words, the initial state is either $s_1$ or $s_2$. Starting at $s_1$, the initial action is deterministic: playing $a$ leads to $u_1$, playing $b$ leads to $u_2$. Starting at $s_2$ the actions lead to the opposite states. Then the choice of action is irrelevant, and the transition to a terminal state is determined by the probability $p_1$ at $u_1$ and $p_2$ at $u_2$.
This characterizes $M$. To introduce $\widetilde{M}$, let's consider two possible bijections $\pi_1$ and $\pi_2$, which correspond to the possible target MDPs in Figure 1(b) and Figure 1(c) (note the values of $p_1$ and $p_2$ are swapped given $\pi_2$). These correspond to two possible dynamics on our target space. $\pi_1$ is essentially the identity map, preserving states up to hats. Whereas $\pi_2(u_1) = \hat{u}_2$ and $\pi_2(u_2) = \hat{u}_1$.
Finally, suppose the behavior policy $\psi$ we want to imitate in $M$ is defined by $\psi(a \mid s_1) = 1$ and $\psi(b \mid s_2) = 1$. In other words, the agent always travels in the first step to $u_1$. That means, under $\pi_1$ we want to travel to $\hat{u}_1$, and under $\pi_2$ we want to travel to $\hat{u}_2$. Intuitively, because $\psi$ is highly asymmetric, but the MDP is nearly symmetric, one cannot choose a policy that performs well in multiple permutations of the MDP. We formalize this intuition below.
Choose any positive values $\epsilon \leq c_1$ and $\delta \leq c_2$, where $c_1$ and $c_2$ are universal constants, and let $p_1 = \frac{1 + \epsilon}{2}$ and $p_2 = \frac{1 - \epsilon}{2}$. Consider any algorithm $\mathcal{A}$ that achieves $\epsilon$-optimal imitation loss on the above MDP with probability at least $1 - \delta$. Then $\mathbb{E}[n] \geq \frac{c_3}{\epsilon^2} \log \frac{c_4}{\delta}$ for some universal constants $c_3$ and $c_4$.
Proof. Fix a policy $\widehat{\psi}$; for brevity we will write its occupancy measure $\widetilde{\rho}_{\widehat{\psi}}$ as simply $\widetilde{\rho}$.
Again use the variational form of total variation, $\mathrm{TV}(p, q) = \frac{1}{2} \sup_{\|f\|_\infty \leq 1} \langle p - q, f \rangle$. Choose $f$ to equal $1$ on the states reached through the first intermediate state and $0$ elsewhere. Then a direct calculation lower bounds the imitation loss by the probability mass $\widehat{\psi}$ sends through the wrong intermediate state.
Now we proceed by a reduction to multi-armed bandits with known biases. Consider a two-armed bandit with Bernoulli rewards, where the hypotheses for the arm biases are $(p_1, p_2)$ and $(p_2, p_1)$. We define the following algorithm for the two-armed bandit. First run algorithm $\mathcal{A}$ on our MDP, where we couple pulls from arm 1 with transitions from $\hat{u}_1$ and pulls from arm 2 with transitions from $\hat{u}_2$. Call the policy output by $\mathcal{A}$ $\widehat{\psi}$. Then output arm 1 if $\widehat{\psi}$ routes most of its mass through $\hat{u}_1$, otherwise arm 2.
Under the first hypothesis, the target MDP is the one induced by $\pi_1$, so by our assumptions on $\mathcal{A}$, with probability at least $1 - \delta$ the imitation loss is $\epsilon$-optimal, which implies arm 1 is output. Similar reasoning implies arm 2 is output under the second hypothesis, hence this procedure outputs the optimal arm with probability at least $1 - \delta$. Because each arm pull is coupled to a transition, the sample complexity of $\mathcal{A}$ is lower bounded by that of the two-armed bandit, and the result then follows from Theorem 13 in Mannor and Tsitsiklis (2004). ∎
This bound illustrates why imitation is substantially more challenging than seeking high reward. In a regular RL problem with reward at the terminal states, if $p_1$ and $p_2$ are close then the expected reward changes very slightly depending on the policy. But in the imitation setting, the values of $p_1$ and $p_2$ are essentially features of the states, which the agent must (very inefficiently) distinguish in order to achieve small imitation error. Likewise, this counterexample captures why the online setting is the more challenging one studied in this work. In the offline regime, an oracle would only visit states on one half of the MDP and easily break the symmetry.
One may attribute this pessimistic bound to the choice of total variation distance. Indeed, among IPMs, total variation has very poor generalization properties (Sun et al., 2019).