SMODICE: Versatile Offline Imitation Learning via State Occupancy Matching

by Yecheng Jason Ma, et al.

We propose State Matching Offline DIstribution Correction Estimation (SMODICE), a novel and versatile algorithm for offline imitation learning (IL) via state-occupancy matching. We show that the SMODICE objective admits a simple optimization procedure through an application of Fenchel duality and an analytic solution in tabular MDPs. Without requiring access to expert actions, SMODICE can be effectively applied to three offline IL settings: (i) imitation from observations (IfO), (ii) IfO with a dynamics- or morphologically-mismatched expert, and (iii) example-based reinforcement learning, which we show can be formulated as a state-occupancy matching problem. We extensively evaluate SMODICE on both gridworld environments and high-dimensional offline benchmarks. Our results demonstrate that SMODICE is effective for all three problem settings and significantly outperforms prior state-of-the-art methods.




1 Introduction

The offline reinforcement learning (RL) framework (Lange et al., 2012; Levine et al., 2020) aims to use pre-collected, reusable offline data, without further interaction with the environment, for sample-efficient, scalable, and practical data-driven decision-making. However, this assumes that the offline dataset comes with reward labels, which may not always be available. To address this, offline imitation learning (IL) (Zolna et al., 2020; Chang et al., 2021; Anonymous, 2022) has recently been proposed as an alternative where the learning algorithm is provided with a small set of expert demonstrations and a separate set of offline data of unknown quality. The goal is to learn a policy that mimics the provided expert data while avoiding test-time distribution shift (Ross et al., 2011) by using the offline dataset.

Expert demonstrations are often much more expensive to acquire than offline data; thus, offline IL benefits significantly from minimizing assumptions about the expert data. In this work, we aim to remove two assumptions about the expert data in current offline IL algorithms: (i) expert action labels must be provided for the demonstrations, and (ii) the expert demonstrations are performed with identical dynamics (same embodiment, actions, and transitions) as the imitator agent. These requirements preclude applications to important practical problem settings, including (i) imitation from observations, (ii) imitation with a mismatched expert that obeys different dynamics or embodiment (e.g., learning from human videos), and (iii) learning only from examples of successful outcomes rather than full expert trajectories (Eysenbach et al., 2021).

Figure 1: Diagram of SMODICE. First, a state-based discriminator is trained using the offline dataset and expert observations (resp. examples). Then, the discriminator is used to train the Lagrangian value function. Finally, the value function provides the importance weights for policy training, which outputs the learned policy.

For these reasons, many algorithms for online IL have already sought to remove these assumptions (Torabi et al., 2018, 2019; Liu et al., 2019; Radosavovic et al., 2020; Eysenbach et al., 2021), but extending them to offline IL remains an open problem.

We propose State Matching Offline DIstribution Correction Estimation (SMODICE), a general offline IL framework that can be applied to all three problem settings described above. At a high level, SMODICE is based on a state-occupancy matching view of IL; in particular, it optimizes a tractable offline upper bound of the KL-divergence of the state-occupancy between the imitator $\pi$ and the expert $\pi^E$:

$$\min_{\pi} \; D_{\mathrm{KL}}\left(d^{\pi}(s) \,\|\, d^{E}(s)\right) \tag{1}$$


This state-occupancy matching objective allows SMODICE to infer the correct actions from the offline data in order to match the state-occupancy of the provided expert demonstrations. This naturally enables imitation when expert actions are unavailable, and even when the expert’s embodiment or dynamics are different, as long as there is a shared task-relevant state. Finally, we show that example-based RL (Eysenbach et al., 2021), where only examples of successful states are provided as supervision, can be formulated as a state-occupancy matching problem between the imitator and a “teleporting” expert that is able to reach success states in one step. Hence, SMODICE can also be used as an offline example-based RL method without any modification. (Footnote 1: We refer to this problem as “offline imitation learning from examples” to unify nomenclature with the other two problems.)

Naively optimizing the offline upper bound on (1) would result in an actor-critic style IL algorithm akin to prior work (Ho and Ermon, 2016; Kostrikov et al., 2018, 2020); however, these algorithms suffer from training instability in the offline regime (Kumar et al., 2019; Lee et al., 2021; Anonymous, 2022) due to the entangled nature of actor and critic learning, leading to erroneous value bootstrapping (Levine et al., 2020). SMODICE bypasses this issue and achieves “actor-free” training by directly estimating the importance weight ratio of the occupancy measures between the optimal policy and the empirical behavior policy of the offline data, leveraging the stationary distribution correction estimation (DICE) paradigm (Nachum et al., 2019a; Nachum and Dai, 2020). Specifically, by formulating the policy optimization problem via its dual (i.e., optimizing over the space of valid state-action occupancy distributions) and applying Fenchel duality, SMODICE obtains an unconstrained convex optimization problem over a value function arising from Lagrangian duality, which admits closed-form solutions in the tabular case and can be easily optimized using stochastic gradient descent (SGD) in the deep RL setting. Then, SMODICE projects the optimal value function onto the dual space to extract the optimal importance weights, and learns the optimal policy via weighted Behavior Cloning. Note that SMODICE does not learn a policy until the value function has converged.

Through extensive experiments, we show that SMODICE is effective for all three problem settings we consider and outperforms all state-of-the-art methods in each respective setting. A benefit of the improved optimization is that SMODICE is substantially more stable compared to prior methods: we obtain all SMODICE results using a single set of hyperparameters, modulo a choice of $f$-divergence, which can be tuned offline. In contrast, prior methods suffer from much greater performance fluctuation across tasks and settings. Altogether, our proposed method SMODICE can serve as a versatile offline IL algorithm that is suitable for a wide range of assumptions on expert data.

In summary, our contributions are: (i) SMODICE: a simple, stable, and versatile state-occupancy matching based offline IL algorithm for both tabular and high-dimensional continuous MDPs, (ii) a reduction of example-based reinforcement learning to state-occupancy minimization, and (iii) extensive experimental analysis of SMODICE in offline imitation from observations, mismatched experts, and examples; in all three, SMODICE outperforms competing methods.

(a) Mismatched experts
(b) Offline IL from examples
Figure 2: Illustrations of tabular SMODICE for offline imitation learning from mismatched experts and examples.

Pedagogical examples. To illustrate SMODICE’s versatility, we have applied it to two gridworld tasks, testing offline IL from mismatched experts and examples, respectively. Figure 2(a) shows an expert agent that can move diagonally in any direction, whereas the imitator can only move horizontally or vertically. In Figure 2(b), only a success state (the star) is provided as supervision. An offline dataset collected by a random agent is given to SMODICE for training in both cases. As shown, SMODICE recovers an optimal policy (i.e., one with minimum state-occupancy divergence from that of the expert) in both cases. See Appendix D.1 for details.

2 Preliminaries

Markov decision processes.

We consider a time-discounted Markov decision process (MDP) $\mathcal{M} = (S, A, R, T, \mu_0, \gamma)$ (Puterman, 2014) with state space $S$, action space $A$, deterministic rewards $R(s,a)$, stochastic transitions $s' \sim T(\cdot \mid s,a)$, initial state distribution $\mu_0(s)$, and discount factor $\gamma$. A policy $\pi: S \to \Delta(A)$ determines the action distribution conditioned on the state.

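The state-action occupancies (also known as the stationary distribution) of $\pi$ are

$$d^{\pi}(s,a) := (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr\left(s_t = s,\, a_t = a \mid s_0 \sim \mu_0,\, a_t \sim \pi(s_t),\, s_{t+1} \sim T(s_t, a_t)\right),$$

which captures the relative frequency of state-action visitations for a policy $\pi$. The state occupancies then marginalize over actions: $d^{\pi}(s) = \sum_{a} d^{\pi}(s,a)$. The state-action occupancies satisfy the single-step transpose Bellman equation:

$$d^{\pi}(s,a) = (1-\gamma)\,\mu_0(s)\,\pi(a \mid s) + \gamma\, \mathcal{T}^{\pi}_{\star} d^{\pi}(s,a),$$

where $\mathcal{T}^{\pi}_{\star}$ is the adjoint policy transition operator,

$$\mathcal{T}^{\pi}_{\star} d(s,a) = \pi(a \mid s) \sum_{\tilde{s}, \tilde{a}} T(s \mid \tilde{s}, \tilde{a})\, d(\tilde{s}, \tilde{a}).$$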
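As a sanity check on these definitions, the occupancies of a small tabular MDP can be computed by solving a linear system and then compared against the transpose Bellman equation. The sketch below is illustrative only: the random MDP, the array layout `T[s, a, s_next]`, and the helper name `state_action_occupancy` are assumptions of this example and do not come from the paper.

```python
import numpy as np

def state_action_occupancy(T, pi, mu0, gamma):
    """Discounted occupancy d(s, a) of policy pi in a tabular MDP.

    T[s, a, s2] = P(s2 | s, a), pi[s, a] = pi(a | s), mu0[s] = initial distribution.
    """
    n_states = pi.shape[0]
    # State-to-state kernel under pi: P[s, s2] = sum_a pi(a | s) T(s2 | s, a).
    P = np.einsum("sa,sat->st", pi, T)
    # State occupancy solves d = (1 - gamma) * mu0 + gamma * P^T d.
    d_s = np.linalg.solve(np.eye(n_states) - gamma * P.T, (1.0 - gamma) * mu0)
    return d_s[:, None] * pi  # d(s, a) = d(s) * pi(a | s)

# Hypothetical random MDP used only for the check.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=-1, keepdims=True)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=-1, keepdims=True)
mu0 = np.full(n_states, 1.0 / n_states)

d = state_action_occupancy(T, pi, mu0, gamma)

# Right-hand side of the transpose Bellman equation.
inflow = np.einsum("sat,sa->t", T, d)  # sum over (s~, a~) of T(s | s~, a~) d(s~, a~)
rhs = (1.0 - gamma) * mu0[:, None] * pi + gamma * pi * inflow[:, None]
print(np.allclose(d, rhs), d.sum())
```

The same linear solve gives the state occupancy directly as `d.sum(axis=1)`, which is the quantity SMODICE matches against the expert.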


Divergences and Fenchel conjugates. Next, we briefly introduce $f$-divergences and their Fenchel conjugates.

Definition 1 ($f$-divergence).

Given a continuous, convex function $f$ and two probability distributions $p, q \in \Delta(\mathcal{X})$ over a domain $\mathcal{X}$, the $f$-divergence of $p$ at $q$ is

$$D_f(p \,\|\, q) = \mathbb{E}_{x \sim q}\left[ f\left( \frac{p(x)}{q(x)} \right) \right].$$

A common $f$-divergence in machine learning is the KL-divergence, which corresponds to $f(x) = x \log x$. Now, we introduce the Fenchel conjugate for $f$-divergences.

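Definition 2 (Fenchel conjugate).

Given a vector space $\mathcal{X}$ with inner-product $\langle \cdot, \cdot \rangle$, the Fenchel conjugate $f_\star$ of a function $f: \mathcal{X} \to \mathbb{R}$ is

$$f_\star(y) = \max_{x \in \mathcal{X}} \; \langle x, y \rangle - f(x).$$

For an $f$-divergence, under mild realizability assumptions (Dai et al., 2016) on $f$, the Fenchel conjugate of $D_f(p \,\|\, q)$ at $y: \mathcal{X} \to \mathbb{R}$ is

$$D_{\star,f}(y) = \max_{p} \; \mathbb{E}_{x \sim p}[y(x)] - D_f(p \,\|\, q) = \mathbb{E}_{x \sim q}\left[ f_\star(y(x)) \right]. \tag{7}$$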
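To make the two definitions concrete, the following sketch evaluates the KL- and $\chi^2$-divergences on discrete distributions and checks the Fenchel-Young inequality $\mathbb{E}_{q}[f_\star(y)] \ge \mathbb{E}_{p}[y] - D_f(p \,\|\, q)$ that underlies the conjugate form. The generator $f(x) = \tfrac{1}{2}(x-1)^2$ for the $\chi^2$-divergence (including the factor $\tfrac{1}{2}$), the function names, and the random test distributions are assumptions of this example.

```python
import numpy as np

# Generator functions f and their conjugates f_star (assumed conventions).
def f_kl(x):
    return x * np.log(x)

def f_kl_conj(y):
    return np.exp(y - 1.0)

def f_chi2(x):
    return 0.5 * (x - 1.0) ** 2

def f_chi2_conj(y):
    return 0.5 * y ** 2 + y

def f_divergence(f, p, q):
    """D_f(p || q) = E_{x ~ q}[ f(p(x) / q(x)) ] for discrete p, q with q > 0."""
    return float(np.sum(q * f(p / q)))

rng = np.random.default_rng(0)
p = rng.random(5) + 0.1
p /= p.sum()
q = rng.random(5) + 0.1
q /= q.sum()
y = rng.normal(size=5)  # an arbitrary test function y(x)

kl = f_divergence(f_kl, p, q)
chi2 = f_divergence(f_chi2, p, q)

# Fenchel-Young gaps: E_q[f_star(y)] - (E_p[y] - D_f(p || q)) >= 0 for any y.
gap_kl = float(np.sum(q * f_kl_conj(y)) - (np.sum(p * y) - kl))
gap_chi2 = float(np.sum(q * f_chi2_conj(y)) - (np.sum(p * y) - chi2))

# The gap closes at y = f'(p / q); for the chi^2 generator, f'(x) = x - 1.
y_tight = p / q - 1.0
gap_tight = float(np.sum(q * f_chi2_conj(y_tight)) - (np.sum(p * y_tight) - chi2))
print(kl, chi2, gap_kl, gap_chi2, gap_tight)
```

The tight case illustrates why the optimal density ratio can be read off from the derivative of the conjugate, which is the mechanism later used to recover importance weights.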


Offline imitation learning. Many imitation learning approaches rely on minimizing the $f$-divergence between the state-action occupancies of the imitator and the expert (Ho and Ermon, 2016; Ke et al., 2020; Ghasemipour et al., 2019):

$$\min_{\pi} \; D_f\left(d^{\pi}(s,a) \,\|\, d^{E}(s,a)\right).$$

In imitation learning, we do not have $d^{E}$; instead, we are provided with expert demonstrations $\mathcal{D}^E$.

In offline imitation learning, the agent further cannot interact with the MDP $\mathcal{M}$; instead, they are given a static dataset of logged transitions $\mathcal{D}^O = \{\tau_i\}_{i=1}^{N}$, where each trajectory $\tau_i = (s_0^i, a_0^i, s_1^i, \ldots)$ with $s_0^i \sim \mu_0$; we denote the empirical state-action occupancies of $\mathcal{D}^O$ as $d^{O}$.

3 The SMODICE Algorithm

SMODICE aims to minimize a state-occupancy matching objective based on the KL-divergence:

$$\min_{\pi} \; D_{\mathrm{KL}}\left(d^{\pi}(s) \,\|\, d^{E}(s)\right). \tag{10}$$

Later, we show that we can optimize any $f$-divergence that upper bounds KL-divergence, as the conjugate of the KL-divergence may be numerically unstable.

Minimizing (10) requires on-policy samples from $\pi$, as the expectation is over $d^{\pi}$. To derive an offline objective, we first derive an upper bound involving the offline dataset.

First, we assume expert coverage of the offline data:

Assumption 1.

$d^{O}(s) > 0$ whenever $d^{E}(s) > 0$.

This assumption ensures that the offline dataset has coverage over the expert state-marginal, and is necessary for imitation learning to succeed. Whereas prior offline RL approaches (Kumar et al., 2020; Ma et al., 2021b) assume full coverage of the state-action space, our assumption is considerably weaker since it only requires expert coverage. (Footnote 2: Furthermore, it is not needed in practice, and is only required for our technical development to ensure that all state-occupancy quantities are well-defined, i.e., no division-by-zero.) Given this assumption, we proceed to derive the offline upper bound on state-occupancy matching:

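Theorem 1.

Given Assumption 1, we have

$$D_{\mathrm{KL}}\left(d^{\pi}(s) \,\|\, d^{E}(s)\right) \le \mathbb{E}_{s \sim d^{\pi}}\left[ \log \frac{d^{O}(s)}{d^{E}(s)} \right] + D_{\mathrm{KL}}\left(d^{\pi}(s,a) \,\|\, d^{O}(s,a)\right). \tag{11}$$

Furthermore, for any $f$-divergence such that $D_f \ge D_{\mathrm{KL}}$,

$$D_{\mathrm{KL}}\left(d^{\pi}(s) \,\|\, d^{E}(s)\right) \le \mathbb{E}_{s \sim d^{\pi}}\left[ \log \frac{d^{O}(s)}{d^{E}(s)} \right] + D_{f}\left(d^{\pi}(s,a) \,\|\, d^{O}(s,a)\right). \tag{12}$$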


Proofs are in Appendix A. Intuitively, this says that offline state-occupancy matching can be achieved by matching states in the offline data that resemble expert states (the first term), while remaining in the support of the offline state-action distribution (the second term). Replacing KL-divergence with other divergences can be useful since the conjugate of KL divergence involves a log-sum-exp, which has been found to be numerically unstable in many RL tasks (Zhu et al., 2020; Lee et al., 2021; Rudner et al., 2021).
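The KL form of this bound can be checked numerically. The sketch below assumes the bound reads $D_{\mathrm{KL}}(d^{\pi}(s) \| d^{E}(s)) \le \mathbb{E}_{s \sim d^{\pi}}[\log(d^{O}(s)/d^{E}(s))] + D_{\mathrm{KL}}(d^{\pi}(s,a) \| d^{O}(s,a))$ and uses arbitrary random distributions with full support in place of real occupancies; all variable names are illustrative.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions of the same shape."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
n_states, n_actions = 6, 3

def random_joint():
    d = rng.random((n_states, n_actions)) + 0.05
    return d / d.sum()

d_pi = random_joint()    # stand-in for the imitator occupancy d^pi(s, a)
d_off = random_joint()   # stand-in for the offline occupancy d^O(s, a)
d_exp = rng.random(n_states) + 0.05
d_exp /= d_exp.sum()     # stand-in for the expert state occupancy d^E(s)

d_pi_s = d_pi.sum(axis=1)
d_off_s = d_off.sum(axis=1)

lhs = kl(d_pi_s, d_exp)
reward_term = float(np.sum(d_pi_s * np.log(d_off_s / d_exp)))  # first term
support_term = kl(d_pi, d_off)                                   # second term
upper = reward_term + support_term
print(lhs, upper)
```

The slack in the bound is exactly the difference between the state-action KL and the state-marginal KL to the offline data, so the bound is tight when the two policies choose actions identically in every state.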

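Note that (12) requires samples from $d^{\pi}$, so it still cannot be easily optimized without online interaction. To address this, we first rewrite it as an optimization problem over the space of valid state-action occupancies (Puterman, 2014):

$$\max_{d \ge 0} \; -D_f\left(d(s,a) \,\|\, d^{O}(s,a)\right) + \mathbb{E}_{s \sim d}\left[ \log \frac{d^{E}(s)}{d^{O}(s)} \right] \tag{13}$$

$$\text{s.t.} \quad \sum_{a} d(s,a) = (1-\gamma)\,\mu_0(s) + \gamma\, \mathcal{T}_\star d(s), \quad \forall s \in S, \tag{14}$$

where $\mathcal{T}_\star d(s) = \sum_{\tilde{s}, \tilde{a}} T(s \mid \tilde{s}, \tilde{a})\, d(\tilde{s}, \tilde{a})$; here, (14) ensures that $d$ is the occupancy distribution for some policy. We assume that (13) is strictly feasible.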

Assumption 2.

There exists at least one $d$ such that constraints (14) are satisfied and $d(s) > 0$ for all $s \in S$.

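This assumption is mild and can be satisfied in practice for any MDP for which every state is reachable from the initial state distribution. Next, we can form the dual of (13):

$$\max_{d \ge 0} \min_{V} \; \mathbb{E}_{s \sim d}\left[ \log \frac{d^{E}(s)}{d^{O}(s)} \right] - D_f\left(d \,\|\, d^{O}\right) + \sum_{s} V(s) \left( (1-\gamma)\,\mu_0(s) + \gamma\, \mathcal{T}_\star d(s) - \sum_{a} d(s,a) \right), \tag{15}$$

where $V(s)$ are the Lagrangian multipliers. Now, because $\mathcal{T}_\star$ is the adjoint of $\mathcal{T}$, with $\mathcal{T}V(s,a) = \mathbb{E}_{s' \sim T(\cdot \mid s,a)}[V(s')]$, we have the following:

$$\sum_{s} V(s)\, \mathcal{T}_\star d(s) = \sum_{s,a} d(s,a)\, \mathcal{T}V(s,a). \tag{16}$$

Using this equation, we can write (15) as

$$\max_{d \ge 0} \min_{V} \; (1-\gamma)\, \mathbb{E}_{s \sim \mu_0}[V(s)] + \mathbb{E}_{(s,a) \sim d}\left[ \log \frac{d^{E}(s)}{d^{O}(s)} + \gamma\, \mathcal{T}V(s,a) - V(s) \right] - D_f\left(d \,\|\, d^{O}\right). \tag{17}$$

We note that the original problem (13) is convex (Lee et al., 2021). By Assumption 2, it is strictly feasible, so by strong duality, we can change the order of optimization in (17):

$$\min_{V} \max_{d \ge 0} \; (1-\gamma)\, \mathbb{E}_{s \sim \mu_0}[V(s)] + \mathbb{E}_{(s,a) \sim d}\left[ R(s) + \gamma\, \mathcal{T}V(s,a) - V(s) \right] - D_f\left(d \,\|\, d^{O}\right), \tag{18}$$

where $R(s) = \log \frac{d^{E}(s)}{d^{O}(s)}$. We briefly discuss how to compute $R(s)$. In the tabular case, $R(s)$ can be computed using empirical estimates of $d^{E}(s)$ and $d^{O}(s)$. In the continuous case, we can train a discriminator $c: S \to (0,1)$:

$$\max_{c} \; \mathbb{E}_{s \sim d^{E}}\left[ \log c(s) \right] + \mathbb{E}_{s \sim d^{O}}\left[ \log\left(1 - c(s)\right) \right]. \tag{19}$$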


The optimal discriminator is $c^*(s) = \frac{d^{E}(s)}{d^{E}(s) + d^{O}(s)}$ (Goodfellow et al., 2014), so we can use $R(s) = \log \frac{c^*(s)}{1 - c^*(s)}$.
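A small sketch of how a discriminator output can be turned into the reward $R(s) = \log(d^{E}(s)/d^{O}(s))$. It assumes the convention in which the discriminator outputs the probability that a state comes from the expert, so that the optimum is $c^*(s) = d^{E}(s) / (d^{E}(s) + d^{O}(s))$; under the opposite labeling the sign of the reward flips. The function names and the toy distributions are hypothetical.

```python
import numpy as np

def reward_from_probs(c, eps=1e-8):
    """R = log(c / (1 - c)), with clipping to avoid log(0)."""
    c = np.clip(np.asarray(c, dtype=float), eps, 1.0 - eps)
    return np.log(c) - np.log1p(-c)

def reward_from_logits(logits):
    """If c = sigmoid(logit), then log(c / (1 - c)) is the logit itself."""
    return np.asarray(logits, dtype=float)

# Toy tabular check with three states (values chosen arbitrarily).
d_exp = np.array([0.7, 0.2, 0.1])   # expert state occupancy
d_off = np.array([0.2, 0.3, 0.5])   # offline state occupancy
c_star = d_exp / (d_exp + d_off)    # optimal discriminator under the assumed convention

R = reward_from_probs(c_star)
R_from_logits = reward_from_logits(np.log(c_star / (1.0 - c_star)))
print(R, np.log(d_exp / d_off))
```

Working with the pre-sigmoid logit avoids the clipping step entirely, which may matter when the discriminator becomes very confident.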

Finally, using the Fenchel conjugate, (18) can be reduced to a single unconstrained optimization problem over $V$ that depends on samples from $d^{O}$ only and not $d$; we also obtain the importance weight of the state-occupancy of the optimal policy with respect to the offline data.

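Theorem 2.

The optimization problem (18) is equivalent to

$$\min_{V} \; (1-\gamma)\, \mathbb{E}_{s \sim \mu_0}[V(s)] + \mathbb{E}_{(s,a) \sim d^{O}}\left[ f_\star\left( R(s) + \gamma\, \mathcal{T}V(s,a) - V(s) \right) \right]. \tag{20}$$

Furthermore, given the optimal solution $V^*$, the optimal state-occupancy importance weights are

$$\xi^*(s,a) := \frac{d^*(s,a)}{d^{O}(s,a)} = f'_\star\left( R(s) + \gamma\, \mathcal{T}V^*(s,a) - V^*(s) \right). \tag{21}$$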


This result can be viewed as using Fenchel duality to generalize prior DICE-based offline approaches (Lee et al., 2021; Anonymous, 2022). In particular, the inner maximization problem in (18) is precisely the Fenchel conjugate of $D_f(d \,\|\, d^{O})$ at $R + \gamma \mathcal{T}V - V$ (compare (18) to (7)). Similarly, (21) can be derived from leveraging the relationship between the optimal solutions of a pair of Fenchel primal-dual problems. This generality allows us to choose problem-specific $f$-divergences that improve stability during optimization. In Appendix C, we specialize the SMODICE objective for the KL- and $\chi^2$-divergences, which we use in our experiments.
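The value objective and the importance weights can be sketched on a batch of offline transitions as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the expectation over next states is replaced by a single sampled next-state value, the $\chi^2$ generator is taken as $f(x) = \tfrac{1}{2}(x-1)^2$, the KL conjugate is used in its naive form $f_\star(y) = e^{y-1}$ rather than any stabilized variant, the weights are clipped at zero to keep them nonnegative, and all function and argument names are hypothetical.

```python
import numpy as np

# Conjugates f_star and their derivatives for two divergences (assumed conventions).
def chi2_conj(y):
    return 0.5 * y ** 2 + y

def chi2_conj_grad(y):
    return y + 1.0

def kl_conj(y):
    return np.exp(y - 1.0)

def kl_conj_grad(y):
    return np.exp(y - 1.0)

def smodice_value_loss(v_init, v, v_next, reward, gamma, conj):
    """(1 - gamma) E[V(s0)] + E[f_star(R(s) + gamma V(s') - V(s))] on a batch."""
    residual = reward + gamma * v_next - v
    return float((1.0 - gamma) * np.mean(v_init) + np.mean(conj(residual)))

def importance_weights(v, v_next, reward, gamma, conj_grad):
    """xi = f_star'(R(s) + gamma V(s') - V(s)), clipped at zero."""
    residual = reward + gamma * v_next - v
    return np.maximum(conj_grad(residual), 0.0)

# Hypothetical batch of value predictions and discriminator rewards.
rng = np.random.default_rng(0)
batch = 4
reward = rng.normal(size=batch)
v = rng.normal(size=batch)
v_next = rng.normal(size=batch)
v_init = rng.normal(size=batch)

loss = smodice_value_loss(v_init, v, v_next, reward, 0.99, chi2_conj)
weights = importance_weights(v, v_next, reward, 0.99, chi2_conj_grad)
print(loss, weights)
```

In a deep RL setting the same two functions would be applied to network outputs, with the loss minimized over the value network's parameters by SGD.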

Finally, using the optimal importance weights, we can extract the optimal policy using weighted Behavior Cloning:

$$\min_{\pi} \; -\mathbb{E}_{(s,a) \sim d^{O}}\left[ \xi^*(s,a) \log \pi(a \mid s) \right], \tag{22}$$

where $\xi^*(s,a) = \frac{d^*(s,a)}{d^{O}(s,a)}$. Here, $V^*$ can be viewed as the value function: it is trained by minimizing a convex function of the Bellman residuals and the values of the initial states. Then, it can be used to inform policy learning.
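For a tabular policy, weighted Behavior Cloning has a simple closed form: maximizing the weighted log-likelihood $\sum_i \xi_i \log \pi(a_i \mid s_i)$ gives $\pi(a \mid s)$ proportional to the total weight of the offline transitions with that state and action. The sketch below illustrates this; the helper name, the fallback to a uniform policy in states with zero total weight, and the toy data are assumptions of this example.

```python
import numpy as np

def weighted_bc_tabular(states, actions, weights, n_states, n_actions):
    """Weighted maximum-likelihood tabular policy from offline (s, a) pairs."""
    counts = np.zeros((n_states, n_actions))
    # Accumulate importance weights per (state, action) pair.
    np.add.at(counts, (states, actions), weights)
    totals = counts.sum(axis=1, keepdims=True)
    uniform = np.full((n_states, n_actions), 1.0 / n_actions)
    # States with no weighted data fall back to a uniform policy (an assumption).
    return np.where(totals > 0, counts / np.maximum(totals, 1e-12), uniform)

# Toy offline data: five transitions over three states and two actions.
states = np.array([0, 0, 0, 1, 1])
actions = np.array([0, 1, 1, 0, 1])
weights = np.array([3.0, 0.5, 0.5, 0.0, 2.0])

policy = weighted_bc_tabular(states, actions, weights, n_states=3, n_actions=2)
print(policy)
```

Transitions with zero weight are ignored entirely, which is how the importance weights filter low-quality offline data out of the cloning step.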

Putting everything together, SMODICE can achieve stable policy learning through a sequence of three disjoint supervised learning problems, summarized in Algorithm 1. The full pseudo-code is in Algorithm 3 in Appendix 3.

Algorithm 1 SMODICE
// Discriminator Learning
Train discriminator $c^*$ using (19)
// Value Learning
Train derived value function $V^*$ using (20)
// Policy Learning
Derive optimal ratios $\xi^*$ through (21)
Train policy $\pi$ using weighted BC (22)

SMODICE for Tabular MDPs. An appealing property of SMODICE is that it admits a closed-form analytic solution in the tabular case. The proof is given in Appendix D.

Theorem 3.

Let , and define and by and . Additionally, denote and . Then, choosing the $\chi^2$-divergence in (20), we have
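One way to see why a closed form exists is to note that, with a quadratic conjugate, the value objective is a quadratic function of $V$ and its stationarity condition is a linear system. The sketch below works this out under assumptions that are ours and may differ from the theorem's exact statement: the generator is $f(x) = \tfrac{1}{2}(x-1)^2$ so that $f_\star(y) = \tfrac{1}{2}y^2 + y$, the matrix $M = \gamma \mathcal{T} - \mathcal{B}$ stacks the operators $(\mathcal{T}V)(s,a) = \sum_{s'} T(s' \mid s,a) V(s')$ and $(\mathcal{B}V)(s,a) = V(s)$, and the nonnegativity constraint on the occupancy is ignored when solving. Setting the gradient to zero gives $M^\top D M V = -(1-\gamma)\mu_0 - M^\top D (R + 1)$ with $D = \mathrm{diag}(d^{O})$.

```python
import numpy as np

def occupancy(T, pi, mu0, gamma):
    """Discounted state-action occupancy of pi (T[s, a, s2] = P(s2 | s, a))."""
    P = np.einsum("sa,sat->st", pi, T)
    d_s = np.linalg.solve(np.eye(len(mu0)) - gamma * P.T, (1.0 - gamma) * mu0)
    return d_s[:, None] * pi

def tabular_smodice_chi2(T, d_off, d_exp_s, mu0, gamma):
    """Solve the quadratic value objective exactly (assumed chi^2 convention)."""
    n_s, n_a, _ = T.shape
    R = np.log(d_exp_s / d_off.sum(axis=1))      # R(s) = log d^E(s) / d^O(s)
    B = np.repeat(np.eye(n_s), n_a, axis=0)      # (B V)(s, a) = V(s)
    M = gamma * T.reshape(n_s * n_a, n_s) - B    # gamma * T - B
    d_vec = d_off.reshape(-1)
    r = B @ R                                    # reward broadcast to (s, a)
    A = M.T @ (d_vec[:, None] * M)
    b = -(1.0 - gamma) * mu0 - M.T @ (d_vec * (r + 1.0))
    V = np.linalg.solve(A, b)
    xi_raw = (r + M @ V + 1.0).reshape(n_s, n_a)  # f_star'(residual), unclipped
    return V, xi_raw, M

# Hypothetical random MDP; offline data from a uniform policy, expert from a peaked one.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 5, 3, 0.9
T = rng.random((n_s, n_a, n_s))
T /= T.sum(axis=-1, keepdims=True)
mu0 = np.full(n_s, 1.0 / n_s)
uniform = np.full((n_s, n_a), 1.0 / n_a)
expert = rng.random((n_s, n_a)) ** 4
expert /= expert.sum(axis=-1, keepdims=True)

d_off = occupancy(T, uniform, mu0, gamma)
d_exp_s = occupancy(T, expert, mu0, gamma).sum(axis=1)

V, xi_raw, M = tabular_smodice_chi2(T, d_off, d_exp_s, mu0, gamma)
d_star = xi_raw * d_off
# The stationarity condition coincides with the Bellman flow constraint on d_star.
flow_violation = (1.0 - gamma) * mu0 + M.T @ d_star.reshape(-1)
xi = np.maximum(xi_raw, 0.0)  # clip before using as behavior-cloning weights

# Sanity case: if the expert matches the offline state occupancy, nothing is reweighted.
V0, xi0, _ = tabular_smodice_chi2(T, d_off, d_off.sum(axis=1), mu0, gamma)
print(np.abs(flow_violation).max(), np.abs(V0).max())
```

The sanity case at the end reflects the intuition that, when the reward is identically zero, the offline occupancy already satisfies the flow constraint, so the solver returns unit weights.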


4 Offline Imitation Learning from Examples

Next, we describe how SMODICE can be applied to offline imitation learning from examples. Starting from the original problem objective from Eysenbach et al. (2021), we derive a state-occupancy matching objective, enabling us to apply SMODICE without any modification.

Problem setting. We assume given success examples , where indicates whether the current state is a success outcome, and offline data . Here, is the state distribution of the “user” providing success examples. Then, Eysenbach et al. (2021) propose the example-based RL objective


That is, we want a policy that maximizes the probability of reaching success states in the future. To tackle this problem in the offline setting, our strategy is to convert (24) into an optimization problem over the state-occupancy space.

Intuition. By parameterizing the problem in terms of state occupancies, a policy that reaches success states in the future is one that has non-zero occupancies at these states—i.e., corresponds to a policy that reaches success states if for . Furthermore, treating success states as absorbing states in the MDP, then should ideally be much larger than (we validate this on gridworld; see Appendix D.1).

Derivation. We first transform the problem into state-occupancy space—i.e.,


which is valid given that the original objective can be thought of as a regular RL problem with reward function  (Eysenbach et al., 2021).

Given this formulation, we can derive a tractable lower bound to (25) through Jensen’s inequality and Bayes’ rule:

We can optimize the original objective by maximizing this lower bound. Doing so is equivalent to solving


which is exactly in the form of the state-occupancy matching objective (10) that SMODICE optimizes. Furthermore, this objective admits an intuitive explanation from a purely imitation learning lens. We can think of the success-example distribution as the state-occupancy distribution of an expert agent who can “teleport” to any success state in one time-step. Therefore, we have shown that example-based RL can be understood as a state-occupancy minimization problem between an MDP-dynamics-abiding imitator and a teleporting expert agent. Consequently, SMODICE can be used in the offline setting without any algorithmic modification.

5 Related Work

Offline imitation learning. The closest work is the concurrent DEMODICE (Anonymous, 2022), a state-action based offline IL method that also uses the DICE paradigm to estimate the occupancy ratio between the expert and the imitator; we overview the DICE literature in Appendix B. Due to its dependence on expert actions, DEMODICE cannot be applied to the three problem settings we study. At a technical level, a key limitation of DEMODICE is that it does not exploit the form of general Fenchel duality and only supports the KL-divergence, forgoing other $f$-divergences that can lead to more stable optimization (Ghasemipour et al., 2019; Ke et al., 2020; Zhu et al., 2020). Another related work is ORIL (Zolna et al., 2020), which adapts GAIL (Ho and Ermon, 2016) to the offline setting. Finally, recent work learns a pessimistic dynamics model using the offline dataset and then performs imitation learning by minimizing the state-action occupancy divergence with respect to the expert inside this learned model (Chang et al., 2021). As with DEMODICE, this approach requires expert actions and cannot be applied to the settings we study.

(a) Mujoco
(b) AntMaze
(c) Franka Kitchen
Figure 3: Illustrations of the evaluation environments.

Imitation from observations, imitation with mismatched experts, and example-based RL. All three of these problems have been studied in the online setting. IfO is often achieved through training an additional inverse dynamics model to infer the expert actions (Torabi et al., 2018, 2019; Liu et al., 2019; Radosavovic et al., 2020); in contrast, SMODICE matches the expert observations by identifying the correct actions supported in the offline data. To handle experts with dynamics mismatch, some work explicitly learns a correspondence between the expert and the imitator MDPs (Kim et al., 2020; Raychaudhuri et al., 2021); however, these approaches make much stronger assumptions on access to the expert MDP that are difficult to satisfy in the offline setting, such as demonstrations from auxiliary tasks. In contrast, SMODICE falls under the category of state-only imitation learning (Liu et al., 2019; Radosavovic et al., 2020), which overcomes expert dynamics differences by only matching the shared task-relevant state space (e.g., $xy$-coordinates for locomotion tasks). Finally, example-based RL was first studied in Eysenbach et al. (2021); they introduce a recursive-classifier based off-policy actor critic method to solve it. By casting this problem as state-occupancy matching between an imitator and a “teleporting” expert agent, SMODICE can solve the offline variant of this problem without modification.

6 Experiments

We experimentally demonstrate that SMODICE is effective for offline IL from observations, mismatched experts, and examples. We give additional experimental details in Appendices G, H, and I, and videos on the project website. Code is also available.

Figure 4: Offline imitation learning from observations results.

6.1 Offline Imitation Learning from Observations

Datasets. We utilize the D4RL (Fu et al., 2021) offline RL dataset. The dataset compositions for all tasks are listed in Table 3 in Appendix G. We consider the following standard Mujoco environments: Hopper, Walker2d, HalfCheetah, and Ant. For each, we take a single expert trajectory from the respective “expert-v2” dataset as the expert dataset and omit the actions. For the offline dataset, following Anonymous (2022), we use a mixture of a small number of expert trajectories ( trajectories) and a large number of low-quality trajectories from the “random-v2” dataset (we use the full random dataset, consisting of around 1 million transitions). This dataset composition is particularly challenging as the learning algorithm must be able to successfully distinguish expert from low-quality data in the offline dataset.

We also include two more challenging environments from D4RL: AntMaze and Franka Kitchen. In AntMaze (Figure 3(b)), an Ant agent is tasked with navigating a U-shaped maze from one end to the other end (i.e., the goal region). The offline dataset (i.e., “antmaze-umaze-v2”) consists of trajectories ( 300k transitions) of an Ant agent navigating to the goal region from initial states. The trajectories are not always successful; often, the Ant flips over to its legs before it reaches the goal. We visualize this dataset on the project website. As above, we additionally include 1 million random-action transitions to increase the task difficulty. We take one trajectory from the offline dataset that successfully reaches the goal to be the expert trajectory. Franka Kitchen (Figure 3(c)), introduced by Gupta et al. (2019), involves controlling a 9-DoF Franka robot to manipulate common household kitchen objects (e.g., microwave, kettle, cabinet) sequentially to achieve a pre-specified configuration of objects. The dataset (i.e., “kitchen-mixed-v0”) consists of undirected human teleoperated demonstrations, meaning that each trajectory only solves a subset of the tasks. Together, these six tasks (illustrated in Figure 3) require scalability to high-dimensional state-action spaces and robustness to different dataset compositions.

Method and baselines. We use SMODICE with $\chi^2$-divergence for all tasks (in other problem settings as well) except Hopper, Walker, and HalfCheetah, where we find SMODICE with KL-divergence to perform better; in Appendix E.2, we explain how to choose the appropriate $f$-divergence offline by monitoring SMODICE’s policy loss. For comparisons, we consider both IfO and regular offline IL methods, which make use of expert actions. For the former, we compare against (i) SAIL-TD3-BC, which combines a state-of-the-art state-matching based online IL algorithm (SAIL) (Liu et al., 2019) with a state-of-the-art offline RL algorithm (TD3-BC) (Fujimoto and Gu, 2021), and (ii) Offline Reinforced Imitation Learning (ORIL) (Zolna et al., 2020), which adapts GAIL (Ho and Ermon, 2016) to the offline setting by using an offline RL algorithm for policy optimization; we implement ORIL using the same state-based discriminator as in SMODICE, and TD3-BC as the offline RL algorithm. (Footnote 4: We chose TD3-BC due to its simplicity and stability.) For the latter, we consider the state-of-the-art DEMODICE (Anonymous, 2022) as well as Behavior Cloning (BC). We train all algorithms for 1 million gradient steps and keep track of the normalized score (i.e., 100 is expert performance, 0 is random-action performance) during training; the normalized score is averaged over 10 independent rollouts. All methods are evaluated over 3 seeds, and one standard-deviation confidence intervals are shaded.
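The normalized score described here is a linear rescaling of the raw return so that random-action performance maps to 0 and expert performance maps to 100. A minimal sketch follows; the reference returns and rollout returns below are made-up placeholder numbers rather than values from the paper or from D4RL, and the function name is hypothetical.

```python
import numpy as np

def normalized_score(returns, random_return, expert_return):
    """Rescale raw returns so that random -> 0 and expert -> 100, then average."""
    returns = np.asarray(returns, dtype=float)
    scores = 100.0 * (returns - random_return) / (expert_return - random_return)
    return float(scores.mean())

# Placeholder reference returns and ten hypothetical evaluation rollouts.
random_return, expert_return = -20.0, 3200.0
rollout_returns = np.linspace(2500.0, 3100.0, num=10)

score = normalized_score(rollout_returns, random_return, expert_return)
print(round(score, 2))
```

Scores above 100 or below 0 are possible under this rescaling whenever a policy outperforms the expert reference or underperforms the random reference.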

Results. As shown in Figure 4, only SMODICE achieves stable and good performance in all six tasks. It achieves (near) expert performance in all the Mujoco environments, performing on par with DEMODICE and doing so without the privileged information of expert actions. SMODICE’s advantage over DEMODICE is more apparent in AntMaze and Kitchen. In the former, SMODICE outperforms BC, while DEMODICE cannot; in the latter, DEMODICE quickly collapses due to its use of KL-divergence, which may be numerically unstable in high-dimensional environments. BC is a strong baseline for tasks where the offline dataset contains (near) expert data (i.e., AntMaze and Kitchen); however, as the dataset becomes more diverse, BC’s performance drops significantly.

SAIL-TD3-BC and ORIL both fail to learn in some environments and otherwise converge to a worse policy than SMODICE. The only exception is AntMaze; however, in Appendix G.2, we show that both methods collapse with a more diverse version of the AntMaze offline dataset, indicating that unlike SMODICE, these methods are highly sensitive to the composition of the offline dataset, and work best with task-aligned offline data. The sub-par performances of SAIL and ORIL highlight the challenges of adapting online IL methods to the offline setting; we hypothesize that it is not sufficient to simply equip the original methods (i.e., SAIL and GAIL) with a strong base offline RL algorithm. Together, these results demonstrate that SMODICE is stable, scalable, and robust, and significantly outperforms prior methods. Finally, in Appendix G.2, we ablate SMODICE by zeroing out its discriminator-based reward to validate that SMODICE’s empirical performance comes from its ability to discriminate expert data in the offline dataset.

Figure 5: Offline imitation learning from mismatched experts results.
Figure 6: Offline imitation learning from examples results.

6.2 Offline IL from Mismatched Experts

Datasets and baselines. We compare SMODICE to SAIL-TD3-BC and ORIL, which are both state-based offline IL methods; in particular, we note that SAIL is originally designed to be robust to mismatched experts. We consider only tasks in which both SAIL-TD3-BC and ORIL obtained non-trivial performance, including HalfCheetah, Ant, and AntMaze. Then, for each environment, we train a mismatched expert and collect one expert trajectory, replacing the original expert trajectory used in Section 6.1. The mismatched experts for the respective tasks are (i) “HalfCheetah-Short”, where the torso of the cheetah agent is halved in length, (ii) “Ant-Disabled”, where the front legs are shrunk by a quarter in length, and (iii) a 2D PointMass agent operating in the same maze configuration. The mismatched experts are illustrated in Figure 11 in Appendix H and the project website. For the first two, we train an expert policy using SAC (Haarnoja et al., 2018) and collect one expert trajectory. The latter task is already in D4RL; thus, we take one trajectory from “maze2d-umaze-v0” as the expert trajectory. Because Ant and PointMass have different state spaces, following Liu et al. (2019), we train the discriminator on the shared $xy$-coordinates of the two state spaces. The offline datasets are identical to the ones in Section 6.1.

Results. The training curves are shown in Figure 5; we illustrate the original maximum performance attained by each method (i.e., using the original expert trajectory, Section 6.1) using dashed lines as points of reference. As can be seen, SMODICE is significantly more robust to mismatched experts than either SAIL-TD3-BC or ORIL. On AntMaze, the task where SAIL-TD3-BC and ORIL originally outperform SMODICE, learning from a PointMass expert significantly deteriorates their performances, and the learned policies are noticeably worse than that of SMODICE, which has the smallest performance drop. The other two tasks exhibit similar trends; SMODICE is able to learn an expert-level policy for the original Ant embodiment using a disabled Ant expert, and is the only method that shows any progress on the hardest HalfCheetah-Short task. Despite using the same discriminator for reward supervision, SMODICE is substantially more robust than ORIL, likely due to the occupancy-constraint term in its objective (12), which ensures that the learned policy is supported by the offline data as it attempts to match the expert states. On the project website, we visualize SMODICE and ORIL policies on all tasks. In Appendix H.2, we provide additional quantitative analysis of Figure 5.

6.3 Offline Imitation Learning from Examples

Tasks. We use the AntMaze and Kitchen environments and create example-based task variants. For AntMaze, we replace the full demonstration with a small set of success states (i.e., Ant in the goal region) extracted from the offline data. For Kitchen, we consider two subtasks in the environment, Kettle and Microwave, and define task success to be only whether the specified object is correctly placed (instead of all objects as in the original task); the success states are extracted from the offline data accordingly. Examples of the success states are illustrated in Figure 13 in Appendix I. Note that the kitchen dataset contains many trajectories where the kettle is moved first. Thus, the kettle task is easy even for Behavior Cloning (BC), since cloning the offline data can lead to success. This is not the case for the microwave task, making it much more difficult to solve using only success examples. In addition, we introduce the PointMass-4Direction environment. Here, a 2D PointMass agent is tasked with navigating to the middle point of a specified edge of the square that encloses the agent (see Figure 13(a)). The offline dataset is generated using a waypoint navigator controlling the agent to each of the four possible goals and contains equally many trajectories for each goal; we visualize this dataset on the project website. At training and evaluation time, we set the left edge to be the desired edge and collect success states from the offline data accordingly. This task is low-dimensional but consists of multi-task offline data, making it challenging for algorithms such as BC that do not solve the example-based RL objective.

Figure 7: SMODICE weights.

Approaches. We make no modification to SMODICE; the only difference is that the discriminator is trained using success states instead of full expert state trajectories. Our main comparison is RCE-TD3-BC, which combines RCE (Eysenbach et al., 2021), the state-of-art online example-based RL method, and TD3-BC. We also compare against ORIL (Zolna et al., 2020), using the same architecture as in Section 6.1. Finally, we also include BC.

Results. As shown in Figure 6, SMODICE is the best performing method on all four tasks and is the only one that can solve the Microwave task; we visualize all methods’ policies on all tasks on the project website. RCE-TD3-BC is able to solve the first three tasks, but achieves worse solutions and exhibits substantial performance fluctuation during training; we posit that the optimization for RCE, which requires alternate updates to a recursive classifier and a policy, is substantially more difficult than that of SMODICE. ORIL is unstable and fails to make progress in most tasks. Interestingly, as in the mismatched expert setting, on AntMaze, ORIL’s performance is far below that of SMODICE, despite attaining better results originally (Figure 4). This comparison demonstrates the versatility of SMODICE afforded by its state-occupancy matching objective; in contrast, ORIL treats offline IL from examples as an offline RL task with discriminator-based reward and cannot solve the task.

To better understand SMODICE, on PointMass-4Direction, we visualize the importance weights it assigns to the offline dataset. As shown in Figure 7, SMODICE assigns much higher weights to transitions along the correct path from the initial state region to the success examples. Interestingly, the weights progressively decrease along this path, indicating that SMODICE has learned that it must pay more attention to transitions at the beginning of the path, since making a mistake there is more likely to derail progress towards the goal. This behavior occurs automatically via SMODICE’s state-matching objective without any additional bias.
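As a rough illustration of how such per-transition weights could be computed and used, the sketch below assumes the KL-divergence variant, in which the weights are taken to be a softmax over the residual R(s) + γV(s') − V(s), followed by weighted behavior cloning. The reward and value numbers are hand-picked toy inputs, not learned quantities, and the function names are hypothetical.

```python
import math

def residuals(rewards, values, next_values, gamma=0.99):
    """Per-transition residual R(s) + gamma * V(s') - V(s)."""
    return [r + gamma * vn - v for r, v, vn in zip(rewards, values, next_values)]

def softmax_weights(res):
    """Numerically stable softmax over the dataset (assumed KL-variant weights)."""
    m = max(res)
    exps = [math.exp(x - m) for x in res]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_bc_loss(weights, log_probs):
    """Weighted negative log-likelihood of the dataset actions under the policy."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))

# Toy inputs: four transitions with zero values, so the residual equals the reward.
rewards = [-2.0, -1.0, 0.0, -3.0]
values = [0.0, 0.0, 0.0, 0.0]
next_values = [0.0, 0.0, 0.0, 0.0]
weights = softmax_weights(residuals(rewards, values, next_values))

# Hypothetical policy log-probabilities of the dataset actions.
log_probs = [math.log(0.5)] * 4
print(weights, weighted_bc_loss(weights, log_probs))
```

Transitions with larger residuals dominate the cloning loss, which is the mechanism by which off-path data from the other three goals would be down-weighted.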

7 Conclusion

We have proposed SMODICE, a simple, stable, and versatile algorithm for offline imitation learning from observations, mismatched experts, and examples. Leveraging Fenchel duality, SMODICE optimizes a state-occupancy matching objective that enjoys a closed-form tabular solution and stable optimization with deep neural networks. Through extensive experiments, we have shown that SMODICE significantly outperforms prior state-of-the-art methods in all three settings. We believe that the generality of SMODICE invites many future work directions, including offline model-based RL (Yu et al., 2020; Kidambi et al., 2020), safe RL (Ma et al., 2021a), and extending it to visual domains.


We thank members of Penn Perception, Action, and Learning group for their feedback. This work is funded in part by an Amazon Research Award, gift funding from NEC Laboratories America, NSF Award CCF-1910769, NSF Award CCF-1917852 and ARO Award W911NF-20-1-0080. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.


  • Anonymous (2022) DemoDICE: offline imitation learning with supplementary imperfect demonstrations. Submitted to The Tenth International Conference on Learning Representations. Under review.
  • L. Baird (1995) Residual algorithms: reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37.
  • S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press.
  • J. D. Chang, M. Uehara, D. Sreenivas, R. Kidambi, and W. Sun (2021) Mitigating covariate shift in imitation learning via offline data without great coverage. arXiv preprint arXiv:2106.03207.
  • B. Dai, N. He, Y. Pan, B. Boots, and L. Song (2016) Learning from conditional distributions via dual embeddings. arXiv preprint arXiv:1607.04579.
  • B. Dai, O. Nachum, Y. Chow, L. Li, C. Szepesvári, and D. Schuurmans (2020) CoinDICE: off-policy confidence interval estimation. arXiv preprint arXiv:2010.11652.
  • B. Eysenbach, S. Levine, and R. Salakhutdinov (2021) Replacing rewards with examples: example-based policy search via recursive classification. In Thirty-Fifth Conference on Neural Information Processing Systems.
  • J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine (2021) D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
  • S. Fujimoto and S. S. Gu (2021) A minimalist approach to offline reinforcement learning. arXiv preprint arXiv:2106.06860.
  • S. K. S. Ghasemipour, R. Zemel, and S. Gu (2019) A divergence minimization perspective on imitation learning methods. arXiv preprint arXiv:1911.02256.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661.
  • A. Gupta, V. Kumar, C. Lynch, S. Levine, and K. Hausman (2019) Relay policy learning: solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint arXiv:1910.11956.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
  • C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant (2020) Array programming with NumPy. Nature 585 (7825), pp. 357–362.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. arXiv preprint arXiv:1606.03476.
  • L. Ke, S. Choudhury, M. Barnes, W. Sun, G. Lee, and S. Srinivasa (2020) Imitation learning as f-divergence minimization. arXiv preprint arXiv:1905.12888.
  • R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims (2020) MOReL: model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951.
  • K. Kim, Y. Gu, J. Song, S. Zhao, and S. Ermon (2020) Domain adaptive imitation learning. In International Conference on Machine Learning, pp. 5286–5295.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson (2018) Discriminator-actor-critic: addressing sample inefficiency and reward bias in adversarial imitation learning. arXiv preprint arXiv:1809.02925.
  • I. Kostrikov, O. Nachum, and J. Tompson (2020) Imitation learning via off-policy distribution matching. In International Conference on Learning Representations.
  • A. Kumar, J. Fu, G. Tucker, and S. Levine (2019) Stabilizing off-policy Q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949.
  • A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779.
  • S. Lange, T. Gabel, and M. Riedmiller (2012) Batch reinforcement learning. In Reinforcement learning, pp. 45–73.
  • J. Lee, W. Jeon, B. Lee, J. Pineau, and K. Kim (2021) OptiDICE: offline policy optimization via stationary distribution correction estimation. arXiv preprint arXiv:2106.10783.
  • S. Levine, A. Kumar, G. Tucker, and J. Fu (2020) Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
  • F. Liu, Z. Ling, T. Mu, and H. Su (2019) State alignment-based imitation learning. arXiv preprint arXiv:1911.10947.
  • Y. J. Ma, A. Shen, O. Bastani, and D. Jayaraman (2021a) Conservative and adaptive penalty for model-based safe reinforcement learning. arXiv preprint arXiv:2112.07701.
  • Y. Ma, D. Jayaraman, and O. Bastani (2021b) Conservative offline distributional reinforcement learning. Advances in Neural Information Processing Systems 34.
  • O. Nachum, Y. Chow, B. Dai, and L. Li (2019a) DualDICE: behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733.
  • O. Nachum, B. Dai, I. Kostrikov, Y. Chow, L. Li, and D. Schuurmans (2019b) AlgaeDICE: policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074.
  • O. Nachum and B. Dai (2020) Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866.
  • M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
  • I. Radosavovic, X. Wang, L. Pinto, and J. Malik (2020) State-only imitation learning for dexterous manipulation. arXiv preprint arXiv:2004.04650.
  • D. S. Raychaudhuri, S. Paul, J. van Baar, and A. K. Roy-Chowdhury (2021) Cross-domain imitation from observations. arXiv preprint arXiv:2105.10037.
  • S. Ross, G. J. Gordon, and J. A. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. arXiv preprint arXiv:1011.0686.
  • T. G. J. Rudner, C. Lu, M. Osborne, Y. Gal, and Y. W. Teh (2021) On pathologies in KL-regularized reinforcement learning from expert demonstrations. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.).
  • F. Torabi, G. Warnell, and P. Stone (2018) Behavioral cloning from observation. arXiv preprint arXiv:1805.01954.
  • F. Torabi, G. Warnell, and P. Stone (2019) Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158.
  • C. Yang, X. Ma, W. Huang, F. Sun, H. Liu, J. Huang, and C. Gan (2019) Imitation learning from observations by minimizing inverse dynamics disagreement. arXiv preprint arXiv:1910.04417.
  • T. Yu, G. Thomas, L. Yu, S. Ermon, J. Zou, S. Levine, C. Finn, and T. Ma (2020) MOPO: model-based offline policy optimization. arXiv preprint arXiv:2005.13239.
  • R. Zhang, B. Dai, L. Li, and D. Schuurmans (2020) GenDICE: generalized offline estimation of stationary values. In International Conference on Learning Representations.
  • Z. Zhu, K. Lin, B. Dai, and J. Zhou (2020) Off-policy imitation learning from observations. Advances in Neural Information Processing Systems 33.
  • K. Zolna, A. Novikov, K. Konyushkova, C. Gulcehre, Z. Wang, Y. Aytar, M. Denil, N. de Freitas, and S. Reed (2020) Offline learning from demonstrations and unlabeled experience. arXiv preprint arXiv:2011.13885.

Appendix A Proofs

A.1 Technical Lemmas

Lemma 1.

We have


We first state and prove a related lemma, which first appeared in (Yang et al., 2019).

Lemma 2.

Using this result, we can show the desired upper bound:

A.2 Proof of Theorem 1


where the last step follows from Lemma 1. Then, for any , we have that

A.3 Proof of Theorem 2


We begin with


We have that


where the last step follows from recognizing that the inner-maximization is precisely the Fenchel conjugate of at .

To show the relationship among and , we recognize that (30) and (13) are a pair of Fenchel primal-dual problems.

Lemma 3.

is the Fenchel dual to


We define the indicator function as

Then, we define as . Then, it can be shown that the Fenchel conjugate of is . In addition, we denote ; then, . Finally, define matrix operator . Using these notations, we can write (30) as


Then, we proceed to derive the Fenchel dual of (33):


where (35) follows from applying Fenchel conjugacy to , (37) follows from strong duality, (38) follows from the property of an adjoint operator, and (40) follows from applying Fenchel conjugacy to . Here, we recognize that (40) is precisely the optimization problem (31)-(32), where we have moved the constraint (32) to the objective as the indicator function :

Given Lemma 3, we use the fact that and admit the following relationship:


This follows from the characterization of the optimal solutions for a pair of Fenchel primal-dual problems with convex and linear operator (Nachum and Dai, 2020). In this case, assuming that we can exchange the order of expectation and derivative (e.g., the conditions of the Dominated Convergence Theorem hold), we have


or equivalently,


as desired. ∎
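For reference, the optimality relation used in this last step is an instance of a standard fact about conjugate pairs (the Fenchel-Young equality). It is stated here in generic placeholder notation, with $f$ a differentiable, strictly convex function and $f_\star$ its Fenchel conjugate; these symbols are illustrative and are not meant to reproduce the paper's numbered equations.

```latex
\begin{align*}
f_\star(y) &= \sup_{x}\,\bigl\{\, xy - f(x) \,\bigr\}, \\
f(x) + f_\star(y) &\ge xy \quad \text{for all } x, y, \\
f(x) + f_\star(y) &= xy
  \;\Longleftrightarrow\; y = f'(x)
  \;\Longleftrightarrow\; x = f_\star'(y).
\end{align*}
```

Read in this light, a primal optimum is recovered from a dual optimum by applying $f_\star'$ to the dual argument, which is the shape of the relationship stated after Lemma 3.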

Appendix B Extended Related Work

Stationary distribution correction estimation. Estimating the optimal policy’s stationary distribution using off-policy data was introduced by Nachum et al. (2019a) as the DICE trick. This technique has been shown to be effective for off-policy evaluation (Nachum et al., 2019a; Zhang et al., 2020; Dai et al., 2020), policy optimization (Nachum et al., 2019b; Lee et al., 2021), online imitation learning (Kostrikov et al., 2020; Zhu et al., 2020), and, concurrently, offline imitation learning (Anonymous, 2022). Among DICE-based policy optimization methods, none has tackled state-occupancy matching or applied Fenchel duality in its full generality to arrive at the form of the value-function objective we derive.

Appendix C SMODICE with common f-divergences

Example 1 (SMODICE with -divergence).

Suppose , corresponding to -divergence. Then, we can show that and . Hence, the SMODICE objective amounts to



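As a worked check of the kind of conjugate computation this example relies on, the derivation below assumes that Example 1 concerns the $\chi^2$-divergence with the common generator $f(x) = \tfrac{1}{2}(x-1)^2$ over unconstrained $x$; the paper's exact normalization may differ, so the constants should be read as illustrative.

```latex
\begin{align*}
f_\star(y) &= \sup_{x}\,\Bigl\{\, xy - \tfrac{1}{2}(x-1)^2 \,\Bigr\}, \\
0 &= y - (x^\star - 1) \;\Longrightarrow\; x^\star = y + 1, \\
f_\star(y) &= y\,(y+1) - \tfrac{1}{2}y^2
            = \tfrac{1}{2}y^2 + y
            = \tfrac{1}{2}(y+1)^2 - \tfrac{1}{2}, \\
f_\star'(y) &= y + 1.
\end{align*}
```

The additive constant $-\tfrac{1}{2}$ does not change the minimizer of an objective built from $f_\star$, so under this assumption the conjugate can be treated as a squared term $\tfrac{1}{2}(y+1)^2$ in the residual.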
Example 2 (SMODICE with KL-divergence).

We have . Using the fact that the conjugate of the negative entropy function, restricted to the probability simplex, is the log-sum-exp function (Boyd et al., 2004), it follows that . Hence, the KL-divergence SMODICE objective is