Off-policy Imitation Learning from Visual Inputs

11/08/2021
by   Zhihao Cheng, et al.
The University of Sydney

Recently, various successful applications utilizing expert states in imitation learning (IL) have been witnessed. However, another IL setting, IL from visual inputs (ILfVI), which holds greater promise for real-world applications by utilizing online visual resources, suffers from low data-efficiency and poor performance resulting from an on-policy learning manner and high-dimensional visual inputs. We propose OPIfVI (Off-Policy Imitation from Visual Inputs), which is composed of an off-policy learning manner, data augmentation, and encoder techniques, to tackle these challenges, respectively. More specifically, to improve data-efficiency, OPIfVI conducts IL in an off-policy manner, with which sampled data can be used multiple times. In addition, we enhance the stability of OPIfVI with spectral normalization to mitigate the side-effects of off-policy training. The core factor contributing to the poor performance of ILfVI, we believe, is that the agent cannot extract meaningful features from visual inputs. Hence, OPIfVI employs data augmentation from computer vision to help train encoders that can better extract features from visual inputs. In addition, a specific structure of gradient backpropagation for the encoder is designed to stabilize its training. Finally, extensive experiments on the DeepMind Control Suite demonstrate that OPIfVI achieves expert-level performance and outperforms existing baselines whether visual demonstrations or visual observations are provided.


1 Introduction

Imitation learning (IL) empowers agents to learn from expert data instead of relying on an explicitly designed reward function (Ho & Ermon, 2016) and has achieved remarkable successes in graphics (Yuan & Kitani, 2018), online games (Vinyals et al., 2019), robotic manipulation (Fang et al., 2019), and saliency prediction (Xu et al., 2021). The expert data that IL uses can be divided into two categories (Goo & Niekum, 2019; Wake et al., 2020): demonstrations and observations. Demonstrations contain the states and actions of experts' experiences, whereas observations only consist of states. In real-world applications, the state is the expert's proprioceptive state, which can be hard to access and record. By contrast, intelligent creatures grasp knowledge or skills by observing how their peers accomplish tasks without knowing those peers' proprioceptive states (Douglas Greer et al., 2006). In other words, intelligent creatures generally learn from visual inputs rather than state inputs. This learning scheme is more practical but has been less studied in the IL community.

The idea of enabling agents to learn like intelligent creatures leads to a branch of IL: imitation learning from visual inputs (or images, or pixels) (ILfVI). This branch is also referred to as visual imitation learning (VIL) (Young et al., 2020; Rafailov et al., 2021). Here, we denote it as ILfVI to emphasize that visual inputs can be further classified. Corresponding to traditional state demonstrations and observations, visual inputs fall into visual demonstrations and visual observations, where the former contains images and actions while the latter merely includes images. In contrast to the dramatic successes of IL from state inputs (Ho & Ermon, 2016; Torabi et al., 2018b; Zhang et al., 2020), there has been only limited research (Li et al., 2017; Torabi et al., 2018b) on ILfVI. Furthermore, the performance of ILfVI is still far from satisfactory for real-world deployment.

Compared to state inputs, the only difference in ILfVI is that images are partially-observed, high-dimensional inputs. This difference introduces several challenges: 1) visual inputs are partial observations of the underlying states, which converts the underlying dynamics from a Markov Decision Process (MDP) (Bellman, 1957) into a Partially Observable MDP (POMDP) (Kaelbling et al., 1998); 2) high-dimensional inputs such as images contain a large portion of redundant information (Ramachandran et al., 2019), distracting agents from extracting the information useful for decision-making; 3) the adversarial training paradigm in IL suffers from training instability with high-dimensional inputs (Goodfellow et al., 2014). In addition, agents in ILfVI need to learn to extract meaningful features from high-dimensional inputs, further aggravating the low data-efficiency of on-policy training. Despite these challenges, improving the performance of ILfVI is important because this IL setting greatly expands the range of promising application domains.

In this paper, considering the challenges in ILfVI, we propose OPIfVI (Off-policy Imitation from Visual Inputs) to improve both data-efficiency and performance. To be more data-efficient, we build OPIfVI in an off-policy manner, where sampled data are stored in a replay buffer such that they can be utilized multiple times. We further adopt spectral normalization to enhance the stability of this off-policy training scheme. Besides, we borrow the idea of data augmentation from computer vision to help encoders extract meaningful features from images. The extracted features are then forwarded to agents to make decisions. Data augmentation enlarges the sampled data and can benefit data-efficiency to some extent. Furthermore, we design a specific structure for gradient backpropagation to train and stabilize the encoders. In this structure, the actor and critic in the generator share an encoder, while the discriminator maintains an independent encoder. The encoder in the generator is updated only with gradients backpropagated from the critic, whereas the other encoder is trained with the discriminator loss. Combining these three aspects, OPIfVI can efficiently and effectively learn from visual inputs. We evaluate OPIfVI's ability to reproduce expert policies with visual demonstrations and visual observations on various DeepMind Control tasks (Tassa et al., 2018), showing that OPIfVI significantly surpasses the other baselines in terms of both data-efficiency and performance. Finally, ablation studies are carried out to investigate the impacts of the modifications that we adopt.

The remainder of this work is organized as follows. We present the related work in Section 2 and introduce necessary background knowledge in Section 3. Our method OPIfVI is then thoroughly described in Section 4. Section 5 empirically demonstrates the performance of OPIfVI, and Section 6 concludes the paper.

2 Related Work

Imitation Learning from State Inputs.

Imitation Learning (IL) aims to reproduce policies that imitate expert behaviors merely by relying on expert data. IL algorithms can be categorized from various perspectives. For example, according to the learning mechanism, IL can be divided into behavioral cloning (BC) (Bain & Sammut, 1995) and inverse reinforcement learning (IRL) (Abbeel & Ng, 2004). BC treats IL as pure supervised learning, while IRL first reconstructs a reward function with expert data and then conducts ordinary RL with the reconstructed reward function. For more works, please refer to Ho & Ermon (2016); Fu et al. (2017); Torabi et al. (2018b); Zhang et al. (2020); Dadashi et al. (2020); Jaegle et al. (2021) and the references therein. From the perspective of expert data, IL can be categorized into learning from demonstration (LfD) and learning from observation (LfO) (Goo & Niekum, 2019; Wake et al., 2020). Demonstrations contain both the states and actions of experts, whereas observations only consist of expert states. Torabi et al. (2018a) present behavioral cloning from observation (BCO), which combines BC with an inverse dynamics model to infer expert actions. Torabi et al. (2018b) develop generative adversarial imitation from observation (GAIfO), which utilizes state transitions rather than state-action pairs to generate rewards, extending GAIL to the LfO setting. Most of these IRL algorithms employ an on-policy learning manner to maintain accurate estimations of occupancy measures (Ho & Ermon, 2016; Torabi et al., 2018b), which results in low data-efficiency.

Imitation Learning from Visual Inputs.

ILfVI is attracting increasing attention owing to its broad application prospects, yet there has been only limited research on it. Li et al. (2017) propose InfoGAIL, which can deal with visual demonstrations sampled from diverse experts by learning latent variables from expert data. To cope with visual demonstrations, Young et al. (2020) enhance BC with data augmentation and develop visual behavior cloning (VBC). Torabi et al. (2018b) conduct experiments with GAIfO on visual observations, showing that GAIfO only achieves about half of the expert-level performance in non-trivial environments. Two concurrent works (Rafailov et al., 2021; Anonymous, 2022) give further insights into ILfVI. Rafailov et al. (2021) solve ILfVI from a model-based perspective: their algorithm V-MAIL first learns a world model and then updates the discriminator with on-policy samples from the learned model. Anonymous (2022) study IL from visual observations with a model-free scheme; they build their IL algorithms on the encoder of DrQ-v2 (Yarats et al., 2021) to help extract features and achieve expert-level performance. Another line of research focuses on ILfVI in the domain of robot control (Yu et al., 2018; Pathak et al., 2018). It aims to achieve few-shot or even zero-shot IL, with expert data provided with time labels, which is distinct from our setting.

Data Augmentation.

In computer vision (CV), data augmentation is one of the basic techniques for the majority of tasks such as classification (Perez & Wang, 2017), detection (Zoph et al., 2020), and recognition (Lv et al., 2017). Data augmentation has long been studied in CV (Fawzi et al., 2016; Cubuk et al., 2019; Tran et al., 2021), where it helps enlarge datasets and extract useful features. However, for state inputs, data augmentation is rarely adopted in RL or IL (Yarats et al., 2020), because a state in an MDP is unique and any modification produces a state that represents different information from the unmodified one. Sinha & Garg (2021) study how to augment states in RL by adding small noise to sampled data. For visual inputs, data augmentation has recently been employed in RL and significantly improves performance (Yarats et al., 2019, 2020; Laskin et al., 2020; Seo et al., 2021). For example, Laskin et al. (2020) show through extensive studies that general data augmentation methods enable agents to achieve excellent performance in RL from visual inputs. In ILfVI, data augmentation also brings remarkable performance gains (Young et al., 2020; Rafailov et al., 2021).

Our work bears some resemblance to the two concurrent works Rafailov et al. (2021) and Anonymous (2022). Compared to Rafailov et al. (2021), we solve ILfVI from a model-free perspective, improving performance with only off-policy samples instead of a learned model, and our method even works for visual observations. In contrast to Anonymous (2022), we dynamically calculate rewards with the latest discriminator and employ spectral normalization to stabilize the learning process, which leads to better data-efficiency and performance.

3 Preliminaries

Markov Decision Process (MDP) (Bellman, 1957).

We consider an MDP described by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, with the state space $\mathcal{S}$, the action space $\mathcal{A}$, the transition distribution $P(s_{t+1}|s_t, a_t)$, the reward function $r(s_t, a_t)$, and the discount factor $\gamma$. We denote a stochastic policy for the agent as $\pi(a|s)$, where $s \in \mathcal{S}$ and $a \in \mathcal{A}$. A trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ can be obtained via interactions between the policy $\pi$ and the environment, where $t$ is the current timestep, the initial state $s_0$ is sampled from the probability distribution $\rho_0$, $a_t \sim \pi(\cdot|s_t)$, and $s_{t+1} \sim P(\cdot|s_t, a_t)$. Then, the expected discounted reward is $\eta(\pi) = \mathbb{E}_{\tau}\big[\sum_{t} \gamma^t r(s_t, a_t)\big]$. RL algorithms are supposed to find the optimal policy $\pi^* = \arg\max_{\pi} \eta(\pi)$, which achieves the maximum expected cumulative reward.

When we use visual inputs to control agents, the MDP formulation turns into a POMDP (Yarats et al., 2020). Compared to an MDP, a POMDP is formulated as a 7-tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma, \Omega, O)$, where the two additional elements $\Omega$ and $O(o_t|s_t)$ are the space of observations and the probability of receiving observation $o_t$ in state $s_t$, respectively. In a POMDP, agents merely receive partially observed information $o_t$ instead of the state $s_t$, which increases the difficulty of making decisions. A routine solution to deal with the partial observability is to stack several adjacent visual inputs together and regard the stack as a state (Mnih et al., 2015; Laskin et al., 2020; Yarats et al., 2020). In this paper, we follow this routine to cope with visual inputs and then carry out IL.
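The frame-stacking routine can be realized with a thin environment wrapper. The sketch below is a minimal illustration assuming a gym-style interface whose observations are images of shape (C, H, W); the class and method names are ours, not the paper's code.

```python
import numpy as np
from collections import deque

class FrameStack:
    """Minimal frame-stacking wrapper (sketch): concatenates the last k image
    observations along the channel axis so the stack can be treated as an
    approximate state in the POMDP."""

    def __init__(self, env, k=3):
        self.env = env
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()                         # obs: (C, H, W) image
        for _ in range(self.k):
            self.frames.append(obs)
        return np.concatenate(self.frames, axis=0)     # (k*C, H, W)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.concatenate(self.frames, axis=0), reward, done, info
```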

Soft Actor-Critic (SAC) (Haarnoja et al., 2018).

We introduce the off-policy RL algorithm SAC, which maintains a policy $\pi_\phi$, two Q-value functions $Q_{\theta_1}$ and $Q_{\theta_2}$, and a temperature parameter $\alpha$. SAC uses samples from a replay buffer $\mathcal{B}$ and updates the Q-value functions to minimize the following objectives:

$$J_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}, d_t) \sim \mathcal{B}} \Big[ \big( Q_{\theta_i}(s_t, a_t) - y_t \big)^2 \Big], \quad i = 1, 2, \qquad (1)$$

where $y_t = r_t + \gamma (1 - d_t) \big( \min_{j=1,2} Q_{\bar{\theta}_j}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1}|s_{t+1}) \big)$, $d_t$ is the done signal ($d_t = 1$ if an episode terminates; otherwise $d_t = 0$), and $a_{t+1} \sim \pi_\phi(\cdot|s_{t+1})$. $Q_{\bar{\theta}_j}$ is the target Q-value network of $Q_{\theta_j}$, and its parameters $\bar{\theta}_j$ are obtained via an exponential moving average of $\theta_j$. The policy parameters $\phi$ are updated to maximize

$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{B},\, a_t \sim \pi_\phi} \Big[ \min_{j=1,2} Q_{\theta_j}(s_t, a_t) - \alpha \log \pi_\phi(a_t|s_t) \Big]. \qquad (2)$$

The temperature $\alpha$, which helps improve stability, is adjusted by minimizing

$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_\phi} \big[ -\alpha \log \pi_\phi(a_t|s_t) - \alpha \bar{\mathcal{H}} \big], \qquad (3)$$

where $\bar{\mathcal{H}}$ is the target minimum entropy.
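For reference, a compact PyTorch sketch of these three objectives is given below. The `actor.sample` and `critic` interfaces, as well as the batch layout, are assumptions made for illustration and are not part of any official SAC implementation.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, actor, critic, critic_target, log_alpha, gamma, target_entropy):
    """Sketch of the SAC objectives in Eqs. (1)-(3)."""
    obs, act, rew, next_obs, done = batch

    # Eq. (1): critic loss against the soft Bellman target.
    with torch.no_grad():
        next_act, next_logp = actor.sample(next_obs)
        q1_t, q2_t = critic_target(next_obs, next_act)
        alpha = log_alpha.exp()
        target_v = torch.min(q1_t, q2_t) - alpha * next_logp
        y = rew + gamma * (1.0 - done) * target_v
    q1, q2 = critic(obs, act)
    critic_loss = F.mse_loss(q1, y) + F.mse_loss(q2, y)

    # Eq. (2): the actor maximizes the entropy-regularized minimum Q-value.
    new_act, logp = actor.sample(obs)
    q1_pi, q2_pi = critic(obs, new_act)
    actor_loss = (alpha * logp - torch.min(q1_pi, q2_pi)).mean()

    # Eq. (3): the temperature is tuned toward the target minimum entropy.
    alpha_loss = -(log_alpha.exp() * (logp + target_entropy).detach()).mean()
    return critic_loss, actor_loss, alpha_loss
```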

Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016).

GAIL adopts the framework of GAN (Generative Adversarial Network) training (Goodfellow et al., 2014) and can be formalized as a minimax problem:

$$\min_{\pi} \max_{w} \ \mathbb{E}_{\pi} \big[\log D_w(s, a)\big] + \mathbb{E}_{\pi_E} \big[\log(1 - D_w(s, a))\big] - \lambda H(\pi), \qquad (4)$$

in which $D_w$ is the discriminator that measures the similarity between the agent's state-action pairs and the expert's, $w$ is the parameter of the discriminator, $\pi_E$ is the expert policy, and $H(\pi)$ is the entropy of the policy $\pi$ with weight $\lambda$.
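In practice, the inner maximization is usually implemented as a binary cross-entropy loss. The following sketch illustrates this under one common labeling convention (agent pairs labeled 1, expert pairs labeled 0, consistent with Eq. (4)); the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def gail_discriminator_loss(disc, agent_sa, expert_sa):
    """Sketch of the inner maximization of Eq. (4): the discriminator is
    trained with BCE to separate agent state-action pairs from expert ones."""
    agent_logits = disc(*agent_sa)       # (s, a) from the current policy
    expert_logits = disc(*expert_sa)     # (s, a) from the expert data
    loss = (F.binary_cross_entropy_with_logits(agent_logits, torch.ones_like(agent_logits))
            + F.binary_cross_entropy_with_logits(expert_logits, torch.zeros_like(expert_logits)))
    return loss

def gail_reward(disc, s, a):
    """The policy minimizes log D(s, a), i.e. it is rewarded with -log D(s, a)
    for being classified as expert-like under the convention above."""
    with torch.no_grad():
        return -F.logsigmoid(disc(s, a))
```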

Generative Adversarial Imitation from Observation (GAIfO) (Torabi et al., 2018b).

GAIfO extends GAIL to learn from expert data in the absence of actions. The only difference between the two approaches is that GAIfO uses state-state pairs, whereas GAIL utilizes state-action pairs. GAIfO is formalized as follows:

$$\min_{\pi} \max_{w} \ \mathbb{E}_{\pi} \big[\log D_w(s_t, s_{t+1})\big] + \mathbb{E}_{\pi_E} \big[\log(1 - D_w(s_t, s_{t+1}))\big]. \qquad (5)$$

When using single states instead of state-state pairs, GAIfO degrades to GAIfO-s (Yang et al., 2019):

$$\min_{\pi} \max_{w} \ \mathbb{E}_{\pi} \big[\log D_w(s_t)\big] + \mathbb{E}_{\pi_E} \big[\log(1 - D_w(s_t))\big]. \qquad (6)$$

4 Off-policy Imitation from Visual Inputs

Figure 1: The framework of OPIfVI. The main components of OPIfVI are a replay buffer, an expert data set, a data augmentation block (abbreviated as AUG), an encoder, a policy, Q-value functions, and a discriminator. Solid lines with arrows denote directions of information flow, while colored dotted lines represent gradients backpropagated from loss functions.

Here, we describe our off-policy imitation learning algorithm for visual inputs, OPIfVI, which is presented in Figure 1. We begin by formalizing the off-policy ILfVI problem and introducing the challenges responsible for the low data-efficiency and poor performance of current IL algorithms. To alleviate these challenges, OPIfVI adopts three major modifications compared to previous works: 1) an off-policy imitation learning paradigm with enhanced stability (Section 4.1); 2) data augmentation for better feature extraction (Section 4.2); 3) a specifically designed gradient backpropagation scheme for training the encoders (Section 4.3). By virtue of these three modifications, OPIfVI is able to achieve expert-level performance in ILfVI with high data-efficiency, surpassing the other baselines.

Problem Formulation.

Two major differences exist between our ILfVI setting and previous ones (Ho & Ermon, 2016; Torabi et al., 2018b). The first is that agents only receive high-dimensional, partially-observed images instead of low-dimensional, fully-observed states. Although we can stack several consecutive images into $O_t$ and roughly regard $O_t$ as a state, the high-dimensional inputs still make imitation challenging. We consider the problem of learning from visual demonstrations and visual observations, i.e., the expert data are $\{(O^E_t, a^E_t)\}$ or $\{(O^E_t, O^E_{t+1})\}$, respectively. Furthermore, we also study a degraded setting of visual observations, where only single observations rather than neighboring observation pairs are provided, i.e., $\{O^E_t\}$. To unify the three kinds of expert data, we define a symbol $\chi$ such that $\chi = (O_t, a_t)$, $(O_t, O_{t+1})$, or $O_t$. The remaining difference is that we use off-policy samples to conduct IL. With off-policy samples, the distribution of the samples used for updating parameters can change substantially, increasing training instability. Our ILfVI problem is formalized as follows:

$$\min_{\pi} \max_{w} \ \mathbb{E}_{\chi \sim \rho_{\mathcal{B}}} \big[\log D_w(\chi)\big] + \mathbb{E}_{\chi \sim \tau_E} \big[\log(1 - D_w(\chi))\big], \qquad (7)$$

where $\rho_{\mathcal{B}}$ is the distribution of $\chi$ in a replay buffer $\mathcal{B}$ and $\tau_E$ denotes the expert data. Samples in the replay buffer are recorded from historical policies and are therefore off-policy.
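For concreteness, the following minimal sketch shows how the unified discriminator input $\chi$ could be assembled for the three kinds of expert data; the dictionary keys and the function name are hypothetical, not taken from the paper's implementation.

```python
def make_chi(batch, mode):
    """Assemble the discriminator input chi for the three settings in Eq. (7).
    `batch` is assumed to carry stacked images O_t, actions a_t, and O_{t+1}."""
    if mode == "demonstrations":       # chi = (O_t, a_t)
        return batch["obs"], batch["act"]
    elif mode == "observations":       # chi = (O_t, O_{t+1})
        return batch["obs"], batch["next_obs"]
    elif mode == "single":             # chi = O_t
        return (batch["obs"],)
    raise ValueError(mode)
```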

Algorithm Overview.

The framework of OPIfVI is presented in Figure 1. Similar to SAC, there are four Q-value functions in OPIfVI, including two alternate Q-value functions ($Q_{\theta_1}$ and $Q_{\theta_2}$) and two target ones ($Q_{\bar{\theta}_1}$ and $Q_{\bar{\theta}_2}$). For simplicity, we denote them with a unified symbol $Q$ in Figure 1. The policy and Q-value functions share an identical encoder, which is constructed with convolutional neural networks (CNNs) and used to extract features from images. Through interactions between the policy and the environment, we collect samples and store them in a replay buffer $\mathcal{B}$. For training, we first randomly sample data from the replay buffer and augment the sampled images with random crop. The augmented data are then used to calculate rewards with the current discriminator. Subsequently, we feed the data into the encoder followed by the Q-value functions and the policy. According to the losses listed in Algorithm 1, we backpropagate gradients to update the parameters of the encoder, the Q-value functions, and the policy. Note that the shared encoder is trained only with gradients from the Q-value functions. The discriminator maintains a separate encoder with the same structure as the one shared by the policy and Q-value functions. The discriminator uses samples from the replay buffer and the expert data set to update its parameters with the loss in Algorithm 1. Images input to the discriminator are also augmented, and spectral normalization is adopted to enhance the stability of discriminator training.

Inputs: Expert trajectories $\tau_E$.
Hyperparameters: Total iteration number $N$, replay buffer size, initial number of samples, image augmentation AUG, minibatch size, learning rate, Polyak averaging coefficient for target networks, discount factor $\gamma$, target minimum entropy $\bar{\mathcal{H}}$, and temperature $\alpha$.
Parameters: Denote the encoder as $f_\xi$, the policy as $\pi_\phi$, the Q-value and target Q-value functions as $Q_{\theta_i}$ and $Q_{\bar{\theta}_i}$ ($i = 1, 2$), and the discriminator as $D_w$. The parameters of each block are denoted by its subscript.
Initialize the replay buffer $\mathcal{B}$ with randomly sampled transitions.
for $k = 1$ to $N$ do
      Sample a minibatch of transitions from the replay buffer $\mathcal{B}$.
      Compute rewards for the sampled transitions with the discriminator $D_w$.
      Sample a minibatch of expert data from $\tau_E$.
      Augment the sampled agent and expert images with AUG.
      Extract features from the augmented images with the encoders.
      Update the encoder $f_\xi$ and the Q-value functions $Q_{\theta_1}, Q_{\theta_2}$ with the critic loss in Eq. (1).
      Update the target Q-value functions $Q_{\bar{\theta}_1}, Q_{\bar{\theta}_2}$ via Polyak averaging.
      Update the policy $\pi_\phi$ with the actor loss in Eq. (2).
      Update the temperature $\alpha$ with Eq. (3).
      Update the discriminator $D_w$ and its encoder with the adversarial loss in Eq. (7).
      Sample a new transition from the environment and add it to $\mathcal{B}$.
end for
Algorithm 1 Off-policy Imitation from Visual Inputs (OPIfVI)

4.1 Off-policy Learning

To improve data-efficiency, OPIfVI adopts an off-policy training manner by using the off-policy algorithm SAC as its generator. SAC uses a replay buffer to store historical samples collected by previous policies and randomly samples data from this buffer to train its policy and Q-value networks (Haarnoja et al., 2018). Such a replay buffer allows samples to be utilized multiple times for policy improvement, thus making learning more data-efficient. However, this off-policy learning scheme poses a threat to the training stability of OPIfVI.

Compared to on-policy adversarial IL algorithms (Torabi et al., 2018b; Zhang et al., 2020), OPIfVI updates the discriminator with data from a replay buffer to improve its ability to discriminate samples, resulting in an off-policy training mode for the discriminator. The characteristics of the current generator and discriminator cannot be estimated from off-policy samples as accurately as from on-policy samples. As a result, the off-policy update of the adversarial training structure tends to be less stable (Kostrikov et al., 2019). Worse, this off-policy regime is likely to overfit the training data, leading to severe training instability or even failures of imitation, as shown in Rafailov et al. (2021); Hoshino et al. (2021). In OPIfVI, to enhance training stability against the drawbacks of off-policy learning, we employ spectral normalization (Miyato et al., 2018; Cheng et al., 2021) to force the discriminator to be locally Lipschitz-continuous. Local Lipschitz continuity of the learned reward function is necessary for off-policy adversarial IL algorithms to achieve excellent performance (Blondé et al., 2020).

With spectral normalization, both the performance and the stability of OPIfVI are significantly improved. In particular, different from Anonymous (2022), OPIfVI does not store rewards in the replay buffer; it re-calculates rewards with the newest discriminator for every policy improvement step. This dynamic reward calculation provides more accurate rewards for every update and enables much faster learning than Anonymous (2022).
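As an illustration, the sketch below shows how a spectrally normalized discriminator head could be built in PyTorch and how rewards can be re-computed from the newest discriminator at sampling time rather than stored in the buffer. The network sizes and helper names are illustrative assumptions, not the official implementation.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

def build_discriminator(in_dim, hidden=1024):
    # Spectral normalization constrains each layer's Lipschitz constant,
    # which the paper relies on to stabilize off-policy adversarial training.
    return nn.Sequential(
        spectral_norm(nn.Linear(in_dim, hidden)), nn.ReLU(),
        spectral_norm(nn.Linear(hidden, hidden)), nn.ReLU(),
        spectral_norm(nn.Linear(hidden, 1)),
    )

def relabel_rewards(disc_encoder, disc_head, aug, obs_batch):
    # Rewards are NOT stored in the replay buffer: every sampled minibatch is
    # re-labeled with the newest discriminator (reward = -log D, under the
    # labeling convention of the earlier GAIL sketch).
    feats = disc_encoder(aug(obs_batch))
    return -F.logsigmoid(disc_head(feats))
```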

4.2 Data Augmentation

Unlike state inputs, where every element of a state carries a specific physical meaning and is irreplaceable, images contain plenty of redundant information. In ILfVI, agents struggle to select actions based on these high-dimensional inputs and first need to extract meaningful features from pixels. For example, in robot locomotion tasks, agents are expected to accurately estimate the joint angles and angular velocities of the robot, which define its state (Peng et al., 2018), from images in order to make decisions. However, it is challenging to learn what is essential for decision-making from several consecutive images. To help extract meaningful features from images, we employ data augmentation in OPIfVI.

Data augmentation is used to enlarge both the expert data and the agent data, and it helps suppress overfitting and enhance robustness (Shorten & Khoshgoftaar, 2019). The data-efficiency of OPIfVI is also improved to some extent because we can obtain more samples by augmenting the sampled data. Owing to the nature of images, modifying a number of pixels in an image does not distort its core information (Shorten & Khoshgoftaar, 2019), so data augmentation helps OPIfVI learn invariant features from images. In OPIfVI, we employ random crop to augment visual inputs, which is simple yet can dramatically improve performance (Yarats et al., 2020). Data augmentation is vital for the successful imitation of OPIfVI, which we empirically study in Subsection 5.3.
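The sketch below illustrates one common pad-then-crop implementation of random crop for a batch of image observations, in the style popularized by DrQ; the padding size and function name are assumptions for illustration rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def random_crop(imgs, pad=4):
    """Random-crop augmentation (sketch): float images of shape (N, C, H, W)
    are replication-padded by `pad` pixels, then an HxW window of the original
    size is cropped at a random location per sample."""
    n, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(n):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out
```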

4.3 Encoder Training Structure

As discussed in Subsection 4.2, images contain a large portion of redundant information that is useless for agents to make decisions. It is therefore crucial for ILfVI to extract useful features from images and to prevent this redundant information from distracting agents when selecting actions. OPIfVI employs encoders to extract features from images. An encoder perceives an image in RGB form and outputs low-dimensional features (Wang et al., 2019). In our framework, three networks (the policy, the Q-value functions, and the discriminator) need an encoder to extract features. Consequently, how to organize the encoders of these networks and train them properly so as to improve their feature extraction is a challenging question.

First, we share the encoder between the actor and the critic. The encoder outputs a latent feature from augmented adjacent images. After encoding, the latent features are fed into two separate MLPs (multilayer perceptrons) that form the policy and the Q-value functions. The encoder's parameters are updated only with gradients from the Q-value losses in Algorithm 1. This separate update rule is inspired by SAC-AE (Yarats et al., 2019), which shows that training the encoder with only the gradients from the Q-value network performs better and is more stable than training it with gradients from both the actor and the critic. Second, the discriminator also needs an encoder block, and the question becomes how to design the encoder for the discriminator network.

Generally, we have three choices: 1) maintain a separate encoder for the discriminator; 2) share the encoder of the Q-value functions with the discriminator and co-train it with gradients from both the discriminator and the Q-value functions; 3) share the encoder with the discriminator but do not backpropagate gradients from the discriminator into the encoder. Resembling previous work on GAIfO (Torabi et al., 2018b), we choose to hold an independent encoder for the discriminator rather than sharing the encoder of the Q-value functions, for the following reasons. The crucial role of the discriminator is to decide whether an image is sampled from the agent or from the expert, and many successful GAN methods whose discriminators possess a separate encoder have been developed (Skorokhodov et al., 2021; Karras et al., 2021). Besides, our discriminator is spectrally normalized, which enforces the Lipschitz continuity of the network. This Lipschitz continuity could impair the representational capacity of neural networks, making it difficult to share encoder parameters between the discriminator and the generator. Furthermore, the discriminator is trained with losses such as the binary cross-entropy loss (BCE loss (Osa et al., 2018)), which could distract the encoder from extracting meaningful features for decision-making.

In OPIfVI, we share an encoder between the policy and the Q-value network but maintain a separate encoder for the discriminator. The encoders extract features from images for downstream use, such as selecting actions or discriminating samples. The shared encoder is trained only with gradients from the Q-value losses, while the encoder of the discriminator is updated with the discriminator loss. This design plays an important role in stabilizing the training of the imitator and achieves better performance than the alternative structures.
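The following PyTorch sketch illustrates the gradient-routing idea for the shared encoder: only the critic path backpropagates into the encoder, while the actor consumes detached features. The module interfaces are assumptions made for the sake of the example.

```python
def actor_critic_forward(encoder, actor, critic, obs, act):
    """Sketch of the gradient routing in Section 4.3: the policy and Q-value
    functions share one encoder, but only the critic loss is allowed to
    update it; the actor sees detached features."""
    feats = encoder(obs)

    # Critic path: gradients flow from the Q-value loss into the encoder.
    q1, q2 = critic(feats, act)

    # Actor path: the encoder output is detached, so the policy loss
    # cannot change the encoder parameters.
    pi_feats = feats.detach()
    action_dist = actor(pi_feats)
    return q1, q2, action_dist
```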

5 Experimental Results

We conduct experiments with DeepMind Control Suite (Tassa et al., 2018) to demonstrate the performance of OPIfVI and compare it against other baselines. We aim to answer the following questions:

  • Can OPIfVI successfully reproduce expert policies from visual inputs and outperform other baselines regarding both data-efficiency and performance?

  • Does every modification that we adopt improve the performance or data-efficiency of OPIfVI, and what role does each play? In particular, we investigate the effects of spectral normalization in the off-policy training manner, data augmentation, and the encoder training structure.

5.1 Setups and Baselines

We choose four typical environments in the DeepMind Control Suite (Tassa et al., 2018), i.e., CartPole Swingup, Walker Walk, Hopper Stand, and Cheetah Run. First, we use DrQ (Yarats et al., 2020) to train experts in these environments and then sample data using the trained experts. Visual observations can be obtained by removing the actions from visual demonstrations. We then conduct IL experiments with the acquired visual inputs. More details on the environments, the construction of expert data, and the hyperparameters are deferred to the Appendix.

We compare OPIfVI against two families of IL algorithms, since visual inputs can be divided into visual demonstrations and visual observations. Different baselines are therefore employed for visual demonstrations and visual observations, which are briefly introduced below. For fair evaluation, we use the same data augmentation technique as in Yarats et al. (2020) and identical neural network architectures across the compared algorithms.

Baselines for visual demonstrations

Corresponding to BC and IRL in IL, we select two baselines for visual demonstrations: VBC (Young et al., 2020) and P-DAC (Anonymous, 2022). VBC directly extends BC to the setting where agents take actions based on images by utilizing a neural network architecture that combines CNNs and MLPs. For IRL, P-DAC from the concurrent work serves as the baseline because it demonstrates state-of-the-art performance.

Baselines for visual observations

We choose P-SIL and P-DAC from Anonymous (2022) as counterparts for visual observations. Note that P-SIL and P-DAC are trained with single-observation expert data $\{O^E_t\}$, which slightly differs from the LfO setting. Hence, we conduct experiments to investigate the performance of OPIfVI with both $\{(O^E_t, O^E_{t+1})\}$ and $\{O^E_t\}$. To distinguish them, we denote the two variants as OPIfVI and OPIfVI-s, respectively.

(a) Visual demonstrations
(b) Visual observations
Figure 2: Performance of OPIfVI compared to other baselines on DeepMind Control tasks. Performance is measured by episode cumulative reward, averaged across 5 random seeds; the x-axis is the number of interactions with the environment.

5.2 Results

We conduct experiments with OPIfVI using both visual demonstrations and visual observations. Besides, the baselines corresponding to the two settings are implemented and evaluated. The learning curves are shown in Figure 2. It is clear from the learning curves that: (1) OPIfVI is able to closely replicate expert behaviors and achieve expert-level performance whether visual demonstrations or visual observations are provided, while the other baselines fail to reach expert-like performance within the same number of training steps; (2) OPIfVI noticeably outperforms the baselines regarding both final performance and data-efficiency. For example, at 700k steps in Walker Walk, OPIfVI achieves about × and 5.5× higher scores than P-DAC with visual demonstrations and visual observations, respectively.

5.3 Ablation Studies

First, we study the impacts of spectral normalization and data augmentation in OPIfVI. Concretely, we compare the performance of OPIfVI against its variants without spectral normalization and/or data augmentation; the results are visualized in Figure 3. From Figure 3, we can see that without either of them, OPIfVI cannot reproduce a satisfactory policy in most environments. For example, in CartPole Swingup, OPIfVI/DA only achieves about a quarter of OPIfVI's performance with visual demonstrations. In the more complex environments, only the full OPIfVI is able to achieve expert-level performance, which indicates that both spectral normalization and data augmentation play an important role in OPIfVI.

Second, we conduct additional experiments to validate the encoder training structure. We test four cases: 1) the discriminator maintains a separate encoder and trains it from scratch with the discriminator loss (OPIfVI, the structure we adopt); 2) the same as 1) except that the shared encoder is trained with losses from both the actor and the critic (OPIfVI-2); 3) the discriminator shares the encoder of the Q-value network, and this encoder is trained with only the Q-value loss (OPIfVI-3); 4) the discriminator, actor, and critic each possess an independent encoder, trained separately (OPIfVI-4). From the experimental results in Figure 4, we can see that OPIfVI demonstrates excellent performance across different environments and tasks. In contrast, the other encoder structures can be unstable and perform poorly, especially on the Walker Walk and Hopper Stand tasks with visual observations.

(a) Visual demonstrations
(b) Visual observations
Figure 3: Ablation study of spectral normalization (SN) and data augmentation (DA) in OPIfVI. We use OPIfVI to represent the integrated framework and OPIfVI/X to denote the framework without modification X, where X can be SN, DA, or the union of the two.
(a) Visual demonstrations
(b) Visual observations
Figure 4: Ablation study of encoder structure in OPIfVI.

6 Conclusion

In this paper, we present an imitation learning algorithm, OPIfVI, which can efficiently and effectively learn from visual inputs. OPIfVI works in an off-policy manner, with stability enhanced by spectral normalization, which improves learning efficiency. In addition, to deal with visual inputs, we adopt data augmentation and design a specific architecture for training the encoders. These two techniques help agents better identify meaningful features in visual inputs and thus take correct actions. OPIfVI outperforms previous baselines in terms of both data-efficiency and final performance.

References

Appendix A Environments and Expert Data

A.1 Environments and Specifications

We choose the DeepMind Control Suite (Tassa et al., 2018) to benchmark the performance of IL algorithms; it is widely adopted in previous works (Laskin et al., 2020; Anonymous, 2022) and provides various image-based continuous control tasks, making it suitable for comprehensive evaluation in the ILfVI setting. Four tasks with different complexities are employed in our experiments, i.e., CartPole Swingup, Walker Walk, Hopper Stand, and Cheetah Run, which are shown in Figure 5. In our experiments, the agent takes actions according to three consecutive RGB images, and the height and width of the images are set to 84 pixels. These configurations are consistent with Yarats et al. (2020). Specifications of the tested tasks are presented in Table 1.
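For readers unfamiliar with the suite, the sketch below shows how 84x84 RGB pixel observations of this kind can be obtained from dm_control; it is only an illustration (random actions, no action repeat) and not the paper's data pipeline.

```python
from dm_control import suite
import numpy as np

env = suite.load(domain_name="walker", task_name="walk")
action_spec = env.action_spec()

time_step = env.reset()
frames = []
for _ in range(3):  # collect three consecutive frames to stack
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
    # Render an 84x84 RGB image of the current physics state.
    frames.append(env.physics.render(height=84, width=84, camera_id=0))

obs = np.concatenate(frames, axis=-1)   # shape (84, 84, 9), channel last
```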

(a) CartPole Swingup
(b) Walker Walk
(c) Hopper Stand
(d) Cheetah Run
Figure 5: Screenshots of the tasks in DeepMind Control Suite.
Environment       | State Space | Image Space | Action Space | Max-Step
CartPole Swingup  | 4           | 84×84×9     | 1            | 1000
Walker Walk       | 18          | 84×84×9     | 6            | 1000
Hopper Stand      | 14          | 84×84×9     | 4            | 1000
Cheetah Run       | 18          | 84×84×9     | 6            | 1000
Table 1: Specifications of the DeepMind Control Suite tasks. The image space is the stacked observation of three 84×84 RGB frames.

A.2 Expert Data

Here, we give more details on how we generate expert data for the experiments. As mentioned before, we use the algorithm DrQ (Yarats et al., 2020) to train experts with its default configurations. The action repeat for the four environments is set to 4, so the episode length is 250. We stack three consecutive frames together to construct $O_t$. As a result, agents take actions according to $O_t$, whose dimension is 84×84×9 (channel last). Once we obtain the trained expert policies, we execute one of them in an environment and record the data, which are then used as expert data. The expert data for visual demonstrations and visual observations are recorded as $\{(O^E_t, a^E_t)\}$ and $\{(O^E_t, O^E_{t+1})\}$, respectively. In particular, we also consider a special kind of visual observation consisting of single images, $\{O^E_t\}$. For every environment, we sample 20 expert trajectories, i.e., we store 5,000 image-action pairs or image-image pairs. The performance of the expert data is listed in Table 2.

Environment       | Expert Return
CartPole Swingup  | 873.8 ± 1.5
Walker Walk       | 943.1 ± 22.4
Hopper Stand      | 860.7 ± 48.6
Cheetah Run       | 675.0 ± 30.9
Table 2: Performance of the expert data.

Appendix B Implementation Details

Our algorithm OPIfVI is implemented based on two open-source codebases, DrQ (Yarats et al., 2020) and OpenAI Baselines (Dhariwal et al., 2017). The generator in OPIfVI is largely based on DrQ, except that it does not adopt DrQ's augmentation for the Q-value functions. The discriminator has an encoder structure identical to that of the generator and is spectrally normalized. The counterparts that we compare against, P-SIL and P-DAC, use the anonymous official implementation of Anonymous (2022). Since the expert data of Anonymous (2022) are not provided, we use the expert data constructed in the above subsection instead. For fair comparison, the action repeat is set to 4 and the batch size to 128, while the other hyperparameters are kept at their defaults for P-SIL and P-DAC. The hyperparameters for our experiments are presented in Table 3.

Hyperparameters Value
Environment parameters
  Image size 84×84
  Action repeat 4
  Frame stack 3
Common parameters
  Activation ReLU
  Batch size 128
  Optimizer Adam
  Encoder feature dim 50
  Actor update frequency 2
  Critic update frequency 1
  Discriminator update frequency 1
SAC parameters
  MLP network size (1024,1024)
  Discount 0.99
  Learning rate
  Initial temperature 0.1
  Polyak 0.01
Discriminator parameters
  Learning rate
  MLP network size (1024,1024)
Table 3: Hyperparameters in experiments.