Imitation learning serves as a powerful framework for getting robots to learn complex skills in visually rich environments [52, 40, 11, 54, 49]. Recent works in this area have shown promising results in generalizing to previously unseen environments on robotic tasks such as pick-and-place, pushing, and rearrangement. However, such generalization is often too narrow to transfer directly to diverse real-world applications. For instance, policies trained to open one door rarely generalize to opening different doors. This lack of generalization is further exacerbated by the plethora of options for achieving it: hundreds of diverse demonstrations, task-specific priors, or large parametric models. This begs the question: what really matters for generalization in visual imitation?
An obvious answer is visual representation: generalizing to diverse visual environments should require powerful representation learning. Prior work in computer vision [16, 7, 8, 5, 4] has shown that better representations significantly improve downstream performance on tasks such as image classification. However, in robotics, evaluating the performance of visual representations is more complicated. Consider behavior cloning, one of the simplest methods of imitation. Standard approaches in behavior cloning fit convolutional neural networks to a large dataset of expert demonstrations using end-to-end gradient descent. Although powerful, such models conflate two fundamental problems in visual imitation: (a) representation learning, i.e. inferring information-preserving low-dimensional embeddings from high-dimensional observations, and (b) behavior learning, i.e. generating actions given representations of the environment state. This joint learning is often what drives the large dataset requirements of such techniques.
One way to achieve this decoupling is to use representation modules pre-trained on standard proxy tasks such as image classification, detection, or segmentation. However, this relies on large amounts of human-labelled data from datasets that are often significantly out of distribution relative to robot data. A more scalable approach is to take inspiration from recent work in computer vision, where visual encoders are trained using self-supervised losses [8, 7, 16]. These methods allow the encoders to learn useful features of the world without requiring human labelling. There has been recent progress in vision-based Reinforcement Learning (RL) that improves performance by creating this explicit decoupling [41, 48]. Visual imitation has a significant advantage over RL settings: learning visual representations in RL is further coupled with challenges in exploration, which has limited RL's application in real-world settings due to poor sample complexity.
In this work we present a new and simple framework for visual imitation that decouples representation learning from behavior learning. First, given an offline dataset of experience, we train visual encoders that embed high-dimensional visual observations into low-dimensional representations. Next, given a handful of demonstrations, for a new observation we find its nearest neighbors in the representation space, and use a weighted average of those neighbors' actions as the agent's behavior for that observation. This technique is inspired by Locally Weighted Regression, where instead of operating on state estimates, we operate on self-supervised visual representations. Intuitively, this makes the behavior roughly correspond to a Mixture-of-Experts model trained on the visual demonstrations. Since nearest neighbors is non-parametric, this technique requires no additional training for behavior learning. We refer to our framework as Visual Imitation through Nearest Neighbors (VINN).
Our experimental analysis demonstrates that VINN can successfully learn powerful representations and behaviors across three manipulation tasks: Pushing, Stacking, and Door Opening. Surprisingly, we find that non-parametric behavior learning on top of learned representations is competitive with end-to-end behavior cloning methods. On offline MSE metrics, we report results on par with competitive baselines while being significantly simpler. To further test the real-world applicability of VINN, we run robot experiments on opening doors using 71 visual demonstrations. Across a suite of generalization experiments, VINN succeeds 80% of the time on doors present in the demonstration dataset and 40% of the time on opening the door in novel scenes. In contrast, our strongest baselines have success rates of 53.3% and 3.3% respectively.
To summarize, this paper presents the following contributions. First, we present VINN, a novel yet simple-to-implement visual imitation framework that derives non-parametric behaviors from learned visual representations. Second, we show that VINN is competitive with standard parametric behavior cloning and can outperform it on a suite of manipulation tasks. Third, we demonstrate that VINN can be used on real robots for opening doors and can achieve high generalization performance on novel doors. Finally, we extensively ablate over and analyze different representations, amounts of training data, and other hyperparameters to demonstrate the robustness of VINN.
II Related Work
II-A Imitation via Cloning
Imitation learning is frequently used to learn skills and behaviors from human demonstrations [31, 27, 26, 42]. In the context of manipulation, such techniques have successfully solved a variety of problems in pushing, stacking, and grasping [52, 53, 2, 19]. Behavioral Cloning (BC) is one of the most common techniques. If the agent's morphology or viewpoint differs from the demonstrations', the model needs techniques such as transfer learning to resolve the domain gap [40, 36]. To close this unintended domain gap, some works have used tele-operation, while [39, 49] have used assistive tools. Assistive tools provide the benefit of being able to scalably collect diverse demonstrations. In this paper, we follow the DemoAT framework to collect expert demonstrations.
II-B Visual Representation Learning
In computer vision, interest in learning good representations has been longstanding, especially when labelled data is rare or difficult to collect [7, 8, 16, 5]. This large class of representation learning techniques aims to extract features that help other models improve their performance on downstream learning tasks, without needing explicit labels. First, a model is trained on one or more pretext tasks on an unlabeled dataset to learn a representation. Such tasks generally include instance invariance, or predicting image transformation parameters (e.g. rotation and distortion), patches, or frame order [15, 10, 9, 28, 7, 8, 46]. In representation learning, the model's performance on the pretext task itself is usually disregarded; the focus is instead on the mapping from input domain to representation that the model has learned. Ideally, in order to solve the pretext tasks, the pretrained model has learned some useful structure and encoded it in the representation. Thus, intuitively, such a model can be used in downstream tasks where there is not enough data to learn this structure directly from the available task-relevant data. Unsupervised representation learning, in works such as [7, 8, 16, 5, 4, 12], has shown impressive performance gains on difficult benchmarks since it can harness large amounts of unlabelled data unavailable in task-specific datasets.
Recently, interest in unsupervised and semi-supervised representation learning techniques has grown within robotics due to the availability of unlabeled data and their effectiveness in visual imitation tasks [50, 51]. We follow a BYOL-style self-supervised representation learning framework in our experiments.
II-C Non-parametric Control
Non-parametric models are those that, instead of fitting parameters to a data distribution, express predictions directly in terms of previously observed training data. Non-parametric models are significantly more expressive, but as a downside usually require a large number of training examples to generalize well. A popular and simple family of non-parametric models is Locally Weighted Learning (LWL). LWL is a form of instance-based, non-parametric learning that refers to algorithms whose response to any query is a weighted aggregate of similar examples. Simple nearest neighbor models are one example, where all the weight is put on the single closest neighbor to the input point; nearest neighbor methods have been successfully used in previous work on control tasks. More sophisticated k-NN algorithms base their predictions on an aggregate of the k nearest points.
The effectiveness of LWL algorithms like k-nearest neighbors has been demonstrated on difficult, high-dimensional tasks such as miniImageNet classification. LWL has also shown success on robotic control problems, although it requires an accurate state estimator to obtain low-dimensional states. In [22, 32, 33], elements of non-parametric learning are woven into reinforcement learning algorithms to create models that can adjust their complexity based on the amount of available data. Finally, non-parametric k-Nearest Neighbor regression based Q-functions have been shown to give a good approximation of the true Q-function under some theoretical assumptions. Our work, VINN, draws inspiration from the simplicity of LWL and demonstrates the usefulness of this idea by applying Locally Weighted Regression to challenging visual robotic tasks.
III Visual Imitation through Nearest Neighbors (VINN)
In this section, we describe the components of our algorithms and how they fit together to create VINN. As seen in Fig. 2, VINN consists of two parts: (a) training an encoding network on offline visual data, and (b) querying against the provided demonstrations for a nearest-neighbor based action prediction.
III-A Visual Representation Learning
Given an offline dataset of visual experience from the robot, we first learn a visual representation embedding function. In this work, we use two key insights: first, we can learn a good visual prior from existing large but unrelated real-world datasets; second, we can fine-tune from that prior using our demonstration dataset, which is small but relevant to the task at hand.
Then, we use self-supervised learning to train this visual encoder on all the frames in our offline training dataset. In this work, we use Bootstrap Your Own Latent (BYOL) as the self-supervision objective. As illustrated in Fig. 2, BYOL uses two versions of the same encoder network: an online network updated normally, and a slow moving average of the online network called the target network. The BYOL self-supervised loss reduces the discrepancy between the two networks' outputs when they are fed differently augmented versions of the same image. Although we use BYOL in this work, VINN can also work with other self-supervised representation learning methods [7, 8, 5, 4] (Table III).
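The target network's slow moving-average update can be sketched in a few lines (a minimal NumPy illustration of the exponential moving average only, not the full BYOL objective; the names and the decay rate `tau` are ours):

```python
import numpy as np

def ema_update(target_params, online_params, tau=0.99):
    """Slow moving-average update used for a BYOL-style target network.

    target <- tau * target + (1 - tau) * online, applied per parameter.
    A tau close to 1 makes the target network change slowly.
    """
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

# Toy example with one "parameter" tensor per network.
online = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
target = ema_update(target, online, tau=0.9)
print(target[0][0, 0])  # 0.1: moved 10% of the way toward the online value
```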
In practice, we initialize both the BYOL online and target networks with an ImageNet-pretrained encoder. Then, using the BYOL objective, we finetune them to better fit our image distribution. Once the self-supervised training is done, we encode all our training demonstration frames with the encoder to obtain the set of their embeddings.
III-B k-Nearest Neighbors Based Locally Weighted Regression
The set of embeddings given by our encoder holds compact representations of the demonstration images. Thus, at test time, given an input observation, we search for demonstration frames with similar features: we find the k nearest neighbors of the encoded input within the set of demonstration embeddings. In Fig. 3, we see that these nearest neighbors are visually similar to the query image. Our algorithm implicitly assumes that similar observations should result in similar actions. Thus, once we have found the nearest neighbors of our query, we set the next action as a weighted average of the actions associated with those neighbors.
Concretely, this is done by performing a nearest neighbors search based on the Euclidean distance between embeddings, $d_i = \lVert y - y_i \rVert_2$, where $y$ is the query embedding and $y_i$ is the embedding of the $i$-th nearest neighbor. Once we find the $k$ nearest neighbors and their associated actions, namely $(a_1, \dots, a_k)$, we set the action $\hat{a}$ as the Euclidean kernel weighted average of those examples' associated actions:

$$\hat{a} = \sum_{i=1}^{k} \frac{e^{-d_i}}{\sum_{j=1}^{k} e^{-d_j}}\, a_i.$$
In practice, this turns out to be the average of the observations’ associated actions weighted by the SoftMin of their distance from the query image in the embedding space.
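As a concrete illustration, this action-selection step can be sketched as follows (a minimal NumPy version with toy two-dimensional embeddings and scalar actions; the function and variable names are ours, not from the paper's code):

```python
import numpy as np

def vinn_action(query_emb, demo_embs, demo_actions, k=3):
    """k-NN locally weighted regression over demonstration actions.

    Finds the k demonstration embeddings closest (in Euclidean distance)
    to the query, then averages their actions weighted by the SoftMin
    of their distances.
    """
    dists = np.linalg.norm(demo_embs - query_emb, axis=1)  # (N,)
    nn_idx = np.argsort(dists)[:k]                         # k nearest indices
    w = np.exp(-dists[nn_idx])                             # SoftMin numerators
    w = w / w.sum()                                        # normalize weights
    return w @ demo_actions[nn_idx]                        # weighted average

# Toy 2-D embeddings with scalar actions.
embs = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
acts = np.array([1.0, 2.0, 100.0])
a = vinn_action(np.array([0.1, 0.0]), embs, acts, k=2)
print(1.0 < a < 2.0)  # True: the distant outlier is excluded by k=2
```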
III-C Deployment in real-robot door opening
For our robotic door opening task, we collect demonstrations using the DemoAT tool. Here, a reacher-grabber is mounted with a GoPro camera to collect a video of each trajectory. We pass the series of frames into a structure from motion (SfM) method, which outputs the camera's location in a fixed frame. From the sequence of camera poses, which consist of coordinates and orientation, we extract the translational motion, which becomes our action. To extract the gripper state, we train a gripper network that outputs a distribution over four classes (open, almost open, almost closed, closed), which represent various stages of gripping. Then, we feed these images and their corresponding actions into our imitation learning method.
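The action-extraction step from camera poses can be sketched as follows (a minimal NumPy version handling only the translational part; the trajectory values are made up for illustration):

```python
import numpy as np

def poses_to_actions(positions):
    """Translational actions from an SfM camera trajectory: each action
    is the translation between consecutive camera positions,
    a_t = p_{t+1} - p_t (orientation handling is omitted here)."""
    positions = np.asarray(positions, dtype=float)
    return positions[1:] - positions[:-1]

# Three camera positions (x, y, z) in a fixed frame -> two actions.
trajectory = [[0.0, 0.0, 0.5], [0.0, 0.1, 0.5], [0.1, 0.1, 0.4]]
actions = poses_to_actions(trajectory)
print(actions.shape)  # (2, 3)
```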
To train our visual encoders, we train ImageNet-pretrained BYOL encoders on individual frames of our demonstration dataset without action information. The same dataset, with action information, serves as the demonstration dataset for the k-NN based action prediction. Note that although we use task-specific demonstrations for representation learning, our framework is compatible with other forms of unlabelled data such as offline datasets [17, 14] or task-agnostic play data.
To execute our door-opening skill on the robot, we run our model in a closed-loop manner. After resetting the robot and the environment, on every step we retrieve the robot observation and query the model with it. The model returns a translational action as well as a gripper state, and the robot moves by the translational action scaled element-wise by a vector $\lambda$, where $\lambda$ is a hyper-parameter that mitigates our SfM model's inaccuracies and improves transfer from human demonstrations to robot execution. In addition, for nearest neighbor based methods, we have hyper-parameters that map the predicted floating-point value to a discrete gripper state, tuned per experiment.
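A sketch of the execution-time scaling and gripper mapping, under the assumption that the scaling is a simple element-wise product and the gripper mapping a threshold (the specific values here are illustrative, not the tuned hyper-parameters from our experiments):

```python
import numpy as np

def scale_action(action, lam):
    """Element-wise scaling of the predicted translation by a
    hyper-parameter vector before execution."""
    return np.asarray(lam) * np.asarray(action)

def gripper_state(value, close_thresh=0.5):
    """Hypothetical threshold mapping the model's floating-point
    gripper prediction to a discrete command."""
    return "closed" if value > close_thresh else "open"

# Damp a predicted translation before sending it to the robot.
a = scale_action([0.10, -0.04, 0.02], lam=[0.5, 0.5, 0.25])
print(a.tolist())  # [0.05, -0.02, 0.005]
```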
IV Experimental Evaluation
In the previous sections we described our framework for visual imitation, VINN. In this section, we seek to answer our key question: how well does VINN imitate human demonstrations? To answer it, we evaluate both on offline datasets and in closed-loop real-robot settings. Additionally, we probe VINN's ability to generalize from few demonstrations in settings where imitation algorithms usually suffer.
IV-A Experimental Setup
We conduct two different sets of experiments: the first on offline datasets for Pushing, Stacking, and Door-Opening, and the second on real-robot door opening.
Offline Visual Imitation Datasets
Data for the Pushing and Stacking tasks is taken from prior work. The goal in the pushing task is to slide an object on a surface into a red circle. In the stacking task, the goal is to grasp an object present in the scene, move it on top of another object in the scene, and release it. To avoid confusion, in the expert demonstrations for stacking, the closest object is always placed on top of the more distant object. The action labels are end-effector movements, which in this case are the translation vectors between the current frame and the subsequent one. In each case, a diverse set of backgrounds and objects makes up the scene and the task, making the tasks difficult.
For Door Opening, data is collected by 3 data-collectors in their kitchens. This amounts to a total of 71 demonstrations for training and 21 demonstrations for testing. We normalize all actions in the dataset to account for scale ambiguity from SfM. For all three tasks, we calculate the MSE loss between the ground-truth actions and the actions predicted by each of the methods. Note that the number of demonstrations collected for Door Opening is an order of magnitude smaller than for Stacking and Pushing, which contain around 750 and 930 demonstrations respectively. To understand the performance of the various models in low-data settings, we create subsampled Pushing and Stacking datasets, each containing 71 demonstrations for training and 21 for testing. This subsampling makes all three of our datasets the same size.
We conduct our robot experiments on a loaded cabinet door opening task (see Fig. 1), where the goal of the robot is to grab hold of the cabinet handle and pull open the cabinet door. We use the Hello-Robot Stretch for this experiment. When evaluations start, the arm resets to a fixed distance away from the cabinet door, with a random lateral translation parallel to the cabinet to evaluate generalization to varying starting states.
IV-B Baselines
We run our experiments for baseline comparison using the following methods:
Random Action: In this baseline, we sample a random action from the action space.
Open Loop: We find the maximum-likelihood open-loop policy given all our demonstrations, which takes the average over all actions seen in the dataset at timestep $t$. In a Bayesian sense, if standard behavioral cloning is trying to approximate $P(a \mid o)$, this model is trying to approximate $P(a \mid t)$.
Implicit Behavioral Cloning: We train Implicit BC  models on the tasks, modifying the official code.
ImageNet features + NN: Instead of self-supervision, here we use the image representation generated by a pretrained ImageNet encoder, akin to prior work. The only difference between this baseline and our method is that it forgoes the finetuning step on our dataset; it thus highlights the importance of self-supervised pre-training on a domain-related dataset.
Self-supervised learning method + NN: This is our method; we compare three different ways of learning self-supervised representation features from our dataset – BYOL, SimCLR, and VICReg – starting from an ImageNet-pretrained ResNet-50, and then use locally weighted regression to find the action.
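For concreteness, the Open Loop baseline above can be sketched as a per-timestep average over demonstrations (a minimal NumPy version assuming equal-length demonstrations; the names are ours):

```python
import numpy as np

def open_loop_policy(demo_action_seqs):
    """Open-loop baseline: the average action at each timestep t,
    ignoring observations (approximating P(a | t) rather than P(a | o)).

    `demo_action_seqs` has shape (num_demos, T, action_dim)."""
    return np.asarray(demo_action_seqs).mean(axis=0)  # (T, action_dim)

# Two toy demonstrations, each with two 2-D actions.
demos = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 0.0], [0.0, 3.0]],
])
policy = open_loop_policy(demos)
print(policy.tolist())  # [[2.0, 0.0], [0.0, 2.0]]
```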
IV-C Training Details
Each encoder network used in this paper follows the ResNet-50 architecture with the final linear layer removed. Unless specified otherwise, we always initialize the weights of the ResNet-50 encoder with a model pretrained on the ImageNet dataset. For VINN, we train our self-supervised encodings with the BYOL loss. For standard end-to-end BC, we replace the last linear layer with a three-layer MLP and train it with the MSE loss. For BC-rep, we freeze the encoding network to the weights trained by BYOL on our dataset and train just the final layers with the MSE loss. Additionally, for all visual learning, we use random crop, random color jitter, random grayscale, and random blur augmentations. We trained the self-supervised finetuning methods for 100 epochs on all three datasets.
IV-D How does VINN Perform on Offline Datasets?
For our first evaluation, we compare our method against the baselines on Mean-Squared Error loss for the Pushing, Stacking, and Door-Opening tasks in Fig. 4. To understand the impact of training dataset size on the algorithms, we train models on multiple subsamples of different sizes from each dataset. We see that while end-to-end Behavioral Cloning starting from pretrained ImageNet representations can be better with a large number of training demonstrations, Nearest Neighbor methods are either competitive or better in low-data settings.
On the Stacking and Door-Opening tasks, VINN is significantly better when the number of training demonstrations is small. On the Pushing task, by contrast, we notice that the task might be too difficult to solve with a small number of demonstrations. One reason for this is that BYOL might not be able to extract the most relevant representations for this task; further experiments in Table III show that using other forms of self-supervision such as VICReg can significantly improve performance here. Overall, these experiments support our hypothesis that, provided with good representations, nearest-neighbor techniques can be a competitive alternative to end-to-end behavior cloning.
IV-E How does VINN Perform on Robotic Evaluation?
Next, we run VINN and the baselines in our real robot environment. In this setting, our test environment comprises the same three cabinets where training demonstrations were collected, presented without any visual modifications. For each of our models, we run 30 rollouts with the robot in the real world across the three cabinets. On each rollout, the starting position of the robot is randomized as detailed in Sec. IV-A. In Table I, we show the percentage of successes over the 30 rollouts of each model, where we record both the number of times the robot successfully grasped the handle and the number of times it fully opened the door.
| Method | Handle grasped | Door opened |
| --- | --- | --- |
| BC (end to end) | 0% | 0% |
| BC on representations | 56.7% | 53.3% |
| ImageNet features + NN | 20% | 0% |
| VINN (BYOL + NN) | 80% | 80% |
As we see from Table I, VINN does better than all BC variants at successfully opening the cabinet door when there is minimal difference between the test and train environments. Noticeably, it shows that relying on self-supervised features trained with augmented data makes the models much more robust. BC, as an end-to-end parametric model, has no strong prior over actions once a wrong move pushes the visual observations out of distribution. On the other hand, VINN can recover from a certain degree of deviation using the nearest neighbor prior, since the predicted translation actions typically tend to re-center the robot instead of pushing it further out of distribution.
IV-F To What Extent does VINN Generalize to Novel Scenes?
To test the generalization of our robot algorithms to novel scenes in the real world, we modified one of our test cabinets with various levels of occlusion. We show frames from a sample rollout in each environment in Fig. 5, which also shows the cabinet modifications.
| Environment modification | BC on representations | VINN |
| --- | --- | --- |
| Baseline (no modifications) | 90% | 80% |
| Covered signs and handle | 10% | 70% |
| Covered signs, handle, and one bin | 0% | 50% |
| Covered signs, handle, and both bins | 0% | 0% |
In Table II, we see that VINN only fails completely when all the visual landmarks on the cabinet are occluded. This failure is expected: without coherent visual markers, the encoder fails to convey information, and thus the k-NN step also fails. Even so, VINN succeeds at a higher rate despite significant modifications to the cabinet, while BC-rep fails completely.
Across all the real robot experiments, we find the following phenomenon: while a good MSE loss is not sufficient for good performance in the real world, the two are still correlated, and a low MSE loss seems to be necessary for good real-world performance. This observation lets us test hypotheses offline before deploying and testing them on a real robot, which can be time-consuming and expensive. We hypothesize that this gap between performance on the MSE metric (Table III) and real-world performance (Tables I, II) comes from variability in different models' ability to perform well off the training manifold, where they may need to correct previous errors.
Table III: MSE losses on predicted actions for a set of baseline methods and ablations, reported both without pretraining and with ImageNet pretraining. Standard deviations, when reported, are over three randomly initialized runs.
IV-G How Important are the Design Choices Made in VINN for Success?
VINN consists of two primary components: the visual encoder and the nearest-neighbor based action module. In this section, we consider the major design choices we made for each of them.
Choosing the Right Self-supervision
While we use a BYOL-based self-supervised encoding in our algorithm, there are multiple alternative self-supervised methods such as SimCLR and VICReg [7, 4]. In a small set of experiments, we observed similar MSE losses with SimCLR and VICReg. From Table III, we see that BYOL does best on Door-Opening and Stacking, while VICReg does better on Pushing. However, we choose BYOL for our robot experiments since it requires less tuning overall.
Ablating Pretraining and Fine-tuning
Another large gain in our algorithm comes from initializing our visual encoders with a network trained on ImageNet. In Table III, we also show MSE losses from models obtained by ablating these components of VINN. Removing ImageNet initialization gives the BYOL + NN (No Pretraining) column, which performs much worse than VINN. Similarly, the success of VINN depends on the self-supervised fine-tuning on our dataset; ablating it gives the model shown in the ImageNet + NN column of Table III. That model performs only slightly worse than VINN on the MSE metric; however, in Table I, we see that it performs poorly in the real world. These ablations show that the performance of our locally weighted regression based policy depends on the quality of the representation: a good representation leads to better nearest neighbors, which in turn lead to a better policy both offline and online.
Performing Implicit instead of Explicit Imitation
Moving away from explicit forms of imitation where the model predicts actions directly, we run baselines with Implicit Behavioral Cloning (IBC). As we see in Table III, this baseline fails to learn behaviors significantly better than the random or open-loop baselines. We believe there are two reasons for this. First, implicit models have to model the energy over the full space (action space × observation space), which requires more data than the few demonstrations in our datasets. Second, the official implementation of IBC models actions over all of $\mathbb{R}^3$ instead of its much smaller subspace of normalized 3D vectors; this much larger action space, over which IBC tries to model the action, might have resulted in worse performance. While VINN makes the implicit assumption that the locally weighted average of valid actions yields a valid action, its output can be freely projected onto any relevant space without further processing, which makes it more flexible.
Learning a Parametric Policy on Representations
Our Behavioral Cloning on representations (BC-rep) baseline across all our experiments (Sec. IV) shows the performance of using learned representations to train a parametric behavioral policy. In both the MSE losses (Table III) and the real-world experiments (Tables I, II), this is the baseline that comes closest to VINN. However, the difference between BC-rep and VINN becomes more pronounced as the gap between training and test domains or the policy horizon grows. These experimental results indicate that using a non-parametric policy may be what makes us robust to out-of-distribution samples.
Choosing the Right k for k-Nearest Neighbors
Finally, we study the effect of different values of k for the k-NN based locally weighted controller in VINN. This parameter is important because with too small a k, the predicted action may stop being smooth. On the other hand, with too large a k, unrelated examples may start influencing the predicted action. By plotting our model's normalized MSE loss on the validation set against the value of k in Fig. 6, we find that a small k is ideal for achieving low validation loss while averaging over only a few actions. Beyond that, we did not notice any significant improvement to our model from increasing k.
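A sweep like the one in Fig. 6 can be sketched as follows (a toy NumPy setup with synthetic embeddings whose first two dimensions determine the action; all names and data are ours, for illustration only):

```python
import numpy as np

def knn_action(query, demo_embs, demo_actions, k):
    """SoftMin-weighted k-NN action prediction."""
    dists = np.linalg.norm(demo_embs - query, axis=1)
    idx = np.argsort(dists)[:k]
    w = np.exp(-dists[idx])
    w = w / w.sum()
    return w @ demo_actions[idx]

def validation_mse(k, demo_embs, demo_actions, val_embs, val_actions):
    """Mean squared error of k-NN predictions on a held-out set."""
    preds = np.stack([knn_action(q, demo_embs, demo_actions, k) for q in val_embs])
    return float(np.mean((preds - val_actions) ** 2))

# Synthetic data: actions are a function of the embedding.
rng = np.random.default_rng(0)
embs = rng.normal(size=(50, 4))
acts = embs[:, :2]
val_embs = rng.normal(size=(10, 4))
val_acts = val_embs[:, :2]

# Sweep k and pick the value with the lowest validation loss.
losses = {k: validation_mse(k, embs, acts, val_embs, val_acts) for k in (1, 2, 4, 8, 16)}
best_k = min(losses, key=losses.get)
```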
IV-H Computational Considerations
While the datasets we used for our experiments were not large, we recognize that our current nearest neighbor implementation scales linearly with the size of the training dataset under a naive algorithm. However, we believe VINN to be practical: first, it was designed mostly for the small-demonstration-dataset regime, where the dataset size is quite small; second, this search can be sped up beyond the naive method with a compiled index using open-source libraries such as FAISS, which are optimized to run nearest neighbor search on the order of a billion examples. Currently, the time our algorithm takes to encode an image and to perform nearest neighbors regression is only a small speed penalty for the robotic tasks we consider.
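The naive exact search we use can be sketched as follows (a minimal NumPy version; an approximate index would only replace this function, leaving the rest of VINN unchanged):

```python
import numpy as np

def nearest_neighbors_naive(query, demo_embs, k):
    """Exact nearest-neighbor search: O(N) distance computations per
    query. For small demonstration datasets this is fast; for much
    larger N, an approximate index (e.g. from a library like FAISS)
    can take its place."""
    dists = np.linalg.norm(demo_embs - query, axis=1)
    idx = np.argpartition(dists, k)[:k]        # k smallest, unordered
    return idx[np.argsort(dists[idx])]         # ordered by distance

embs = np.random.default_rng(1).normal(size=(1000, 16))
neighbors = nearest_neighbors_naive(embs[0], embs, k=5)
print(neighbors[0])  # 0: the query is its own nearest neighbor
```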
V Limitations and Future Work
In this work we proposed VINN, a new visual imitation framework that decouples visual representation learning from behavior learning. Although this decoupling improves over standard visual imitation methods, there are several avenues for future work. First, there are still remaining hurdles to generalizing to a new scene, as seen in Sec. IV-F, where our model fails when all large, recognizable markers are removed from the scene. While our NN-based action estimation lets us add new demonstrations easily, we cannot easily adapt our representation to such drastic changes in scene; an incremental representation learning algorithm has great potential to improve on this. Second, our self-supervised learning is currently done on task-related data, while ideally, if the dataset is expansive enough, task-agnostic pre-training should also give good performance. Finally, although our framework focuses on a single-task setting, we believe that learning a joint representation for multiple tasks could reduce the overall training overhead while remaining just as accurate.
We thank Rohith Mukku for his help with writing code and running ablation experiments. We thank Dhiraj Gandhi, Pete Florence, and Soumith Chintala for providing feedback on an early version of this paper. This work was supported by grants from Honda, Amazon, and ONR award numbers N00014-21-1-2404 and N00014-21-1-2758.
- (1994) Learning to catch: applying nearest neighbor algorithms to dynamic control tasks. In Selecting Models from Data, P. Cheeseman and R. W. Oldford (Eds.), New York, NY, pp. 321–328.
- (2009) A survey of robot learning from demonstration. Robotics and Autonomous Systems 57 (5), pp. 469–483.
- (1997) Locally weighted learning. Lazy Learning, pp. 11–73.
- (2021) VICReg: variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.
- (2020) Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882.
- (2020) Robust policies via mid-level visual representations: an experimental study in manipulation and navigation. arXiv preprint arXiv:2011.06698.
- (2020) Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029.
- (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
- (2016) Unsupervised visual representation learning by context prediction.
- Discriminative unsupervised feature learning with exemplar convolutional neural networks.
- (2017) One-shot imitation learning. In Advances in Neural Information Processing Systems, pp. 1087–1098.
- (2021) With a little help from my friends: nearest-neighbor contrastive learning of visual representations. arXiv preprint arXiv:2104.14548.
- (2021) Implicit behavioral cloning. arXiv preprint arXiv:2109.00137.
- (2020) D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
- (2018) Unsupervised representation learning by predicting image rotations.
- (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.
- (2020) RL Unplugged: benchmarks for offline reinforcement learning. arXiv e-prints, arXiv–2006.
- (2015) Deep residual learning for image recognition.
- (2017) Imitation learning: a survey of learning methods. ACM Computing Surveys (CSUR) 50 (2), pp. 1–35.
- (2017) Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.
- (2021) The design of Stretch: a compact, lightweight mobile manipulator for indoor human environments. arXiv preprint arXiv:2109.10892.
- (2016) Robust reinforcement learning with relevance vector machines. Robot Learning and Planning (RLP 2016), pp. 5.
- (2018) Simple nearest neighbor policy method for continuous control tasks.
- (2020) Keypoints into the future: self-supervised correspondence in model-based reinforcement learning. arXiv preprint arXiv:2009.05085.
- (2018) A survey of product quantization. ITE Transactions on Media Technology and Applications 6 (1), pp. 2–10.
- (1983) Newborn infants imitate adult facial gestures. Child Development.
- (1977) Imitation of facial and manual gestures by human neonates. Science.
- (2016) Shuffle and learn: unsupervised learning using temporal order verification.
- (2017) A survey of structure from motion.
- (2019) PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, pp. 8026–8037.
- (2013) Play, dreams and imitation in childhood. Vol. 25, Routledge.
- (2017) Neural episodic control.
- (2018) Towards generalization and simplicity in continuous control.
- (2010) A reduction of imitation learning and structured prediction to no-regret online learning. arXiv preprint arXiv:1011.0686.
- (2019) Learning to navigate using mid-level visual priors. arXiv preprint arXiv:1912.11121.
- (2016) Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699.
- (2018) Q-learning with nearest neighbors.
- (2017) Prototypical networks for few-shot learning.
- (2020) Grasping in the wild: learning 6DoF closed-loop grasping from low-cost demonstrations. IEEE Robotics and Automation Letters 5 (3), pp. 4978–4985.
- (2017) Third-person imitation learning. ICLR.
- (2021) Decoupling representation learning from reinforcement learning. International Conference on Machine Learning, pp. 9870–9879.
- (1993) Imitative learning of actions on objects by children, chimpanzees, and enculturated chimpanzees. Child Development 64 (6), pp. 1688–1705.
- (2018) Behavioral cloning from observation. arXiv preprint arXiv:1805.01954.
- (2019) DoorGym: a scalable door opening environment and baseline agent. CoRR abs/1908.01887.
- (2019) SimpleShot: revisiting nearest-neighbor classification for few-shot learning.
- (2018) Unsupervised feature learning via non-parametric instance-level discrimination.
- (2021) Mastering visual continuous control: improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645.
- (2021) Reinforcement learning with prototypical representations. arXiv preprint arXiv:2102.11271.
- (2020) Visual imitation made easy. arXiv e-prints, arXiv–2008.
- (2021) Playful interactions for representation learning. arXiv preprint arXiv:2107.09046.
- (2020) A framework for efficient robotic manipulation. arXiv preprint arXiv:2012.07975.
- (2018) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In ICRA.
- (2018) Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564.