Graph-Structured Visual Imitation

by Maximilian Sieb, et al.
Carnegie Mellon University

We cast visual imitation as a visual correspondence problem. Our robotic agent is rewarded when its actions result in better matching of relative spatial configurations for corresponding visual entities detected in its workspace and in the teacher's demonstration. We build upon recent advances in computer vision, such as human finger keypoint detectors, object detectors trained on-the-fly with synthetic augmentations, and point detectors supervised by viewpoint changes, and learn multiple visual entity detectors for each demonstration without human annotations or robot interactions. We empirically show that the proposed factorized visual representations of entities and their spatial arrangements drive successful imitation of a variety of manipulation skills within minutes, using a single demonstration and without any environment instrumentation. Our approach is robust to background clutter and generalizes effectively across environment variations between demonstrator and imitator, greatly outperforming the unstructured, non-factorized full-frame CNN encodings of previous works.




1 Introduction

Humans learn skills by watching other humans [3]. The ability to learn from observation, called visual imitation [4] or third-person imitation [5], has always been a much-desired goal in artificial intelligence as a means of quickly programming agents in an intuitive manner, as opposed to hard-coding their behaviors. Visual imitation requires fine-grained understanding of the demonstrator’s visual scene and its changes over time. The imitator then uses its own embodiment and dynamics to cause a similar change in its own environment. Visual imitation thus boils down to learning a visual similarity function between the demonstrator’s and imitator’s environments, whose maximization through the imitator’s actions results in correct skill imitation. This similarity function determines which aspects of the visual observations are relevant to reproducing the demonstrated skill, i.e., it defines what to imitate and what to ignore.


We propose hierarchical graph video representations, called Visual Entity Graphs (VEGs), where nodes represent visual entities (objects, parts or points) tracked in space and time, and edges represent their relative 3D spatial arrangements. Our video graph encoding is based on the observation that the appearance of the scene at some level of abstraction (objects, parts or points) remains constant over time, while the spatial arrangements of entities change over time, an observation which follows directly from the laws of Newtonian physics. The proposed hierarchical visual entity graphs disentangle the what and the where of the scene at multiple levels of abstraction: nodes represent visual entities that persist over time, and edges within each graph represent their relative 3D spatial arrangements that may change over time. For each pair of timesteps, we build two VEGs, one for the demonstrator and one for the imitator. Their nodes are in one-to-one correspondence, as shown in Figure 1. Our imitation reward function then measures agreement of the relative spatial configurations between corresponding node pairs, and guides reinforcement learning of manipulation tasks from a single video demonstration using a handful of real-world interactions.

Figure 1: Graph-structured visual imitation. We show the VEGs for a human demonstration and robot imitation for two timesteps. Corresponding nodes in the human demonstration (left) and robot imitation (right) share the same color. The graphs are hierarchical. Edges exist between object, robot and human hand nodes and point feature nodes, and are added and deleted dynamically over time based on motion saliency, as shown in the figure with solid and dashed lines, respectively. Our graph representation is robust to viewpoint variation between demonstrator and executor, and can handle cluttered backgrounds, as illustrated in the figure.

Under the proposed VEG encoding, visual imitation boils down to learning to detect corresponding visual entities (objects, parts, or points) between the demonstrator’s and imitator’s environments. This requires fine-grained visual understanding of both the demonstrator’s and the imitator’s environments. The challenge in this visual parsing problem is that objects used by the demonstrator are often not included in the labelled image or object categories of the ImageNet [7] and MS-COCO [8] datasets, so pre-trained object detectors are often not useful. We instead opt for scene-specific self-supervised detectors for points and objects. We use self-supervised point visual feature detectors trained by viewpoint changes, and visual detectors of objects and parts trained from synthetically generated images that augment a video at hand. Last, we also use human hand keypoint detectors to parse the teacher’s hand trajectories. The proposed scene-conditioned visual entity detectors establish correspondences between the demonstrator’s and imitator’s workspaces, despite differences in occlusion patterns, viewpoint changes or robot-human body visual discrepancies (Figure 1). We use the resulting reward function for trajectory optimization [9], and show it can imitate a single human demonstration from a handful of real-world trials on a Baxter robot.

In summary, our contributions are as follows: i) We propose a what-where hierarchical graph visual encoding for visual correspondence estimation. ii) We propose scene-specific point and object visual detectors, as well as human hand detectors. To the best of our knowledge, this is the first work that uses human finger visual detectors, as opposed to environment instrumentation [10], to track the demonstrator’s hand for visual imitation learning of object manipulations. iii) We imitate using a single demonstration, without any robot random exploration as in [11, 4], or any data of the robot performing the task as in [2]. We do so without ever having access to expert actions. iv) We show imitation results on a real robotic platform.

We compare our proposed representation against full-frame image encodings of previous works [2, 11] that do not use a what-where decomposition during matching. Our experiments suggest they require a very large number of video examples of humans and robots executing the task to acquire generalization abilities similar to our method, and that they fail to imitate the demonstrated skill most of the time. When humans imitate fellow humans, they are equipped with excellent visual detectors, visual feature extractors, and motion estimators, as opposed to learning those from scratch for every new task. We opt for a similar transfer of machine vision knowledge during imitation for robotic agents.

Our code and videos are available at

2 Related Work

Visual imitation learning

Imitation learning addresses the problem of learning skills by observing expert demonstrations [6]. However, most previous approaches assume that expert demonstrations are given in the workspace of the agent (e.g., through kinesthetic teaching or teleoperation [12, 13]) and that the actions or decisions of the expert can be imitated directly [14]. Imitating humans based on visual information is much more challenging due to the difficult visual inference needed for fine-grained activity understanding [5]. In this case, a mapping from observations in the demonstrator space to observations in the imitator space is required and is essential for successful imitation [15]. Numerous works bypass the difficult perception problem by using special instrumentation of the environment, such as AR tags, to read off object and hand 3D locations and poses during video demonstrations, and use rewards based on known 3D object goal configurations [16, 17]. Other works use hand-designed reward detectors that work only in restrictive scenarios [18]. Direct matching of pixel intensities is not a meaningful measure of similarity [19], as it is easily spoiled by differences in viewpoint, human body versus robot body parts, illumination changes, or changes in object poses.

Recent approaches attempt instead to learn such visual similarity by training and matching whole-image feature embeddings directly, and avoid explicit extraction of the scene structure in terms of objects and their 3D poses. Numerous objectives have been proposed to learn full-image or image-sequence convolutional embeddings, such as multiview-invariant and time-contrastive objectives in [2, 20], forward and inverse dynamics model learning in [4, 21], or reconstruction and temporal prediction objectives in [22, 23]. The work of [24] provides an overview of common objectives and inductive biases for state representation learning. However, the data used to train such image embeddings include both human video demonstrations and robot executions of the task, or parts of the task, so that the neural network embedding function learns to be robust to the presence of the robotic gripper or human hand. The requirement that the robot execute the task defeats the purpose of visual imitation and brings it closer to kinesthetic teaching. The same holds for the recent work of [25], which, despite the paper’s title, learns an image encoding via frame prediction using the robot’s execution data. In contrast, learning our graph video encoding does not require robot executions.

Similar to our work, work of [26] also uses 3D spatial object arrangements to guide visual imitation of manipulation tasks. However, they do not consider human keypoints or any entities finer than objects, which suggests their method can only imitate simple translation tasks, where object pose is not relevant (e.g., they cannot handle rotation). Work of [27] utilizes human pose detectors to imitate 3D human motion extracted from YouTube videos of acrobatic activities in simulation. However, no contact with objects is considered, and imitation of human motion only happens in a simulated agent. In comparison, we do not imitate motion alone, but rather, we carry out a desired manipulation of the environment. Human hands are part of the graph we attempt to create, but so are the surrounding objects in the scene.

Scene graphs, object-centric reinforcement learning, and relational neural networks

Representing a visual observation in terms of objects or parts and their pairwise relations has been found beneficial for generalization of action-conditioned scene dynamics [28, 29, 30], body motion and person trajectory forecasting [31, 32], and reinforcement learning [33]. Such graph-encodings have also been used to learn a model of the agent [34], and use it for model-predictive control in non-visual domains. Work of Devin et al. [35] uses pretrained object detectors and learns attention over the obtained detection boxes, which are incorporated as part of the state representation for policy learning. The graph representation we propose in this work not only employs explicit attention to relevant objects, parts, and points, but also preserves their correspondence in time, i.e., the detectors bind with specific objects, parts, points over time.

3 Formulation

We encode a demonstration video of length $T$ provided by the human expert and an imitation video of the same length provided by the imitator in terms of two graph sequences $\{\mathcal{G}_D^t\}_{t=1}^{T}$ and $\{\mathcal{G}_I^t\}_{t=1}^{T}$, respectively. We omit the subscript $D$ or $I$ when either it is clear from the context or it is not important which workspace we are referring to. A node $v_i^t$ corresponds to the $i$th visual entity and its respective 3D world coordinate $x_i^t$, and an edge $e_{ij}^t$ corresponds to a 3D spatial relation between two node entities to be preserved during imitation. We define a visual entity node to be any object, object part, or point that can be reliably detected in the demonstrator’s and imitator’s workspace. All nodes are in one-to-one correspondence between the demonstrator and imitator graph, as shown in Figure 1. An entity can dynamically appear and disappear over time. We only require each entity associated with the demonstration sequence to have a corresponding entity in the imitation sequence.

We consider three types of nodes: object nodes, point nodes, and hand/robot nodes. An object node represents any rigid or non-rigid object that constitutes a separate physical entity in the world, while a point node represents any 3D physical point on an object, as seen in Figure 1. Hand/robot nodes represent the 3D location of the human wrist and the 3D location of the robotic gripper center. We do not consider edges between point nodes. Rather, each point node is connected only to the object node it is part of. In that sense, our graphs are hierarchical.

Figure 2: Detecting visual entities. We use human hand keypoint detectors, multi-viewpoint feature learning, and synthetic image generation for on-the-fly object detector training from only a few object mask examples. Using a manually designed mapping between the human hand and the robot, the visual entity detectors can effectively bridge the visual gap between demonstrator and imitator environment, are robust to background clutter, and generalize across different object instances.

Our cost function at each time step measures visual dissimilarity across the demonstrator and imitator graphs in terms of relative spatial arrangements of corresponding entity pairs (Eq. 1). In this cost, a binary attention function determines whether a particular edge is present depending on the motion of the corresponding nodes, and each edge carries a weight. We tie weights across all edges of the same type, namely object-hand edges, object-object edges and object-point edges. These weights are hyper-parameters of our framework, and we set them empirically. Learning to adjust those weights per task is an interesting and straightforward direction, yet it would require more interactions than we can afford on the real robot platform. We leave this for future work.
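As an illustration of the cost described above, the sketch below sums, over active edges, a weighted disagreement between the demonstrator's and the imitator's relative 3D offsets. It is a sketch under stated assumptions, not the authors' implementation: the choice of the Euclidean norm, i.e. a cost of the assumed form $\sum_{(i,j)} a_{ij}\, w_{ij}\, \lVert (x^D_i - x^D_j) - (x^I_i - x^I_j) \rVert_2$, and the names `veg_cost`, `active` and `weights` are hypothetical.

```python
import numpy as np

def veg_cost(x_demo, x_imit, edges, active, weights):
    """Weighted disagreement of relative 3D offsets over active edges.

    x_demo, x_imit: (N, 3) arrays of corresponding node positions.
    edges: list of (i, j) node index pairs.
    active: 0/1 attention value per edge (assumed given by motion saliency).
    weights: one weight per edge (assumed tied per edge type by the caller).
    """
    x_demo = np.asarray(x_demo, dtype=float)
    x_imit = np.asarray(x_imit, dtype=float)
    cost = 0.0
    for (i, j), a, w in zip(edges, active, weights):
        rel_demo = x_demo[i] - x_demo[j]  # offset in the demonstrator's scene
        rel_imit = x_imit[i] - x_imit[j]  # offset in the imitator's scene
        # Assumption: Euclidean norm as the dissimilarity of the two offsets.
        cost += a * w * np.linalg.norm(rel_demo - rel_imit)
    return float(cost)
```

Because only relative offsets enter the sum, translating every node of one scene by the same vector leaves the cost unchanged, which matches the stated goal of comparing spatial arrangements rather than absolute positions.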

3.1 Detecting Visual Entities

We define a visual entity to be any object, object part, or point that can be reliably detected in the demonstrator’s and imitator’s workspace. For imitating fine-grained manipulation of an object, inferring the translation of its bounding box is not enough. Rather, the object’s 3D pose and deformation need to be inferred and imitated. A central design choice in our work is using point feature detectors and the motion of the detected points to infer the object’s change of pose between the demonstrator’s and imitator’s environments, without additional learning, as opposed to training object appearance features extracted within the object’s bounding box to encode the object’s change of pose or deformation. We opt for a what-where decomposition of the object’s appearance, as opposed to learning a whole-object box embedding.

We assume no access to human annotations that would mark relevant corresponding entities across the demonstrator’s and imitator’s environments. We instead train scene-specific object and point detectors for entities that can be reliably recognized across the demonstrator’s and imitator’s workspaces, and human hand keypoint detectors for tracking the human hand. At each step, the point detector randomly re-samples points on the detected object areas in the demonstration and computes the corresponding points in the imitator’s view, and is thus robust to partial occlusions. In the case of full occlusion, our hand and object detectors fall back to the last known location. Our detection pipeline therefore has some robustness to object occlusions and possible detector failures.

Human hand keypoint detectors

We make use of the state-of-the-art hand detectors of Simon et al. [36] to detect human finger joints, and obtain their 3D locations using a D435 Intel RealSense RGB-D camera. We rely on forward kinematics and a camera calibrated with respect to the robot’s coordinate frame to detect the 3D locations of the tips of the robot’s end-effector. We map the finger tips of a Baxter robot’s parallel-jaw gripper to the demonstrator’s thumb and index finger tips. We detect grasp and release actions by thresholding the distance between the two finger tips of the human during the demonstration of the task. End-to-end approaches such as [37] rely on large amounts of data to train hand-to-robot correspondences, and are therefore prohibitive in few-shot learning scenarios, where our method works thanks to the thousands of labelled hand examples that the human hand detectors of [36] have seen.
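The grasp and release detection described above amounts to a per-frame threshold on the thumb-to-index distance. The following is a minimal sketch of that idea. The function name `detect_gripper_commands` and the 3 cm threshold are hypothetical choices for illustration, since the text does not state the threshold used.

```python
import numpy as np

def detect_gripper_commands(thumb_tips, index_tips, close_below=0.03):
    """Label each frame 'close' or 'open' from fingertip distance.

    thumb_tips, index_tips: (T, 3) arrays of 3D fingertip positions in metres.
    close_below: distance threshold in metres (hypothetical value).
    """
    thumb = np.asarray(thumb_tips, dtype=float)
    index = np.asarray(index_tips, dtype=float)
    dist = np.linalg.norm(thumb - index, axis=1)  # per-frame distance
    return ["close" if d < close_below else "open" for d in dist]
```

A transition from "open" to "close" between consecutive frames can then be read as a grasp event, and the reverse transition as a release.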

Point feature detectors from cross-view correspondence

An agent that has access to its egomotion and observes a static scene from multiple views can infer visual correspondences across views through triangulation [38]. We use these self-generated visual correspondences to drive visual metric learning of deep feature descriptors that are robust to changes in the object pose or camera viewpoint. After training, we match such point features across the imitator’s and demonstrator’s environments to establish correspondence [1]. We collect multiview image sequences of the workspace of the robotic agent in an automatic fashion: we use an RGB-D camera attached to the robot’s end-effector and move the camera along random trajectories that cover many viewpoints of the scene at various distances from the objects. We use the robot’s forward kinematics model to estimate the camera poses via hand-eye calibration, which, in combination with the known intrinsic parameters and aligned depth images, allows for robust 3D reconstruction of the scene and provides accurate pixel correspondences across different viewpoints. The complete feature learning setup is illustrated in Figure 2(b). During training, we randomly sample image pairs and generate a number of matching and non-matching pixel pairs. We then minimize a pixel-wise contrastive loss [39, 40], which forces matching pixels to be close in the learned feature space while maintaining a distance margin for non-matching pixels. This point feature learning pipeline produces a mapping from an RGB image to dense per-point descriptors. Although supervision always comes from within-instance correspondences, generalization across different objects is expected due to the limited capacity of the network model. This gives our VEG representation the ability to generalize to novel objects unseen in the demonstration, as shown in Figure 2(b). We use ResNet-34 as our backbone and learn a 4-dimensional point embedding vector for each pixel in the image.
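To make the pixel-wise contrastive objective concrete, the sketch below evaluates it on two dense descriptor images given lists of matching and non-matching pixel pairs. It assumes the common squared-distance and squared-hinge form of the contrastive loss. The exact form, the margin value, and the name `pixel_contrastive_loss` are assumptions rather than details taken from the text, and both pair lists are assumed to be non-empty.

```python
import numpy as np

def _pair_distances(desc_a, desc_b, pairs):
    """Euclidean descriptor distance for each ((ra, ca), (rb, cb)) pixel pair."""
    a = np.array([desc_a[ra, ca] for (ra, ca), _ in pairs], dtype=float)
    b = np.array([desc_b[rb, cb] for _, (rb, cb) in pairs], dtype=float)
    return np.linalg.norm(a - b, axis=1)

def pixel_contrastive_loss(desc_a, desc_b, matches, non_matches, margin=0.5):
    """Contrastive loss over pixel pairs of two (H, W, D) descriptor images.

    Matching pairs are pulled together; non-matching pairs are pushed apart
    until they are at least `margin` away (margin value is hypothetical).
    """
    d_match = _pair_distances(desc_a, desc_b, matches)
    d_non = _pair_distances(desc_a, desc_b, non_matches)
    match_term = np.mean(d_match ** 2)
    non_match_term = np.mean(np.maximum(0.0, margin - d_non) ** 2)
    return float(match_term + non_match_term)
```

In training, the descriptor images would come from the network applied to the two views, and the pair lists from the 3D reconstruction. Here they are plain arrays so that the loss itself can be checked in isolation.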

Synthetic data augmentation

We use background subtraction to propose object 2D segmentation masks, and train a visual detector for each mask using synthetic data augmentation, as shown in Figure 2. Specifically, we create a large synthetic dataset by translating, scaling, rotating and changing the pixel intensity of the extracted RGB segmentation masks. The object masks often partially overlap with one another in the synthetic images. These overlaps help the agent learn to detect amodal object boxes under partial occlusions. Since we generate such images, we automatically know the groundtruth bounding box and mask that correspond to each object in each image. We then finetune a Mask R-CNN object detector [41]—initialized from weights learnt under the object detection and segmentation task in MS-COCO—to predict boxes and masks for the synthetically generated images.
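The compositing step behind this kind of augmentation can be sketched as follows. The snippet pastes a masked object crop onto a background at a chosen offset with an intensity change and returns the ground-truth mask and box. It is a simplified illustration: scaling and rotation are omitted, the crop is assumed to fit inside the image and to have a non-empty mask, and the function name `paste_object` is hypothetical.

```python
import numpy as np

def paste_object(background, obj_rgb, obj_mask, top, left, gain=1.0):
    """Paste a masked object crop onto a copy of the background.

    background: (H, W, 3) uint8 image; obj_rgb: (h, w, 3) uint8 crop;
    obj_mask: (h, w) boolean mask; (top, left): paste position;
    gain: multiplicative pixel-intensity change.
    Returns the composite, the full-size object mask, and the inclusive
    box (row_min, col_min, row_max, col_max).
    """
    out = background.copy()
    h, w = obj_mask.shape
    region = out[top:top + h, left:left + w]  # view into the composite
    scaled = np.clip(obj_rgb.astype(float) * gain, 0, 255).astype(np.uint8)
    region[obj_mask] = scaled[obj_mask]
    full_mask = np.zeros(background.shape[:2], dtype=bool)
    full_mask[top:top + h, left:left + w] = obj_mask
    rows, cols = np.nonzero(full_mask)
    box = (int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max()))
    return out, full_mask, box
```

Calling the function repeatedly on its own output places several objects in one image. Each box is recorded at paste time, so it still covers the object's full extent when a later paste partially covers it, which is what an amodal box label requires.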

3.2 Motion Saliency for Dynamic Graph Construction

We use similar motion saliency heuristics to decide dynamically over time which edges to consider in our imitation cost function of Eq. 1, by setting the binary attention value of an edge to indicate its presence. We define an anchor object to be any object in motion. When no object is moving (e.g., when the demonstrator is simply reaching towards an object), we define the anchor object to be the object that will move soonest in the future. We consider edges between the anchor object node and all other corresponding object nodes in the scene, as well as the hand/robot node and all point nodes that belong to the anchor object. This type of motion attention is a well-established principle that drives human attention when imitating other humans. The work of [42] uses AR markers to estimate such task relevance based on motion attention. We simply use our object detection network to track objects over time and infer movement of the different objects in the scene.
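The anchor selection rule described above can be sketched as below. This is an illustrative simplification under stated assumptions: motion is detected by thresholding the displacement between consecutive tracked 3D positions, only a single anchor is returned even if several objects move, and the names `select_anchor`, `active_edges` and the `motion_eps` threshold are hypothetical.

```python
import numpy as np

def select_anchor(object_tracks, t, motion_eps=0.005):
    """Pick the anchor object at time step t.

    object_tracks: dict mapping object name -> (T, 3) tracked 3D positions.
    Returns an object moving at step t if there is one, otherwise the object
    that starts moving soonest after t, or None if nothing moves again.
    """
    num_steps = len(next(iter(object_tracks.values())))
    for step in range(t, num_steps - 1):
        for name, track in object_tracks.items():
            track = np.asarray(track, dtype=float)
            if np.linalg.norm(track[step + 1] - track[step]) > motion_eps:
                return name
    return None

def active_edges(anchor, object_names, hand_name, points_of):
    """Edges from the anchor to other objects, the hand/robot node, and its points."""
    if anchor is None:
        return []
    edges = [(anchor, other) for other in object_names if other != anchor]
    edges.append((anchor, hand_name))
    edges += [(anchor, p) for p in points_of.get(anchor, [])]
    return edges
```

The returned edge list plays the role of the binary attention: edges in the list are treated as present, and all other edges are ignored by the cost at that time step.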

3.3 Policy Learning with Visual Entity Graphs

Our goal is for the robot to imitate the intended object manipulation task from a single human visual demonstration. We formulate this as a reinforcement learning problem, where at each time step the cost is given by Eq. 1. We minimize this cost with PILQR [9], a state-of-the-art trajectory optimization method that combines a model-based linear quadratic regulator with path integral policy search to better handle non-linear dynamics. We learn a time-dependent policy whose control gains are learned by alternating model-based and model-free updates, and the model of the a priori unknown dynamics is learned during training. The actions are defined as the changes in the robot end-effector’s 3D position and its orientation about the vertical axis, giving a 4-dimensional action space. The state representation, over which we learn linear dynamics, consists of the joint angles, the end-effector position, and the graph configuration of the scene, concatenated into one vector. The dimensionality of the resulting state space therefore depends on the number of objects, the number of joints of the robot, and the node feature chosen for each visual entity in the graph. In our case, we uniformly sample pixels at each time step, and we directly incorporate the averaged pairwise distance of all pixels across demonstration and imitation into the state space. We further use behavioral cloning to infer opening and closing of the robot gripper by thresholding the distance between the human index finger and thumb during the demonstration.
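The state construction and the time-dependent policy described above can be illustrated with a short sketch. It assumes a time-varying linear feedback form, action = K_t · state + k_t, which is the mean of the linear-Gaussian controllers PILQR optimizes. The exploration noise is omitted, and the names `build_state` and `linear_policy_action` as well as the dimensions in the usage note are hypothetical.

```python
import numpy as np

def build_state(joint_angles, ee_position, node_features):
    """Concatenate joint angles, end-effector position and per-node graph
    features into a single flat state vector."""
    parts = [np.ravel(joint_angles), np.ravel(ee_position)]
    parts += [np.ravel(f) for f in node_features]
    return np.concatenate(parts).astype(float)

def linear_policy_action(gains, offsets, state, t):
    """Mean action of a time-varying linear policy at step t.

    gains: (T, action_dim, state_dim) feedback matrices K_t.
    offsets: (T, action_dim) feedforward terms k_t.
    """
    return gains[t] @ state + offsets[t]
```

For example, a robot with 7 joints, a 3D end-effector position and two nodes with 3D features gives a 16-dimensional state, and gains of shape `(T, 4, 16)` map it to the 4-dimensional action of position change plus rotation about the vertical axis.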

4 Experiments

Figure 3: Examples of the imitation tasks we consider. (a) Pushing. (b) Pouring. (c) Stacking, and a comparison of the proposed VEG cost function with the TCN embedding cost [2]. The VEG-based cost curves are highly discriminative of the imitation quality for the stacking task, in contrast to TCN. For visualization purposes, costs are scaled by dividing by the maximum absolute value of the respective feature space. correct_end simply repeats the last frame of a successful imitation, showing that our cost function serves well as an attractor towards the final goal configuration.

We test our visual imitation framework on a Baxter robot. We consider the following tasks to imitate: i) Pushing: The robot needs to push an object while following specific trajectories. ii) Stacking: The robot needs to pick up an object, put it on top of another one and release the object. iii) Pouring: the robot needs to pour liquid from one container into another.

For every task, we train corresponding object detectors using synthetic data augmentation and point features using multi-view self-supervised feature learning. Note that both processes are fully automated and do not require human demonstrations or robot interactions. Rather, a camera that is set up to move around the scene in a prerecorded fashion suffices.

We compare our method against time-contrastive networks (TCN) of Sermanet et al. [2]. We are not aware of other instrumentation-free methods that have attempted single-shot imitation of manipulation skills without assuming extensive pre-training with random actions for model learning [11]. TCN trains an image embedding network using a triplet ranking loss [43], ensuring that temporally near pairs of frames are closer to one another in embedding space than any temporally distant pairs of frames. In this way, the feature extractor focuses on the dynamic part of the demonstration video.

Implementation

We implemented the TCN baseline using the same architecture as in [2], which consists of an Inception-v3 model [44], pretrained on ImageNet, up to the “Mixed 5d” layer, followed by two convolutional layers, a spatial softmax layer and a fully-connected layer. The network outputs a 32-dimensional embedding vector for the input image. For each imitation task, we train a corresponding TCN using 10 video sequences (5 human demonstrations and 5 robot executions) of the task being performed with the relevant objects and environment configuration, together with the human demonstration we provide for learning the policy with VEG. For policy learning with TCN, we use a cost function defined on the state embeddings of the imitation and the demonstration at each time step. While our method uses only a single human demonstration, it does require training the object detection and point-feature networks (although both the data synthesis for the object detector and the data collection for the point-feature network are fully automated). For a fair comparison, we put the same amount of effort into data collection for training TCN.
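The triplet ranking objective used to train the TCN baseline can be sketched as follows, with an anchor frame embedding, a temporally near positive and a temporally distant negative. The squared Euclidean distances and the margin value follow a common formulation of the triplet loss and are assumptions here. The function name `triplet_loss` is hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared embedding distances: the temporally near frame
    (positive) should be closer to the anchor than the temporally distant
    frame (negative) by at least `margin` (hypothetical value)."""
    anchor = np.asarray(anchor, dtype=float)
    d_pos = np.sum((anchor - np.asarray(positive, dtype=float)) ** 2)
    d_neg = np.sum((anchor - np.asarray(negative, dtype=float)) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))
```

The loss is zero once the negative is farther from the anchor than the positive by the margin, so only triplets that violate the temporal ordering contribute to training.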

Our experiments aim to (i) compare the proposed graph-structured encoding against the convolutional image encodings of previous work [2] for imitating skills, (ii) evaluate the robustness of our method against detector failures and occlusions, and (iii) evaluate its robustness to variability in the objects’ spatial configurations and to background clutter, as well as its generalizability across objects with different shapes and textures.

Reward Shaping

In Figure 3(c), we show cost curves for our method and the TCN baseline for each robot execution, which measure how well the robot is imitating the human demonstration. The horizontal axis denotes time and the vertical axis denotes imitation cost. The proposed graph-based cost function correctly identifies all correct robot imitations, despite heavy background clutter in the 5th row, and correctly signals the wrong imitation segments in the 2nd and 3rd rows. In contrast, the baseline TCN cost curves are non-discriminative. Highly discriminative cost curves are critical for effective policy learning, which we discuss next.

Task - Pushing

In this task, the robot needs to push the yellow octagon towards the purple ring following the trajectory shown by the demonstrator (Figure 3(a)). We evaluate three task variations: i) straight-line: Pushing the object along a straight line. ii) straight-line-grasped: Moving the yellow octagon along a straight line while it remains grasped throughout. iii) direction-change: Pushing the yellow octagon along a trajectory with a sharp direction change of 90 degrees. Imitating such a direction change requires the robot to change the point of contact with the object during pushing. For straight-line pushing, we attempted imitation in two different environments, one with a yellow octagon block and one with a smaller orange square block, to evaluate the generalization capability of our graph-based framework across objects with different geometries. In the second case, we simply use the object detector for the small orange square block in place of the detector for the yellow octagon block.

Table 1: Success rates for the pushing and stacking tasks.

In order to test the robustness of VEG to variations in the objects’ spatial configurations, for the basic straight-line pushing task, we randomly perturb the starting configurations of all the objects and the robot’s end-effector within a norm ball of diameter 6cm, and run the policy training 5 times for each object. With the VEG representation, the robot’s policy converges after 6 iterations in all tasks. We consider the task solved if the robot is able to push the object within 1cm of the desired target position. We report the success rate for each case in Table 1(a). The robot successfully solves the task in all 5 runs for the original yellow octagon, and in 4 out of 5 runs for the smaller orange square, demonstrating robustness against perturbations in the objects’ spatial configuration and strong generalization to objects with novel geometries. TCN failed in almost all runs, and for simple straight-line pushing could not cover even a quarter of the distance to the target, even with significantly more iterations. For the straight-line-grasped task, the robot is forced to keep the gripper closed. Hence, the robot cannot simply hard-imitate the hand trajectory, but rather needs to follow a trajectory that differs from the human demonstrator’s to successfully solve the task. In this sense, the hand trajectory of the human serves as a weak guiding signal during policy learning. The robot solves both straight-line-grasped and direction-change pushing after 8 iterations of trajectory optimization. Note that we increased the hand edge weight to provide stronger guidance from the hand during the direction-change task. TCN fails to solve any of the tasks, and in general performs very poorly in tasks where the robot does not have continuous contact with the object.

Task - Stacking

In this task, the robot needs to reach and grasp the yellow octagon, position it on top of the purple ring, and then release it, as shown in the 1st and 4th rows in Figure 3(c). As in our pushing experiments, we randomize the starting configurations of the scene and the robot, and evaluate our method on this augmented setup for stacking both the yellow octagon and the orange square. We report the results in Table 1(b). VEGs enable robust policy learning for the yellow octagon, and show generalization to objects with novel geometries that have not been observed during the demonstration. Being flatter than the octagon, the orange square is much harder to grasp, yet the robot is able to learn the skill successfully most of the time. The TCN baseline is not able to learn a successful grasping action, which requires precise and coordinated end-effector movement. We then evaluate TCN on a simpler task: stacking an already-grasped yellow octagon (simple-stacking in Table 1(b)). TCN performs reasonably well on this task, demonstrating its ability to imitate smooth trajectories but its difficulty with fine-grained grasping actions.

Task - Pouring

In this task, the robot needs to simultaneously translate and rotate the yellow can to reach the desired orientation above the mug, as shown in the 2nd row in Figure 3(b). Using the same pouring demonstration, we additionally evaluate the generalizability of VEGs by imitating with a novel object that has a different shape and texture (Figure 3(b), 3rd row). Trajectory optimization converges after 10 iterations, and solves the task successfully for all the objects. The TCN baseline is able to move the can along the trajectory, but fails to rotate it to the right configuration for successful pouring.


Imitating the human hand critically improves performance during reaching and grasping motions in human demonstrations, which cannot be easily inferred from any object motion. Our method relies on unoccluded areas of the scene, and currently makes no attempt to learn to keep track of occluded entities. In light of this, a clear avenue for future work is learning to track through occlusions, and extracting the reward graph from a temporal visual memory as opposed to relying on individual visual frames. Active vision can also be used both to undo occlusions during imitation and to observe the demonstrator from the most convenient viewpoint for the imitator. Another limitation is the lack of any model information, which would greatly accelerate policy search. An interesting avenue for future work would be learning a model over the motion of such visual entities, and using it as a prior for better exploration during policy search.

5 Conclusion

We proposed encoding images in terms of visual entities and their spatial relationships, and used this encoding to compute a perceptual cost function for visual imitation. Between end-to-end learning and engineered representations, we combine the best of both by incorporating important inductive biases regarding object fixation and motion saliency in our robot imitator. Experimental results on a real robotic platform demonstrate the generality and flexibility of our approach. Quoting the authors of [45], “just as biology uses nature and nurture cooperatively, we reject the false choice between ‘hand-engineering’ and ‘end-to-end’ learning, and instead advocate for an approach which benefits from their complementary strengths.”