Manipulation of deformable objects, such as ropes and cloth, is an important but challenging problem in robotics. Open-loop strategies for deformable object manipulation are often ineffective, since the material can shift in unpredictable ways. Perception of cloth and rope also poses a major challenge, since standard methods for estimating the pose of rigid objects cannot be readily applied to deformable objects, for which it is difficult to concretely define the degrees of freedom or provide suitable training data. Despite the numerous industrial and commercial applications that an effective system for deformable object manipulation would have, reliable methods for such tasks remain exceptionally difficult to construct. Previous work on deformable object manipulation has sought to use sophisticated finite element models [1, 3], hand-engineered representations [4, 5, 6, 7], and direct imitation of human-provided demonstrations [8, 9]. Direct model identification for ropes and cloth is challenging and brittle, while imitation of human demonstrations without an internal model of the object’s dynamics is liable to fail in conditions that deviate from those in the demonstrations.
In this work, we instead propose a learning-based approach to associate the behavior of a deformable object with a robot’s actions, using self-supervision from large amounts of data gathered autonomously by the robot. In particular, the robot learns a goal-directed inverse dynamics model: given a current state and a goal state (both in image space), it predicts the action that will achieve the goal. Once this model is learned, our method can use human-provided demonstrations as higher level guidance. In effect, the demonstrations tell the robot what to do, while the learned model tells it how to do it, combining high-level human direction with a learned model of low-level dynamics. Figure 1 shows an overview of our system.
Our method does not use any explicit parameterization of the rope configuration. Instead, we learn a model using raw images of the rope, which provides representational flexibility and avoids the need for manually specifying kinematic models that may fail to adequately capture the full variability of the deformable object. To handle high-dimensional visual observations, we employ deep convolutional neural networks for learning the inverse dynamics model.
Many interesting manipulation tasks require more than a single step to achieve a desired goal state. For example, tying a rope into a knot, stitching tissue during a surgery, and lacing shoes all involve multiple steps of manipulation.
Learning to perform these tasks is much harder than learning to perform small deformations, because only a very specific set of chained actions will result in success. While self-supervised learning of chained actions remains an open challenge, here we alleviate the problem by employing a small amount of imitation. At test time, the robot is provided a sequence of images depicting each step in the task. Because it has learned how to achieve small deformations between current and target images, it can follow the sequence and complete a multistep task.
One of the key challenges in self-supervised robot learning is collecting enough data for learning skilled behaviors, since the state and action spaces for practical manipulation tasks are extremely large. In comparison to rigid objects, deformable objects such as ropes can take on a much larger range of configurations, which means that self-supervised learning of rope manipulation requires large amounts of interaction data. To this end, we configured a Baxter robot to autonomously collect rope interaction data without any human intervention. To date, we have collected over 500 hours’ worth of data, which has been publicly released at the project website.
The main contribution of our work is to present a learning-based approach for rope manipulation that combines learned predictive models with high-level human-provided demonstrations. The predictive model is learned autonomously by the robot, using automatically collected data and self-supervision. This model allows the robot to understand how to manipulate a rope to reach target configurations. The human demonstrations, in the form of step-by-step images of a rope undergoing a manipulation task, can then be used at test time to indicate to the robot what should be done to the rope, and the learned model can then be used to determine how to do it. We evaluate our method on a Baxter robot trained on our dataset of over 500 hours of real-world rope manipulation, and demonstrate the ability of the system to arrange a rope into a variety of shapes using high-level human direction.
II Related Work
There has been significant recent interest in learning control policies directly from visual inputs using deep neural networks. Impressive results have been obtained on playing Atari games, on a large suite of robotic tasks in simulation [11, 12, 13, 14, 15, 16], and on real-world robotic manipulation tasks involving rigid objects in controlled settings [17, 18]. Given that state-of-the-art deep reinforcement learning algorithms are data hungry, some recent works learned to grasp [19, 20] and push real-world objects by collecting large amounts of robot interaction data.
Manipulating deformable objects has been of great interest to the robotics community. Prior works have considered problems such as surgical suturing [8, 9], towel folding, knot tying, and rope manipulation, among many others. Rope manipulation and knot tying are most closely related to our work. Inoue et al. investigated the problem of knot tying, and later works used motion planning, fixtures for open-loop execution, and robotic hands with tactile feedback. Morita et al. developed a system for tying knots from visual observations that makes use of knot theory to identify a sequence of knot states and then executes motor primitives to achieve these states. Wakamatsu et al. chose a sequence of robotic hand motion primitives from rope cross states inferred by a planner to achieve a desired knot from a given initial state of the rope. In contrast to these works, our goal is not to tie knots but to manipulate the rope into a general configuration by watching a human as she manipulates the rope. Our system does not require any rope-specific knowledge and is therefore generalizable to manipulating other deformable objects.
Schulman et al. used non-rigid registration for transferring human-demonstrated trajectories for rope manipulation. In the learning from demonstration (LFD) paradigm, an expert provides the robot with examples of observations and the associated motor trajectories used to perform the task. The robot then builds a model from this data that outputs a policy for achieving the same task at test time. Surveys of learning from demonstration are available in the literature. One drawback of the LFD approach is that, if a robot is to perform a wide suite of tasks, an expert must provide demonstrations for each task individually. If instead robots could learn to imitate human behavior by simply observing humans performing different tasks, they could learn much faster. Many past works have proposed solutions to learning from visual demonstrations [29, 30, 4].
The closest works to ours are Yang et al. and Kuniyoshi et al. Kuniyoshi et al. use a vision system to detect an object and the exact pose of human hands to produce a sequence of robotic actions that copies the human demonstration. Yang et al. predict object detections and grasp types that are then used to infer the robotic actions via an action parse tree. In contrast to these approaches, which deal only with rigid objects, we consider the significantly harder problem of manipulating a deformable object. In addition to integrating human demonstrations, we let the robot collect its own interaction data to build a predictive model of its actions, which can then be used to infer how the robot can imitate the visual demonstration provided by the expert.
III Experimental Setup
We use a Baxter robot for all experiments described in the paper. The robot manipulates a rope placed on a table in front of it using only one arm. The arm has a parallel jaw gripper that can rotate, open, and close. One end of the rope is tied to a clamp attached to the table. The robot receives visual inputs from the RGB channels of a Kinect camera. The setup is illustrated in Figure 1. The interaction of the robot with the rope is constrained to a single action primitive consisting of two sub-actions: pick the rope at location (x1, y1) and drop the rope at location (x2, y2), where (xi, yi) are pixel coordinates in the input RGB image. It is possible to manipulate the rope into many complex configurations using just this action primitive, as shown in Figure 2.
The robot collects data in a self-supervised manner by randomly choosing pairs of pick and drop points in the image. However, most randomly chosen image points will not lie on the rope. Instead, we use the point cloud from the Kinect to segment the rope and then choose a pick point uniformly at random from this segment. Once the pick point is chosen, the drop point is obtained as a displacement vector from the pick point. We represent this displacement vector by an angle θ and a length ℓ (in cm). Values of θ and ℓ are uniformly and randomly sampled from their respective ranges to obtain the drop point. After choosing the pick and drop points, the robot executes the following steps: (1) grasp the rope at the pick point, (2) move the arm 5 cm vertically above the pick point, (3) move the arm to a point 5 cm vertically above the drop point, and (4) release the rope by opening the gripper. The pair of current and next images obtained after executing the random action is used for training the inverse model described in Section IV.
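The random action sampling above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the length range and the pixels-per-cm calibration constant are assumed placeholders, since the exact sampling ranges are not given in the text.

```python
import math
import random

def sample_random_action(rope_pixels, min_len_cm=1.0, max_len_cm=15.0,
                         px_per_cm=4.0):
    """Sample a self-supervised action: a pick point chosen uniformly from
    the rope segmentation, plus a drop point given by a random angle and
    length relative to the pick point.

    rope_pixels: list of (x, y) pixel coordinates on the rope segment.
    min_len_cm, max_len_cm, px_per_cm: assumed placeholder values.
    """
    pick = random.choice(rope_pixels)               # uniform over the rope segment
    theta = random.uniform(0.0, 2.0 * math.pi)      # drop direction
    length = random.uniform(min_len_cm, max_len_cm) # drop distance in cm
    # Convert the (angle, length) displacement into pixel coordinates,
    # assuming a fixed overhead-camera scale.
    drop = (pick[0] + length * px_per_cm * math.cos(theta),
            pick[1] + length * px_per_cm * math.sin(theta))
    return pick, theta, length, drop
```

The pick/drop pair returned here corresponds to the two sub-actions of the action primitive described above.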
During autonomous data collection, it is very likely that the rope will fall off the table or drift out of reach of the robot and consequently halt the data collection process until a manual reset is performed. For continuous collection of data without human intervention, the robot performs a reset action either after every 50 actions or if there are fewer than 1000 pixels of rope in the robot’s workspace. The reset action detects the end of the rope and then pulls the rope in a fixed direction to make it straight. This system can run continuously for stretches of more than twenty hours without any manual intervention.
IV Method for Rope Manipulation
Our goal is to have the robot watch a human manipulate a rope and then reproduce this manipulation on its own. The human provides a demonstration in the form of a sequence of images of the rope in intermediate states toward a final goal state. Let V = (v_1, v_2, …, v_T) represent this sequence. The robot must execute a series of actions to transform v_1 into v_2, then v_2 into v_3, and so on until the end of the sequence.
A model that predicts the action relating a pair of input states is called an inverse dynamics model, described mathematically in Equation 1 below:

    u_t = F(I_t, I_{t+1})    (1)

where I_t and I_{t+1} are images of the current and next states and u_t is the action. We use convolutional neural networks to learn the inverse model F in a manner similar to [21, 32]. Details of training the inverse model and the process of imitating an observed visual demonstration are described in Sections IV-B and IV-C, respectively.
IV-A Neural Network Architecture Notation
Let the abbreviations Ck and Fk represent a convolutional (C) layer with k filters and a fully-connected (F) layer with k units, respectively. We use the ELU (Exponential Linear Unit) non-linearity after every convolutional/fully-connected layer, except for the output layer. As an example of our notation, C48-F500 refers to a network with 48 filters in the convolutional layer, followed by an ELU non-linearity and a fully-connected layer with 500 units.
IV-B Self-Supervised Learning of the Inverse Model
Our neural network architecture consists of two streams that transform each of the two input images into a latent feature space. The architecture of these streams is C96-C256-C384-C384-C256-C200, where the first five of the six layers have the same architecture as AlexNet. The weights of the two streams are tied. The latent representations of the two images, each of size 200, are concatenated and fed into another network with the architecture F200-F200. This subnetwork provides a joint non-linear feature representation of the current and next image that is used to predict the action.
For the purpose of training, we turn action prediction into a classification problem by discretizing the action space. The action is parameterized as a tuple (p_t, θ_t, ℓ_t), where p_t is the action (pick) location, θ_t is the action direction, and ℓ_t is the action length. Each dimension of this tuple is independently discretized: the action location is discretized onto a spatial grid, and the direction and length are discretized into 36 and 10 bins, respectively. The predicted action tuple is represented by discrete distributions over these bins. Treating action prediction as classification makes training the network easier and accounts for multimodality in the output distribution, which can occur when multiple actions move the rope from the same initial configuration to the same final configuration.
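The discretization can be sketched as below. The 20x20 spatial grid (400 cells), image size, and maximum length are assumed placeholders; only the 36 direction bins and 10 length bins are given in the text.

```python
import math

def discretize_action(pick_xy, theta, length,
                      grid=(20, 20), img_size=(200, 200),
                      n_theta_bins=36, n_len_bins=10, max_len=15.0):
    """Map a continuous action (pick location, direction, length) to the
    discrete class indices used for classification training.
    grid, img_size, and max_len are assumed placeholder values."""
    # Spatial bin of the pick location on a coarse grid over the image.
    gx = min(int(pick_xy[0] * grid[0] / img_size[0]), grid[0] - 1)
    gy = min(int(pick_xy[1] * grid[1] / img_size[1]), grid[1] - 1)
    loc_bin = gy * grid[0] + gx                      # one of grid[0]*grid[1] cells
    # Direction bin over [0, 2*pi) and length bin over [0, max_len].
    theta_bin = int((theta % (2 * math.pi)) / (2 * math.pi) * n_theta_bins) % n_theta_bins
    len_bin = min(int(length / max_len * n_len_bins), n_len_bins - 1)
    return loc_bin, theta_bin, len_bin
```

Each returned index becomes the class label for the corresponding softmax head during training.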
A naïve way of training the neural network would be to classify each of the three action elements independently. In order to model the joint discrete distribution of the action in a way that does not grow exponentially in size with the number of dimensions, we decompose the joint distribution as P(p_t, θ_t, ℓ_t | I_t, I_{t+1}) = P(p_t | I_t, I_{t+1}) P(θ_t | p_t, I_t, I_{t+1}) P(ℓ_t | θ_t, p_t, I_t, I_{t+1}). The neural network first predicts a distribution over 400 possible pick locations, and the argmax of this distribution is chosen as the pick location. A one-hot encoding of the pick location is passed along with the state features to predict a distribution over possible action directions θ_t. Then the argmax over the action direction distribution is one-hot encoded and concatenated with the one-hot encoding of the pick location and the state features to predict the action length ℓ_t.
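The greedy autoregressive decoding just described can be sketched as follows. Each `*_head` is a stand-in for a network output head (a function from features to class scores); the actual heads in the paper are fully-connected layers.

```python
def predict_action(joint_features, loc_head, theta_head, len_head):
    """Greedy decoding of the factored action distribution
    P(p) * P(theta | p) * P(l | theta, p): one-hot encodings of earlier
    argmax choices condition the later heads. The heads are hypothetical
    stand-ins for the learned network components."""
    def argmax(scores):
        return max(range(len(scores)), key=lambda i: scores[i])

    def one_hot(i, n):
        return [1.0 if j == i else 0.0 for j in range(n)]

    p_scores = loc_head(joint_features)
    p_hat = argmax(p_scores)                          # pick location bin
    p_vec = one_hot(p_hat, len(p_scores))

    theta_scores = theta_head(joint_features + p_vec) # conditioned on pick
    theta_hat = argmax(theta_scores)                  # direction bin
    theta_vec = one_hot(theta_hat, len(theta_scores))

    l_scores = len_head(joint_features + p_vec + theta_vec)
    l_hat = argmax(l_scores)                          # length bin
    return p_hat, theta_hat, l_hat
```

This mirrors the factorization in the text while keeping each head's output space small.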
We initialize the first five layers of the network using pre-trained AlexNet weights obtained by training for image classification on the ImageNet challenge. For the first 5K iterations we set the learning rate of these convolutional layers to 0. For the rest of the training we set their learning rate to 1e-4 and use the Adam optimizer. All other layers are initialized with small normally distributed weights and trained from the first iteration with a learning rate of 1e-4. Approximately 60K pairs of before and after images, collected autonomously by the robot, were used for training the inverse model. A separate validation set of 2.5K before/after image pairs was used for hyper-parameter tuning.
IV-C Imitating Human Demonstration
With the help of the inverse model, the robot can deform the rope by small amounts. Specific sequences of such small deformations can be used to manipulate the rope in different ways. However, learning a chain of actions that manipulates the rope into configurations of interest, such as knots, is non-trivial. In this work we mitigate this challenge by using visual demonstrations from a human. The robot receives as input a sequence of images depicting each stage of the manipulation performed by a human demonstrator to reach a desired rope configuration from an initial one. We denote this sequence of demonstration images as V = (v_1, v_2, …, v_T), where v_T depicts the goal.
Let I_0 be the initial image of the world observed by the robot and v_1 the first goal image. The robot inputs the pair (I_0, v_1) into the learned inverse model and executes the predicted action. Let I_1 be the visual state of the world after the action is executed. The robot then inputs (I_1, v_2) into the inverse model and executes the output action. This process is repeated iteratively for T time steps. In some cases the robot predicts a pick location that does not lie on the rope; for these cases we use the rope segmentation to find the point on the rope closest to the predicted pick location and execute the pick primitive there.
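The demonstration-following loop can be sketched as below. `inverse_model`, `execute`, and `snap_to_rope` are hypothetical stand-ins for the learned network, the robot's pick-and-drop primitive (which returns the new observation), and the segmentation-based projection of off-rope pick points.

```python
def imitate_demonstration(initial_image, demo_images, inverse_model,
                          execute, snap_to_rope=None):
    """Greedily follow a human demonstration: repeatedly feed the pair
    (current image, next demo image) to the inverse model and execute
    the predicted action, observing the resulting state."""
    current = initial_image
    for goal in demo_images:
        action = inverse_model(current, goal)
        if snap_to_rope is not None:
            # Project an off-rope pick location onto the rope segment.
            action = snap_to_rope(action, current)
        current = execute(action)  # robot acts; returns the new observation
    return current
```

A toy usage with integers standing in for images: if the model predicts the "difference" between states and execution applies it, the loop reaches the final goal.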
IV-D Active Data Collection
With randomized data collection it is unlikely that the robot will place the rope in interesting configurations, so the resulting model may not be accurate at tasks such as knot tying that require moving the rope into complex configurations. In order to bias data collection towards interesting configurations, we collected a set of 50 images of the rope manually arranged into random configurations (the goal buffer). We then used a model trained with 30K randomly collected data points: instead of randomly sampling an action, we randomly sampled an image from the goal buffer and set it as the goal image. We passed the current and goal images into the inverse model and used the action predicted by the inverse model for data collection.
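The active sampling step can be sketched in a few lines. `inverse_model` is a stand-in for the network trained on the first 30K random interactions.

```python
import random

def sample_active_action(current_image, goal_buffer, inverse_model):
    """Active data collection: instead of a random action, sample a goal
    image from a buffer of manually arranged 'interesting' rope
    configurations and return the action the current inverse model
    predicts for reaching it."""
    goal = random.choice(goal_buffer)
    return inverse_model(current_image, goal)
```

Executing this action and logging the resulting before/after pair yields training data biased toward the configurations in the goal buffer.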
V Evaluation Procedure
The performance of the robot was evaluated by tasking it to reconfigure the rope from a given initial configuration into target configurations of varying complexity depicting “L”, “S”, “W” shapes and a knot.
The performance was quantitatively measured by calculating the distance between the rope configurations in the sequence of images provided by the human demonstration and the sequence of images achieved by the robot after executing the actions from the inverse dynamics model. The distance between two rope configurations is computed by first segmenting the rope in each image, then aligning points in the two segmentations with thin plate spline robust point matching (TPS-RPM), and finally calculating the mean pixel distance between the matched points. We compare the performance of our method against a hand-engineered baseline, a nearest neighbor baseline, and our method without imitation (see Section V-A). Videos of the self-supervised data collection, the demonstrations, and autonomous executions are available at https://ropemanipulation.github.io/
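Once TPS-RPM has produced the correspondence, the metric itself reduces to a mean of pixel distances. A minimal sketch, assuming the point lists are already matched index-to-index by the registration:

```python
def mean_matched_distance(points_a, points_b):
    """Mean pixel distance between corresponding rope points. In the
    paper the correspondence comes from TPS-RPM registration of the two
    rope segmentations; here matching is assumed to be done already."""
    assert len(points_a) == len(points_b) and points_a
    total = 0.0
    for (xa, ya), (xb, yb) in zip(points_a, points_b):
        total += ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5  # Euclidean distance
    return total / len(points_a)
```

Lower values of this metric indicate that the achieved rope configuration is closer to the demonstrated one.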
Hand-Engineered baseline: The first baseline we compare against is a hand-engineered method that takes as input the sequence of images from the human demonstration. To infer which action should be executed to transform the rope from its configuration in one demonstration image into its configuration in the next, we first segment the rope in both images and use TPS-RPM to register the segments. In the absence of a model of rope dynamics, a simple way to move the rope toward a target configuration is to pick the rope at the point with the largest deformation in the first image relative to the second and then drop the rope at the corresponding point in the second image. As the point with the largest distance may be an outlier, we instead use the point at a fixed high percentile of the deformation distances for the pick action.
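The outlier-robust pick selection can be sketched as follows. The default percentile value here is an assumed placeholder, since the exact value is elided in the text.

```python
def percentile_pick(points, deformations, pct=0.9):
    """Hand-engineered baseline helper: pick the rope point at a high
    percentile of the per-point deformation distances rather than the
    maximum, for robustness to registration outliers.
    pct=0.9 is an assumed placeholder value."""
    # Sort point indices by their deformation distance, ascending.
    order = sorted(range(len(points)), key=lambda i: deformations[i])
    # Index into the sorted order at the requested percentile.
    idx = order[min(int(pct * len(points)), len(points) - 1)]
    return points[idx]
```

For example, with three points and `pct=0.5` this returns the point with the median deformation rather than the largest.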
Nearest Neighbor baseline: To evaluate whether the neural network has simply memorized the training data, we compare our method to a nearest neighbor baseline. Given the current image and the target image in the human demonstration, the pair of images in the training set closest to this pair is found, and the ground truth action used to collect that training sample is executed by the robot. As the distance metric for the nearest neighbor computation, we use Euclidean distance in raw RGB space after down-sampling the images.
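The nearest neighbor lookup can be sketched as below, with images represented as flat lists of (downsampled) RGB values and the training set as `(before, after, action)` tuples. The combined before+after distance is an assumption about how the pair comparison is aggregated.

```python
def nearest_neighbor_action(query, training_set):
    """Nearest-neighbor baseline: return the logged action of the
    training pair whose (before, after) images are closest, in raw-pixel
    Euclidean distance, to the query (current, target) pair."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    cur, target = query
    # Sum the distances of the 'before' and 'after' images (assumed aggregation).
    best = min(training_set,
               key=lambda ex: dist(ex[0], cur) + dist(ex[1], target))
    return best[2]
```

Because the training action was recorded during self-supervised collection, replaying it requires no model inference at all, which is what makes this a memorization check.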
No Imitation baseline: To understand how critical imitation is for manipulating the rope into a desired shape, we evaluate the learned model when fed only the initial and goal images, without any intermediate steps. The inverse model takes these images as input, the robot executes the predicted action, and a new image is obtained. Next, this new image and the goal image are fed into the inverse model to infer the next action, and the process is repeated iteratively for the same number of steps as in the human demonstration.
[Table: Knot Tying Success Rate]
Figure 4 qualitatively shows that, using the learned inverse dynamics model, the robot is capable of re-configuring the rope into many different shapes. It can also be seen that the robot is more accurate at manipulation when the rope is not sharply bent.
Figure 5 compares the performance of our method against the hand-engineered, nearest neighbor and no-imitation baselines described in Section V-A. Each subfigure shows the performance for a different target configuration. The x-axis in each subfigure corresponds to the number of intermediate images that were provided to the robot via demonstration. The y-axis corresponds to the TPS-RPM distance between the rope configuration achieved by the robot and the corresponding rope configuration in the actual demonstration. Lower values indicate better performance.
For every sequence, mean accuracy and standard deviation are reported across 10 different repeats of two human demonstration sequences. The results demonstrate that our method outperforms various baselines including a hand-engineered baseline (i.e. TPS-RPM baseline), indicating that through self-supervision, the robot has learned a dynamics model of rope motion that is useful for manipulation.
VI-A Scaling with amount of data
Our results show that a model trained with 60K data points, of which 30K were collected using random sampling and the other 30K using active sampling, significantly outperforms a model trained with only 30K data points. Our method can successfully tie knots 38% of the time. Because we lack 60K randomly collected training points, we cannot yet conclude how useful active data sampling is compared to random sampling; we will include these results in the next revision of our paper.
VI-B Importance of Imitation
How important are human demonstrations for manipulating the rope into a desired configuration? Results in Figure 5 show that when the robot was provided only the initial and final images, instead of all the images in the human demonstration sequence, performance was significantly worse. Figure 6 further shows that without imitation the robot is able to tie the knot only 11 out of 50 times, compared to 19 out of 50 times with imitation.
VI-C Generalization to other ropes
We tested whether the learned model can manipulate new ropes by qualitatively evaluating its performance on a white jute rope that was significantly stiffer than the red training rope, and on a black cotton rope that was significantly thinner and less stiff. We found that the robot was able to configure these ropes into the relatively simpler “L” and “S” shapes. Even though the model was trained using interaction data from a single rope, it generalizes to other ropes: instead of learning features specific to the particular rope used for data collection, our model learns features that transfer across ropes.
One possible reason that the robot is unsuccessful at manipulating the white rope into the more curvy “W” and knot configurations is that this rope is too stiff to bend into the curves necessary for forming the “W” and the knot. With the black rope, the robot was able to manipulate the rope up to the penultimate step of tying a knot, but failed at completing the knot. We also ran experiments in which we changed the background from green to white and found that our model was unsuccessful at rope manipulation. This result is not surprising, because all our training data were collected against a single background; we expect that with training data collected on diverse backgrounds, our method would generalize to novel backgrounds. Video demonstrations of these experiments are available at the project website.
VII Discussion and Future Work
In this paper, we presented a learning-based method for flexible manipulation of deformable objects such as ropes. Our method is based around self-supervision: we train a predictive model of rope behavior using data of rope interaction collected autonomously by the robot. This model predicts, given the current and target image of a rope, which action the robot can execute to put the rope into the target configuration. We combine our method with human demonstrations by allowing a human user to supply step-by-step images that show the rope undergoing a variety of manipulations. These demonstrations tell the system what it should do with the rope at test time, while the learned model is used to determine how to do it. Our experiments indicate that we can use our learned model to arrange the rope into a variety of different shapes using only high-level demonstration images.
One limitation of our approach is that, in its current form, it cannot learn to manipulate new objects exclusively by watching human demonstrations, since performing a manipulation task requires a model that can effectively predict the motion of the object, and this model is learned from the robot’s own experience. In principle, this limitation could be overcome simply by collecting data from a large number of object manipulation scenarios, so as to learn a single model that generalizes effectively across objects. A more nuanced approach might involve correlating the behavior of objects in the human demonstrations with other previously experienced manipulations, so as to put them into correspondence and infer the behavior of an object for which prior experience is unavailable. In both cases, lifting this limitation is a promising direction for future work.
Although we demonstrate a variety of rope manipulation scenarios, our experiments are limited in scope, primarily due to limits on the amount of data that we can collect in reasonable time on a single robotic platform. For instance, most of our experiments use a single rope on a single background. Although we demonstrate that our model can successfully manipulate new ropes with significantly different material and texture properties into simple configurations, it fails to manipulate them into more complex configurations. If provided with substantially more robot-hours and a greater variety of ropes and environments, a similar model could in principle learn a more generalizable notion of rope manipulation. State-of-the-art results in image recognition and segmentation indicate that deep convolutional neural networks have the capacity to generalize to a wide range of scenes and objects when provided with enough data, which suggests that the same kind of generalization could in principle be obtained by predictive models of deformable objects given enough data.
VIII Supplementary Materials and Videos
Supplementary materials and videos of the robot’s performance can be found at https://ropemanipulation.github.io/.
This work was supported in part by NSF CAREER award (#IIS-1351028), NSF SMA-1514512, ONR MURI N00014-14-1-0671, and ONR YIP.
-  J. E. Hopcroft, J. K. Kearney, and D. B. Krafft, “A case study of flexible object manipulation,” The International Journal of Robotics Research, vol. 10, no. 1, pp. 41–50, 1991.
-  S. Miller, M. Fritz, T. Darrell, and P. Abbeel, “Parametrized shape models for clothing,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 4861–4868.
-  M. Bell, “Flexible object manipulation,” Ph.D. dissertation, Dartmouth College, Hanover, New Hampshire, 2010.
-  Y. Kuniyoshi, M. Inaba, and H. Inoue, “Learning by watching: Extracting reusable task knowledge from visual observation of human performance,” IEEE transactions on robotics and automation, vol. 10, no. 6, pp. 799–822, 1994.
-  H. Wakamatsu, E. Arai, and S. Hirai, “Knotting/unknotting manipulation of deformable linear objects,” The International Journal of Robotics Research, vol. 25, no. 4, pp. 371–395, 2006.
-  T. Morita, J. Takamatsu, K. Ogawara, H. Kimura, and K. Ikeuchi, “Knot planning from observation,” in Robotics and Automation, 2003. Proceedings. ICRA’03. IEEE International Conference on, vol. 3. IEEE, 2003, pp. 3887–3892.
-  J. Maitin-Shepard, M. Cusumano-Towner, J. Lei, and P. Abbeel, “Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 2308–2315.
-  H. Mayer, F. Gomez, D. Wierstra, I. Nagy, A. Knoll, and J. Schmidhuber, “A system for robotic heart surgery that learns to tie knots using recurrent neural networks,” Advanced Robotics, vol. 22, no. 13-14, pp. 1521–1537, 2008.
-  J. Schulman, A. Gupta, S. Venkatesan, M. Tayson-Frederick, and P. Abbeel, “A case study of trajectory transfer through non-rigid registration for a simplified suturing scenario,” in IROS, 2013.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” arXiv preprint arXiv:1602.01783, 2016.
-  K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik, “Learning visual predictive models of physics for playing billiards,” arXiv preprint arXiv:1511.07404, 2015.
-  N. Wahlström, T. B. Schön, and M. P. Deisenroth, “From pixels to torques: Policy learning with deep dynamical models,” CoRR, vol. abs/1502.02251, 2015.
-  M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images,” in Advances in Neural Information Processing Systems, 2015, pp. 2728–2736.
-  J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh, “Action-conditional video prediction using deep networks in atari games,” arXiv preprint arXiv:1507.08750, 2015.
-  S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” CoRR, vol. abs/1504.00702, 2015. [Online]. Available: http://arxiv.org/abs/1504.00702
-  I. Lenz, R. Knepper, and A. Saxena, “Deepmpc: Learning deep latent features for model predictive control,” in RSS, 2015.
-  L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” arXiv preprint arXiv:1509.06825, 2015.
-  S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” arXiv preprint arXiv:1603.02199, 2016.
-  P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to poke by poking: Experiential learning of intuitive physics,” arXiv preprint arXiv:1606.07419, 2016.
-  H. Inoue and M. Inaba, “Hand-eye coordination in rope handling,” vol. 1, 1985.
-  M. Saha and P. Isto, “Motion planning for robotic manipulation of deformable linear objects,” in Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006. IEEE, 2006, pp. 2478–2484.
-  Y. Yamakawa, A. Namiki, M. Ishikawa, and M. Shimojo, “One-handed knotting of a flexible rope with a high-speed multifingered hand having tactile sensors,” in 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2007, pp. 703–708.
-  R. H. Crowell and R. H. Fox, Introduction to knot theory. Springer Science & Business Media, 2012, vol. 57.
-  J. Schulman, J. Ho, C. Lee, and P. Abbeel, “Generalization in robotic manipulation through the use of non-rigid registration,” in Proceedings of the 16th International Symposium on Robotics Research (ISRR), 2013.
-  H. Chui and A. Rangarajan, “A new algorithm for non-rigid point matching,” in CVPR, 2000.
-  B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and autonomous systems, vol. 57, no. 5, pp. 469–483, 2009.
-  K. Lee, Y. Su, T.-K. Kim, and Y. Demiris, “A syntactic approach to robot imitation learning using probabilistic activity grammars,” Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1323–1334, 2013.
-  Y. Yang, Y. Li, C. Fermüller, and Y. Aloimonos, “Robot learning manipulation action plans by “watching” unconstrained videos from the world wide web,” in AAAI, 2015, pp. 3686–3693.
-  L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
-  L. Pinto, D. Gandhi, Y. Han, Y.-L. Park, and A. Gupta, “The curious robot: Learning visual representations via physical interactions,” arXiv preprint arXiv:1604.01360, 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 2015.
-  M. Huh, P. Agrawal, and A. A. Efros, “What makes imagenet good for transfer learning?” arXiv preprint arXiv:1608.08614, 2016.