The 21st century has seen considerable growth in robotics technologies, with more advanced robots being developed every day. From the room-cleaning robot Roomba to the advanced surgical robot da Vinci, robots are making their way into human lives. It is widely expected that robots will replace humans in many daily labor-intensive and repetitive tasks in the near future.
But one of the major challenges faced by even the most advanced robotic technologies lies in their inability to learn new tasks and skills the way human beings do, i.e., from experience and from demonstrations. The traditional method of teaching a robot to perform a new task is to program each aspect of the task line by line, but it is not practical to program each and every task in such a detailed manner. The real solution to this problem lies in developing learning technologies that allow robots to learn and then act just like human beings, i.e., from examples and demonstrations. A three-year-old child is taught something new by its parents not through algorithmic programming but through simple demonstrations: the child is shown what to do and learns to imitate the behavior.
We term this the observation learning problem. In observation learning, the robot observes a demonstrator performing a particular task and then learns to perform the task from these observations alone. The novelty of this approach lies in the learning process: the robot learns a new task by only observing a demonstration, without being explicitly programmed, thereby acquiring human-like learning capabilities. A detailed formulation of the problem of observation learning is given in the next two sections.
Observation learning has close similarities to the imitation learning problem (also called apprenticeship learning or programming by demonstration), which has been studied extensively using techniques such as inverse reinforcement learning and behavioral cloning. In the literature these two problems are not clearly distinguished, and the terms are even used interchangeably. But observation learning clearly differs from imitation learning in that, in imitation learning, the learner (the robot) observes the demonstrator from an egocentric (first-person) view, whereas in observation learning the learner can only observe the demonstrator from an allocentric (third-person) view. Hence the problem of observation learning has to be treated separately from imitation learning, and separate tools and techniques have to be developed for solving it efficiently. In the literature, observation learning is also referred to as third-person imitation learning and imitation from observation, but in this article we coin the term observation learning to differentiate it from imitation learning and to indicate the unique nature of the problem. In the following sections we explain the problem of observation learning in detail, along with its mathematical formulation and a literature survey.
Observation learning can be defined as the process of learning to perform a new task by visually observing demonstrations of that task by a demonstrator. It aims at mimicking the process by which humans learn new skills. The general setup of an observation learning setting is shown in Fig. 1.
The observation learning problem involves two entities and two processes. The two entities are the demonstrator and the learning agent:
Demonstrator: the entity that performs the demonstration of the activity/task.
Learning agent: the entity that learns a particular task by observing the demonstrator.
The two processes involved are the observation and learning processes:
Observation: the process by which the learning agent views the demonstrator. It is desirable that the learning agent does not capture any predefined features, such as the hand motion of the demonstrator, but rather only the raw image/video input. This models the problem of observation learning more closely on the human learning process, where no explicit feature tracking is used.
Learning: the process by which the learning agent learns from the demonstrations it has observed. Once it has learned, the learning agent performs the same action. The learning process primarily involves understanding the mapping from the observations made by the learning agent to the control variables of the robotic manipulator that are required to replicate the observed task. Theories about the human learning process from cognitive science and neuroscience could be incorporated at this stage for designing advanced learning processes for the learning agent.
Let the set $V = \{v_1, v_2, \dots, v_T\}$ represent a video sequence of the demonstration example performed by a human, where $v_t$ denotes the $t$-th video frame and $T$ denotes the length of the video sequence. Let $C$ be the vector of control variables representing the joint configuration, torsion, speed and acceleration of the robotic manipulator, which determines its movements.

The solution to the observation learning problem has to learn a mapping $M: V \rightarrow C$, where $V$ is the video demonstration by the human demonstrator and $C$ is the vector of control variables of the robotic arm.

Since a direct mapping is difficult to learn, it is desirable to first learn a mapping to an intermediate, viewpoint-invariant feature space $F$, and then map from that feature space to the control variable space $C$ of the robot.

So the original mapping $M$ can be split into two: $M_1: V \rightarrow F$ and $M_2: F \rightarrow C$. The mapping $M_1$ first maps the video $V$ to the viewpoint-invariant space $F$. The specialty of the feature space $F$ is that each point in it represents a particular task independent of the viewpoint. So if $V_1$ and $V_2$ are videos of the same action taken from two viewpoints, the mapping $M_1$ should map both $V_1$ and $V_2$ to the same point in $F$, i.e. $M_1(V_1) = M_1(V_2)$.

Once this feature representation in $F$ is obtained for an action, the mapping $M_2$ maps the point representing the action to the control variable space of the robotic manipulator, giving the joint configurations, speed, torsion and acceleration required by the manipulator to perform the same action.

Thus, by learning the two mappings $M_1$ and $M_2$, the system effectively learns to perform an action just by observing the demonstrator. Fig. 2 gives a pictorial illustration of these mappings.
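The two-stage mapping above can be sketched in a few lines. This is a minimal illustration only, not an implementation from any of the cited works: the fixed random projections stand in for trained networks, and the dimensions (128-dimensional frame features, a 64-dimensional invariant space, a 7-DoF manipulator) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed random projections standing in for trained networks.
W1 = rng.standard_normal((128, 64))  # M1: frame features -> invariant space F
W2 = rng.standard_normal((64, 7))    # M2: F -> 7-DoF control variables

def m1(video):
    """M1: map a video (T x 128 frame features) to a single point in the
    viewpoint-invariant feature space F by projecting and temporal pooling."""
    return np.tanh(video @ W1).mean(axis=0)  # shape (64,)

def m2(feature):
    """M2: map a point in F to the control variables of the manipulator."""
    return feature @ W2                      # shape (7,)

def m(video):
    """Full mapping M = M2 o M1: demonstration video -> joint controls."""
    return m2(m1(video))

demo = rng.standard_normal((30, 128))        # a 30-frame demonstration
controls = m(demo)
```

In this sketch, viewpoint invariance of `m1` would have to come from training (e.g., by penalizing embedding distance between synchronized views); a random projection has no such property by itself.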
This section presents a summary of existing methods for tackling the observation learning problem. As described in the introduction, observation learning differs considerably from imitation learning, and hence imitation learning methods are not included in this survey. In general, observation learning can be divided into two categories:
Implicit observation learning
Assisted observation learning
Implicit observation learning
Implicit observation learning is observation learning from the raw visuals of the demonstrations performed by the demonstrator. The learning algorithm employed by the learning agent is responsible for extracting the required information/features from the visuals of the demonstrations in order to learn the task being demonstrated.
Stadie et al. present a generative adversarial network (GAN) based approach for tackling the implicit observation learning problem. The method is an extension of the work by Ho and Ermon to third-person settings. It casts the observation learning problem as an inverse reinforcement learning (IRL) problem and uses the basic idea of GANs to solve this IRL problem.
In this paper, a GAN-based method is used to infer the reward function, which is then used to train the learning agent. The discriminator in the GAN setting tries to distinguish between two sets of observations received by the learning agent: third-person observations of the actions performed by the demonstrator, and egocentric observations of the actions performed by the learning agent itself. The inverse of the discriminator loss serves as the reward for the reinforcement learning algorithm employed by the learning agent, which tries to reduce this loss (thereby maximizing the reward). As the loss decreases, the difference between the actions of the learning agent and the demonstrator decreases, and the learning agent thereby acquires the desired behavior.
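How a discriminator can be turned into a reward signal can be sketched as follows. This is a hedged toy illustration, assuming a linear discriminator with made-up names; the actual method trains a deep discriminator jointly with the policy.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy linear discriminator weights (illustrative stand-in for a trained network).
w = rng.standard_normal(16) * 0.1

def discriminator(obs):
    """Probability that an observation came from the demonstrator."""
    return 1.0 / (1.0 + np.exp(-obs @ w))

def agent_reward(agent_obs):
    """Reward for the learning agent: the more the discriminator is fooled
    into labelling the agent's egocentric observation as demonstrator
    footage, the higher the reward (the inverse of the discriminator loss)."""
    eps = 1e-8  # numerical safety for the logarithm
    return -np.log(1.0 - discriminator(agent_obs) + eps)

obs = rng.standard_normal(16)
r = agent_reward(obs)
```

Maximizing this reward with an RL algorithm drives the agent's observations toward the region the discriminator classifies as demonstrator behavior.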
A similar approach is presented by Liu et al. for tackling the problem of implicit observation learning. Here the observation learning problem is again treated as an inverse reinforcement learning problem, in which third-person observations of the demonstrator are given and the learning agent has to infer a reward function from them. The reward function is extracted through a process of context translation. The framework of context translation is shown in Fig. 3.
The proposed method can be divided into three stages:
Context translation: the first stage. The third-person observation of the demonstrator made by the learning agent is converted into the egocentric view of the learning agent using an encoder-decoder network. This is marked as stage 1 in Fig. 3.
Obtaining the reward function: in this stage the reward function is inferred. The egocentric view of the learning agent performing the action is compared with the context-translated view of the demonstrator. The similarity between the features extracted from these two images is calculated and taken as the reward. The same decoder used in context translation is used to extract the features for calculating this similarity. The hypothesis is that as the learning agent tries to increase the similarity of its actions to those of the demonstrator, it will learn to perform similar actions and thereby learn the demonstrated task.
Reinforcement learning using the obtained reward function: once the reward function is extracted, it is used to train the reinforcement learning algorithm employed by the learning agent. The RL algorithm finds a policy that maximizes the reward obtained in stage 2.
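The stage-2 reward can be sketched as a feature-space similarity. In this minimal illustration a fixed random projection stands in for the translation model's learned features, and all names are hypothetical.

```python
import numpy as np

def features(frame, W):
    """Feature extractor standing in for the translation model's learned
    features (W is an illustrative fixed projection, not a trained decoder)."""
    return np.tanh(frame @ W)

def translation_reward(translated_demo_frame, agent_frame, W):
    """Per-timestep reward: negative squared feature distance between the
    context-translated demonstration frame and the agent's own egocentric
    frame -- the more similar the behavior, the higher the reward."""
    diff = features(translated_demo_frame, W) - features(agent_frame, W)
    return -float(np.sum(diff ** 2))

rng = np.random.default_rng(2)
W = rng.standard_normal((32, 8))
demo_t = rng.standard_normal(32)
# An agent frame identical to the translated demo frame gets the maximum reward, 0.
best = translation_reward(demo_t, demo_t, W)
```

Since the reward is never positive, the RL stage effectively minimizes the feature distance between the agent's rollout and the translated demonstration.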
But these methods have several drawbacks. The GAN-based method is only studied in simulated RL experiments, and there is a large domain shift between the simulated world and the real world. Similarly, the context translation method lacks credibility in real-world conditions: even though a few real-world experiments are presented in the paper, they are conducted in highly constrained environments. It also requires a large number of demonstrations to learn a single task, which are difficult to collect. In general, it is also difficult to get reinforcement learning working when high-dimensional inputs like raw images are fed as observations. A possible solution is to use non-RL-based methods.
Sermanet et al. present an entirely different approach to the problem of implicit observation learning. This method does not use reinforcement learning like the GAN-based method or the context translation method; rather, it uses an autoencoder-style method to learn a direct mapping from observations of the human demonstrator to the joint angles of the robotic manipulator. Fig. 4 shows the general framework of the proposed method.
The observation encoder encodes the raw RGB observations made by the system into a vector in the feature space. The joints decoder decodes this embedding into joint configurations, which can be fed to the learning agent.
The proposed method employs three supervision signals for learning this observation-to-joint-angle mapping, described below:
Time-contrastive supervision: time-contrastive (TC) supervision is used to learn a viewpoint-invariant embedded feature representation of the observations made by the learning agent. The viewpoint invariance helps the system learn the action being performed independent of the viewpoint of observation. TC supervision employs a triplet loss that brings similar actions from different viewpoints together in the embedded space.
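The triplet loss used by TC supervision can be written in a few lines: an anchor frame from one viewpoint is pulled toward the co-occurring frame from another viewpoint and pushed away from a temporally distant frame. The margin value here is illustrative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Time-contrastive triplet loss on embedding vectors: the embedding of
    a frame from one viewpoint (anchor) should be closer to the co-occurring
    frame from another viewpoint (positive) than to a temporally distant
    frame (negative), by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(0.0, float(d_pos - d_neg + margin))
```

During training this loss is summed over many sampled triplets and minimized with respect to the encoder parameters, which is what shapes the viewpoint-invariant embedding space.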
Human supervision: human supervision is used to learn the mapping between third-person observations of a human demonstrator performing an action and the joint configurations of the robot while performing similar actions. This helps the system understand a mapping that is crucial for observation learning, i.e., the mapping between human motion and robotic motion. This mapping is important because of the differences in the structural configurations of the human body and the robotic body. Fig. 5 illustrates this characteristic.
Self-supervision: self-supervision is used to learn the mapping between the joint angles of the robotic manipulator and third-person views of itself. This helps the system learn the direct correspondence between joint angle movements and third-person observations of itself performing the action. Fig. 6 demonstrates this supervision process.
The overall training process alternates between these three supervision signals to learn an efficient mapping from the learning agent's observations of the demonstrator to the learning agent's own joint angles.
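The alternating schedule can be sketched generically; the update functions below are placeholders for the three supervision updates, not the paper's actual training steps.

```python
def train(steps, update_tc, update_human, update_self):
    """Cycle through the three supervision updates, one per training step:
    time-contrastive, human supervision, then self-supervision, repeating.
    Each update callable performs one gradient step and returns a tag/loss."""
    schedule = [update_tc, update_human, update_self]
    history = []
    for step in range(steps):
        update = schedule[step % len(schedule)]
        history.append(update())
    return history

# Placeholder updates that just report which signal was used.
log = train(6, lambda: "tc", lambda: "human", lambda: "self")
```

In practice each callable would sample a batch for its supervision signal and step a shared encoder, so the three objectives shape one common embedding.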
This method works comparatively better than the previous ones in real-world conditions and in much less constrained environments. However, the training process is rather difficult. It requires training videos in which the human demonstrator imitates the actions performed by the robotic manipulator, so that the system can learn the human joint to robot joint mapping. It is practically difficult to collect such samples, as the human needs to imitate the actions performed by the robot in exactly the same way. This exact imitation of a robot's movements is time consuming and very tedious, which limits the applicability of the method.
Assisted observation learning
In assisted observation learning, observation learning is performed by tracking a set of predefined features of the demonstrator during the demonstration process. The advantage of this class of methods is that the learning algorithm need not be intelligent enough to extract features by itself, as the features to be extracted are hand-engineered by the programmer. This makes the task much simpler. The tracked features could be the hand movements of the demonstrator or predefined markers.
In the method proposed by Hamabe et al., assisted observation learning is performed by tracking the hand movements of the demonstrator using the skeleton data obtained from a Kinect RGB-D camera. This motion is then used to divide the demonstrated task into simpler components and to learn the task. In the method by Gupta et al., a pair of LEDs is used to track the motions in the demonstrations; learning is then performed on the trajectories tracked using these LED markers. In a third method, pose detection is employed and the tracked features are the poses of the demonstrator. These poses are then used to find a mapping between the action performed by the demonstrator and the joint configurations of the robotic manipulator.
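To illustrate the step these methods share, converting a tracked marker position to a joint configuration, here is inverse kinematics for a toy planar two-link arm. This is not the computation used in any of the cited works, just a minimal, self-contained example of the feature-to-joint mapping.

```python
import numpy as np

def two_link_ik(x, y, l1=1.0, l2=1.0):
    """Toy inverse kinematics for a planar 2-link arm: map a tracked marker
    position (x, y) to joint angles (theta1, theta2). Real systems use full
    manipulator models; link lengths l1, l2 are illustrative."""
    r2 = x * x + y * y
    # Law of cosines for the elbow angle; clip guards against unreachable points.
    c2 = np.clip((r2 - l1 * l1 - l2 * l2) / (2 * l1 * l2), -1.0, 1.0)
    theta2 = np.arccos(c2)
    theta1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(theta2),
                                           l1 + l2 * np.cos(theta2))
    return theta1, theta2

# Map a tracked trajectory of marker positions to a joint-angle trajectory.
trajectory = [(2.0, 0.0), (1.0, 1.0), (0.0, 1.5)]
joints = [two_link_ik(x, y) for (x, y) in trajectory]
```

Applied frame by frame to a tracked trajectory, this yields the kind of joint-space trajectory that a learner in the assisted setting can then generalize from.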
The downside of this category of methods is that it is not always possible to extract hand-engineered features for every set of tasks: features chosen for one set of tasks need not work on another, and vice versa. Moreover, humans do not perform observation learning by tracking a set of predefined features; the human cognitive system is powerful enough to learn features automatically from the raw visuals of the demonstrator.
This article has explained in detail the problem of observation learning, along with its mathematical formulation and a literature survey. It is one of the important puzzles in robotics that has to be solved to create the next generation of consumer robotic systems. Equipping robots with observation learning methods will make them capable of learning new skills by only visually observing a demonstrator. Even though researchers have been addressing this problem for several years and have come up with several ways of tackling it, the field is still in its infancy and has a lot of room for development, especially in the implicit branch of the observation learning problem. More advanced algorithms and methods have to be developed to give robotic systems a human-like ability to learn new skills and tasks.
- Forlizzi and DiSalvo  Jodi Forlizzi and Carl DiSalvo. Service robots in the domestic environment: a study of the roomba vacuum in the home. In Proceedings of the 1st ACM SIGCHI/SIGART conference on Human-robot interaction, pages 258–265. ACM, 2006.
- Leven et al.  Joshua Leven, Darius Burschka, Rajesh Kumar, Gary Zhang, Steve Blumenkranz, Xiangtian Donald Dai, Mike Awad, Gregory D Hager, Mike Marohn, Mike Choti, et al. Davinci canvas: a telerobotic surgical system with integrated, robot-assisted, laparoscopic ultrasound capability. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 811–818. Springer, 2005.
- Pastor et al.  Peter Pastor, Heiko Hoffmann, Tamim Asfour, and Stefan Schaal. Learning and generalization of motor skills by learning from demonstration. 2009. URL http://h2t.anthropomatik.kit.edu/pdf/Pastor2009.pdf.
- Rahmatizadeh et al.  Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni, and Sergey Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. arXiv preprint arXiv:1707.02920, 2017. URL http://arxiv.org/abs/1707.02920.
- Lesort et al.  Timothée Lesort, Mathieu Seurin, Xinrui Li, Natalia Díaz Rodríguez, and David Filliat. Unsupervised state representation learning with robotic priors: a robustness benchmark. arXiv preprint arXiv:1709.05185, 2017.
- Abbeel and Ng  Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
- Bojarski et al.  Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
- Stadie et al.  Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. arXiv preprint arXiv:1703.01703, 2017.
- Liu et al.  YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374, 2017. URL http://arxiv.org/abs/1707.03374.
- Goodfellow et al.  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- Ho and Ermon  Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
- Sermanet et al.  Pierre Sermanet, Corey Lynch, Jasmine Hsu, and Sergey Levine. Time-contrastive networks: Self-supervised learning from multi-view observation. 2017. doi: 10.1109/CVPRW.2017.69. URL http://arxiv.org/abs/1704.06888.
- Schroff et al.  Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. CoRR, abs/1503.03832, 2015. URL http://arxiv.org/abs/1503.03832.
- Hamabe et al.  Takuma Hamabe, Hiraki Goto, and Jun Miura. A programming by demonstration system for human-robot collaborative assembly tasks. In Robotics and Biomimetics (ROBIO), 2015 IEEE International Conference on, pages 1195–1201. IEEE, 2015.
- Gupta et al.  Abhishek Gupta, Clemens Eppner, Sergey Levine, and Pieter Abbeel. Learning dexterous manipulation for a soft robotic hand from human demonstrations. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 3786–3793. IEEE, 2016.
- Shon et al.  Aaron Shon, Keith Grochow, Aaron Hertzmann, and Rajesh P Rao. Learning shared latent structure for image synthesis and robotic imitation. In Advances in neural information processing systems, pages 1233–1240, 2006.