A generalist robot that can operate alongside humans and perform a variety of tasks in unconstrained environments is a long-standing vision of robotic learning. It is highly desirable for such robots to be capable of understanding instructions in natural language from untrained users. In this paper, we address the problem of programming robots using natural language.
However, this is not the only way neural networks can be used for controlling robots. It is also possible to use sensor data such as the camera feed to construct a vector space representation of the world and then to plan a path in this space. For example, an object detector can be used to find all the objects in the scene which can then be used to determine the robot motion necessary to move the objects to particular positions. Although this introduces rigidity in the representation of the world, the advantages of this approach include modularity (the object detector can be replaced without modifying the rest of the system) and interpretability (the output of the object detector can be examined separately).
The majority of recent works on imitation learning have used an input device such as a game controller, a VR controller, a smartphone with visual-odometry-based 6-DoF position tracking, or a space mouse to record experts teleoperating a robot. In this work, we take a different approach to collecting expert demonstrations. We give a natural language instruction prompt and have experts write a Python function that controls the robot to accomplish the task specified in the instruction (Fig. 1). This function takes the output of an object detector as its argument and moves the end-effector of the robot arm to perform the specified task (Fig. 2). The dataset collected in this manner is used to train a neural network that takes a natural language instruction as input and predicts a Python function block which controls the robot when executed.
A few examples of the tasks we consider are: (a) Push the orange towards the apple, (b) Place the apple between the orange and the apple, (c) Pick up the orange and use it to push the bottle off the edge of the table. Although our robot does not use a force sensor and can only move the end-effector using position control, it is possible to expand the set of primitive instructions of the robot to include complex macro instructions, such as a peg-in-hole insertion instruction that invokes a separately trained policy network. Our approach is most suitable for “gluing” together simpler commands to compose a more complex program. A potential application for our method is in augmenting teach pendants to accept instructions in natural language.
There are several advantages of having expert demonstrations in the form of program code. One is that the expert program can invoke complex subroutines such as a constraint solver. It can be difficult to train an end-to-end neural network to copy the behavior of such complex modules. The other advantage is that the intention of the expert is clearer and less ambiguous in the program representation than in teleoperated demonstrations. For example, to “push the orange off the table”, the program to perform this task clearly indicates the robot motion for different possible positions of the orange, whereas, we would need many more teleoperated demonstrations each corresponding to a different position of the object to be able to train a neural network to reliably copy the expert behavior. Finally, the program representation is more interpretable and amenable to analysis before it is executed.
Our contributions are:
We propose an imitation learning setup where the expert demonstrations are in the form of program code and use a neural translation model to translate instructions in English to Python code that controls the robot.
We show that the proposed method performs better than directly mapping natural language instructions to actuation commands.
The rest of this paper is organised as follows. In the following section, related work is discussed. Section 3 defines the problem statement. In Section 4, the neural network architecture that we use is described in detail. Experimental results are discussed in Section 5, and Section 6 concludes the paper.
II Related Work
Several recent papers have demonstrated that it is possible to learn visuomotor skills from human demonstrations. Input devices such as VR controllers, space mice, and smartphones with visual-odometry-based 6-DoF position tracking have been used to gather expert demonstrations. What is common to all of these approaches is that some input device is used to enable human experts to teleoperate the robot. In this work, we deviate from that approach by having experts indirectly control the robot by writing Python programs.
Understanding natural language in the context of the visual scene of the robot has been addressed by several papers. In , a robot system to pick and place common objects is built where the object is inferred from the input image and grounded language expressions. The problem of referring to objects in an unambiguous manner is addressed in . Although there may be ambiguities in the natural language input, the spatial relationships between objects are used to disambiguate the meaning and resolve the object being referred to. Understanding instructions provided in spoken language with incomplete information based on the context of the input image and common sense reasoning is addressed in . The authors in  propose a synthetic dataset for visual question answering to debug and understand weaknesses in different grounded natural language reasoning models. In , the Blocks dataset is proposed. This dataset contains instructions to move blocks such as “Move block 6 north of block 8” along with the positions of all the blocks in the scene before and after the instruction has been executed. Our work also has an emphasis on spatial reasoning, but we go beyond moving around a single block or object.
Unlike the above mentioned works, the Learning from Play (LfP) approach in  is end-to-end imitation learning with the neural network directly controlling the actuators. This builds on goal-based imitation learning where a neural network prediction is conditioned on the current image observation and the desired target image. Rather than using the target image,  replaces it with a latent vector derived from the natural language input. In this paper, we use the more traditional imitation learning approach and have experts translate natural language instructions into Python code.
In another line of work, a large dataset of human demonstrations (not teleoperated) is used to learn a video classifier that predicts the task being performed in the video. This classifier is then used as a reward function for reinforcement learning to train another network that takes the natural language instruction and predicts the desired goal pose. In this work, we do not use reinforcement learning or a reward function and instead use the programs written by the expert in a fully supervised learning setting.
Much attention is devoted to object detection in the computer vision literature. Although end-to-end imitation learning does not use object detection, it is also possible to use a pipelined approach where object detection is one module. For example, in , the pick-and-place task is performed by picking up the object at a grasp point and then bringing it near the camera to classify which bin the object should be placed in. In this paper, we use a fully convolutional object detector inspired by  to detect the positions and sizes of all the objects in the scene.
The problem of answering queries in natural language using data from a table is addressed in . There are broadly two approaches to this problem. One way is to approach this as a semantic parsing problem and to generate a logical form or a SQL query from the natural language input. The other way is to process the natural language instruction along with the contents of the table to directly predict the answer. The latter approach subsumes the process of running the query into the neural network itself. In this paper, we generate Python function blocks rather than SQL statements from natural language.
In , the authors propose generating code from documentation strings. In , a pre-trained model for programming languages is proposed. A “transpiler” that translates code from one language to another is proposed in . Although this paper also proposes generating program code from natural language, the end goal of controlling the robot is different. As a result, the evaluation metrics and baselines also differ. Moreover, our primary objective in this work is not to improve on code generation methods, but to show that generating code can outperform direct prediction of actuator commands.
III Task Description
We consider two different tasks where the task is specified using natural language.
III-A Arrange task
This task involves taking objects from a tray and placing them at different positions on the table. The instruction in natural language along with the width and height of all the objects are the inputs and the goal is to predict the positions of the objects on the table. The motion planning to pick up the object from the tray and place it at the specified location is performed separately (this is not learnt).
Figures 3 and 4 show sample programs that compute the positions of the objects for the given natural language instruction. The program uses the Cassowary constraint solver (which uses the simplex method) to declaratively specify constraints for the positions of the objects. Note that the program is not entirely declarative: it can access the intermediate solution before declaring additional constraints (Fig. 4). After the program is executed, the positions of all the objects determined by the constraint solver are used to plan the pick-and-place motion of the robot arm.
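As an illustration of the kind of program described above, the following is a minimal sketch for a hypothetical instruction such as “place the apple at the center and the orange to the right of the apple”. It is not one of the programs from the dataset. The function name `arrange`, the normalized table size, and the `GAP` spacing are assumptions, and plain arithmetic is used here in place of the Cassowary solver so that the sketch has no dependencies.

```python
from typing import Dict, Tuple

TABLE_W, TABLE_H = 1.0, 1.0  # assumed: table size in normalized units
GAP = 0.05                   # assumed: spacing between neighbouring objects


def arrange(sizes: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    """Hypothetical program for: 'place the apple at the center and
    the orange to the right of the apple'.

    `sizes` maps object name -> (width, height); the result maps
    object name -> (x, y) center position on the table.
    """
    apple_w, _ = sizes["apple"]
    orange_w, _ = sizes["orange"]

    # Constraint 1: the apple's center coincides with the table center.
    apple = (TABLE_W / 2, TABLE_H / 2)

    # Constraint 2: the orange's left edge sits GAP to the right of the
    # apple's right edge, and both share the same y coordinate.
    orange_x = apple[0] + apple_w / 2 + GAP + orange_w / 2
    orange = (orange_x, apple[1])

    return {"apple": apple, "orange": orange}


positions = arrange({"apple": (0.10, 0.10), "orange": (0.08, 0.08)})
```

With a constraint solver, the two commented constraints would be declared rather than solved by hand, which is what allows more complex relations (such as “in between”) to be stated without deriving the arithmetic manually.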
III-B Manipulation task
This task involves manipulating objects on the table as specified by the natural language instruction. Typical tasks involve reaching for an object, pushing an object somewhere, and picking-and-placing an object. To control the robot, the action space is (a) to move the end effector of the robot to the specified position (x, y, z, r), and (b) to control the suction gripper (on/off). The robot can be controlled by emitting a sequence of end effector poses and grip commands. An object detector makes available the positions and sizes of all the objects. The goal is to take the positions and sizes of all the objects on the table and to emit a sequence of end-effector positions and gripper on/off commands.
Figures 5 and 6 show sample programs that control the robot to accomplish the task specified by the natural language instruction. Unlike the previous task, the objects are already on the table. Moreover, the program must not merely specify the desired state, but it must also directly control the robot to get to the desired state. So, the current positions of the objects are used to compute the appropriate actions.
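To make the shape of such a program concrete, here is a minimal sketch for the instruction “push the orange towards the apple”. It is an illustration only, not a program from the dataset: the `Obj` record, the `MockRobot` class with its `move_to` and `grip` methods, and the `SAFE_Z` / `PUSH_Z` heights are all hypothetical stand-ins for the detector output and robot API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

SAFE_Z = 0.10  # assumed travel height above the table
PUSH_Z = 0.01  # assumed height at which the end-effector contacts objects


@dataclass
class Obj:
    x: float  # center x on the table
    y: float  # center y on the table
    w: float  # width
    h: float  # height (extent along y)


class MockRobot:
    """Hypothetical robot API that records commands instead of moving hardware."""

    def __init__(self) -> None:
        self.log: List[Tuple] = []

    def move_to(self, x: float, y: float, z: float, r: float = 0.0) -> None:
        self.log.append(("move", x, y, z, r))

    def grip(self, on: bool) -> None:
        self.log.append(("grip", on))


def push_orange_towards_apple(objs: Dict[str, Obj], robot: MockRobot) -> None:
    orange, apple = objs["orange"], objs["apple"]
    dx, dy = apple.x - orange.x, apple.y - orange.y
    dist = (dx * dx + dy * dy) ** 0.5
    if dist == 0.0:
        return  # objects coincide; no push direction is defined
    ux, uy = dx / dist, dy / dist  # unit vector from orange to apple

    # Approach point: one orange-width behind the orange, away from the apple.
    sx, sy = orange.x - ux * orange.w, orange.y - uy * orange.w
    # Travel along the push direction, shortened by the two half-widths.
    travel = max(dist - (orange.w + apple.w) / 2, 0.0)
    ex, ey = sx + ux * travel, sy + uy * travel

    robot.move_to(sx, sy, SAFE_Z)  # hover behind the orange
    robot.move_to(sx, sy, PUSH_Z)  # descend to pushing height
    robot.move_to(ex, ey, PUSH_Z)  # push along the line towards the apple
    robot.move_to(ex, ey, SAFE_Z)  # retract


robot = MockRobot()
scene = {"orange": Obj(0.2, 0.5, 0.08, 0.08), "apple": Obj(0.6, 0.5, 0.10, 0.10)}
push_orange_towards_apple(scene, robot)
```

Because the motion is computed from the detected positions, the same function covers any placement of the two objects, which is the property that teleoperated demonstrations would need many samples to convey.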
IV Network Architecture
The proposed architecture is shown in Fig. 7. The natural language instruction is taken as input, and the neural machine translation model generates the Python program that performs the task specified by the instruction.
It uses an LSTM based neural machine translation model with attention. Unlike most language vision models, the neural network does not take the image observation as an input. Rather, the program generated by the network accesses the attributes of the objects detected and controls the robot based on that.
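Since the network outputs program text rather than actions, the predicted text has to be turned into something callable before it can access the detected objects. One possible way to do this, sketched below under assumptions, is to wrap the predicted body in a function definition and evaluate it with Python's built-in `exec`. The helper name `compile_program`, the `(objs, robot)` argument convention, and the use of a plain list as a stand-in robot are hypothetical and chosen only to keep the example short.

```python
import textwrap
from typing import Callable


def compile_program(body: str, name: str = "program") -> Callable:
    """Wrap a predicted function body in a `def` and return the callable.

    Assumes the body refers to its inputs as `objs` and `robot`.
    """
    indented = textwrap.indent(textwrap.dedent(body).strip("\n"), "    ")
    source = f"def {name}(objs, robot):\n{indented}\n"
    namespace: dict = {}
    exec(source, namespace)  # defines the function inside `namespace`
    return namespace[name]


# Stand-in for decoder output; a real prediction would come from the model.
predicted_body = """
target = objs["apple"]
robot.append(("move", target[0], target[1]))
"""

program = compile_program(predicted_body)
commands: list = []  # a plain list plays the role of the robot here
program({"apple": (0.3, 0.7)}, commands)
```

Keeping the generated text as source until the last moment is also what makes it possible to inspect a program before it is run on the robot.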
The input natural language instruction is tokenized, and the embeddings for the tokens are obtained using a pre-trained BERT model. Note that the BERT layers are frozen and remain unchanged during training. The input sequence embeddings are processed by an encoder LSTM with hidden states $\bar{h}_s$. After all the input tokens are processed, a decoder LSTM predicts the target sequence that is used to construct the Python function body.
At each step $t$ of the decoder, the decoder state $h_t$ is used to attend to the input states and infer the context vector $c_t$ that is used to predict the output $y_t$.
The variable length alignment vector $a_t$, of size equal to the number of steps in the input sequence, is obtained by comparing the decoder hidden state $h_t$ with each of the encoder hidden states $\bar{h}_s$:
$$a_t(s) = \frac{\exp\left(\mathrm{score}(h_t, \bar{h}_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, \bar{h}_{s'})\right)}$$
The context vector $c_t$ is computed as the weighted average of the hidden states of the encoder:
$$c_t = \sum_{s} a_t(s)\, \bar{h}_s$$
The context vector $c_t$ and the decoder state $h_t$ are concatenated and passed through fully connected layers to predict the target sequence token $y_t$.
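A single decoder step of this attention mechanism can be sketched in plain Python as follows. This is an illustration rather than the trained model: the dot product is assumed as the comparison (score) function, the vectors are toy values, and the names `attention_step`, `h_t`, and `enc_states` are chosen for the example.

```python
import math
from typing import List, Tuple


def attention_step(h_t: List[float], enc_states: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Return (alignment weights, context vector) for one decoder step."""
    # Compare the decoder state with every encoder state (assumed: dot product).
    scores = [sum(a * b for a, b in zip(h_t, h_s)) for h_s in enc_states]

    # Softmax over input positions, shifted by the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    align = [e / total for e in exps]

    # Context vector: alignment-weighted average of the encoder states.
    dim = len(enc_states[0])
    context = [sum(a * h_s[i] for a, h_s in zip(align, enc_states)) for i in range(dim)]
    return align, context


enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three input tokens
h_t = [2.0, 0.0]                                   # current decoder state
align, context = attention_step(h_t, enc_states)

# The concatenation below is what would be fed to the fully connected layers.
concat = context + h_t
```

The alignment vector has one entry per input token, so its length varies with the instruction, while the context vector always has the encoder's hidden size.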
We first evaluate the proposed approach in a simulated environment. Subsequently, we discuss the performance on a real robot arm.
V-A1 Arrange Dataset
The arrange task involves arranging objects on the table as specified by the instruction in natural language. The object positions may be specified as absolute positions or in terms relative to other objects placed on the table. For this task, we have collected the arrange dataset, a parallel corpus of instructions in English and Python functions. The function takes the object sizes as arguments and sets the position of the objects as indicated in the instruction. Some examples are shown in Figs. 3 and 4. Note that in addition to the object sizes, the function is also given the Cassowary linear constraint solver to specify the positions of objects as constraints to be solved (the Cassowary algorithm is used by Apple UIKit to place UI elements in GUIs). The arrange dataset has a training / development / test split of 102 / 11 / 11 samples.
We also execute each program in the corpus for 20 different random initializations of the sizes of the objects to obtain the positions of the objects given those sizes. This secondary dataset is used for fair comparison with baseline models that directly predict the positions of the objects given the instruction and sizes of the objects.
V-A2 Manipulation Dataset
This task involves manipulating objects already present on the table as specified by the instruction in natural language. Typical manipulation tasks in this dataset are reaching for an object, pushing an object somewhere, and picking-and-placing an object. For this task, we have collected the manipulation dataset, a parallel corpus of instructions in English and Python functions. The function takes the positions and sizes of all the objects on the table and controls the robot through an API that allows it to specify a sequence of end-effector poses and gripper states (on/off). A few examples are shown in Figs. 5 and 6. The manipulation dataset has training / development / test split of 122 / 12 / 12.
For each sample in the manipulation corpus, the Python program is executed for 20 random initializations of the positions and sizes of the objects on the table, using a mock robot that records the sequence of end-effector positions and gripper state changes. This is used for fair comparison with baseline models that directly predict the sequence of end-effector poses given the instruction text and the sizes and positions of the objects.
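The procedure of executing a program under random initializations with a recording robot can be sketched as below. The ranges used for sampling positions and sizes, the `RecordingRobot` class, and the toy `reach_apple` program are assumptions made for illustration; only the count of 20 runs per program comes from the text.

```python
import random
from typing import Callable, Dict, List, Tuple

Obj = Tuple[float, float, float, float]  # assumed layout: (x, y, width, height)
Pose = Tuple[float, float, float]        # (x, y, z) end-effector position


class RecordingRobot:
    """Hypothetical mock robot that stores every commanded pose."""

    def __init__(self) -> None:
        self.trajectory: List[Pose] = []

    def move_to(self, x: float, y: float, z: float) -> None:
        self.trajectory.append((x, y, z))


def reach_apple(objs: Dict[str, Obj], robot: RecordingRobot) -> None:
    """Toy stand-in for an expert program: hover over the apple, then descend."""
    x, y, _, _ = objs["apple"]
    robot.move_to(x, y, 0.1)
    robot.move_to(x, y, 0.0)


def rollouts(program: Callable, names: List[str], n: int = 20, seed: int = 0):
    """Run `program` on n random scenes; return (scene, trajectory) pairs."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Assumed sampling ranges, in normalized table coordinates.
        objs = {
            name: (rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9),
                   rng.uniform(0.05, 0.15), rng.uniform(0.05, 0.15))
            for name in names
        }
        robot = RecordingRobot()
        program(objs, robot)
        data.append((objs, robot.trajectory))
    return data


samples = rollouts(reach_apple, ["apple", "orange"])
```

Each resulting pair gives a baseline model the same inputs (object positions and sizes) and a target trajectory, without the baseline ever seeing the program itself.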
For the arrange dataset, we use LSTM+FC layers as the baseline. The LSTM encodes the instruction text into a fixed size vector. This is concatenated with the sizes of all the objects and passed through several fully connected layers to directly predict the positions of all the objects.
For the manipulation dataset, we use an encoder LSTM to encode the instruction and a decoder LSTM that, at every timestep, concatenates the decoder state and the attention context vector at that timestep along with the positions and sizes of all the objects on the table, and passes this concatenated vector through fully connected layers to predict the end-effector pose and grip state.
V-C Evaluation Metric
We use accuracy as the evaluation metric. Each of the predicted programs is executed 20 times with randomized object positions and sizes. For the arrange dataset, we treat the prediction as “correct” if the absolute difference between the predicted position and the ground truth position is less than 10% of the width of the table (on both x and y axes). For the manipulation dataset, the prediction is considered accurate if the absolute difference between the predicted trajectory and the ground truth trajectory is less than 10% of the width of the table at every timestep. This is merely an easy-to-evaluate proxy for whether the robot is truly accomplishing the task in the instruction. A more thorough evaluation that properly tests whether the task specified was performed successfully is conducted on a few samples with a real robot arm (Section V-E).
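The two correctness criteria can be written down as a short sketch. The function names, the normalized table width of 1.0, and the choice to count a trajectory of different length as incorrect are assumptions; the 10% tolerance is taken from the description above.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def position_correct(pred: Dict[str, Point], truth: Dict[str, Point],
                     table_width: float = 1.0, tol: float = 0.10) -> bool:
    """Arrange task: every object within tol * table_width on both x and y."""
    limit = tol * table_width
    return all(
        name in pred
        and abs(pred[name][0] - truth[name][0]) < limit
        and abs(pred[name][1] - truth[name][1]) < limit
        for name in truth
    )


def trajectory_correct(pred: List[Sequence[float]], truth: List[Sequence[float]],
                       table_width: float = 1.0, tol: float = 0.10) -> bool:
    """Manipulation task: every coordinate at every timestep within tolerance."""
    if len(pred) != len(truth):
        return False  # assumption: a length mismatch counts as incorrect
    limit = tol * table_width
    return all(
        abs(p - t) < limit
        for p_step, t_step in zip(pred, truth)
        for p, t in zip(p_step, t_step)
    )


def accuracy(flags: List[bool]) -> float:
    """Fraction of executions judged correct."""
    return sum(flags) / len(flags)


ok = position_correct({"apple": (0.50, 0.50)}, {"apple": (0.55, 0.45)})
bad = position_correct({"apple": (0.50, 0.50)}, {"apple": (0.65, 0.50)})
score = accuracy([True, False, True, True])
```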
V-D Discussion of Results
| Model | Arrange Task | Manipulation Task |
|---|---|---|
| Proposed Seq2Seq model | 80.8% | 93.2% |
Table I compares the results of the proposed method with the baselines. All the architectures are trained with the Adam optimizer with learning rate 1e-3. For both tasks, the proposed method of generating a Python program and then executing that program outperforms the baselines which directly regress the object positions (arrange dataset) or end-effector poses (manipulation dataset).
Figures 8-11 show a few programs generated from the test set. Figures 12 and 13 show the attention weights for different tokens of the input instruction text when predicting a particular output token. We see that the attention mechanism is focusing on the relevant part of the instruction when predicting the program.
We have also experimented with replacing words in the instruction text with synonyms. We found that replacing “put” with “keep”, “place”, and “put down” always resulted in correct predictions. Likewise, we found that removing the word “the” does not change the output. Similarly, replacing “right-top corner” with only “right-top” or “top right” results in no changes to the predicted sequence. However, substituting the words for objects, such as replacing “bottle” with “flask” or “pitcher” and “cup” with “chalice”, caused incorrect predictions.
We also found that the generalization worsens as the number of phrases in the input sequence increases (Figs. 8 and 10). There are only a few samples in the training set with 4 phrases (such as “place the orange at the bottom-right, the apple at the top-right, banana at the center, and the lemon to the right of the apple”). The model overfits on such long phrases and gives incorrect predictions that resemble the training data. However, if the input instruction is split at the commas into multiple short phrases, the model correctly predicts the positions for each of the phrases. But, this is not a viable solution because there are many instructions where such a split is not possible since the latter phrases refer to objects in the former (for example, “place the apple at the center, the orange at the top-right, and the banana in between them”).
V-E Demonstration on the Robot Arm
We demonstrate the complete pipeline with a Dobot Magician (Fig. 1). Common objects such as fruits, cups, magnets, etc. are used. An object detector is trained to detect the position and size of these objects, but the depth (tallness from the table surface) of the object is measured beforehand and hard coded. The camera feed from an overhead camera is passed through the object detector whose output is passed as arguments to the Python function generated by the proposed method from the natural language instruction, and the function is executed. Out of 25 trials, 19 were successful with the robot accomplishing the task. All the failures were due to inaccuracies in the object detector or the suction gripper failing to pick up the object. A video of the robot in operation is available at: https://youtu.be/TtoYE3EsDkc
Computer programs are a way to precisely specify tasks. We find that programs are rich representations of the expert demonstrations and are beneficial for learning to control robots. We also showed that translating natural language instructions to computer programs outperforms directly predicting the robot actuator commands. Moreover, the predicted programs are interpretable and easier to analyse than end-to-end neural networks that directly predict robot actions. Although this approach is necessarily constrained to those problems for which the solution can easily be expressed as a program, the proposed approach may find use in augmenting teach pendants for industrial robots to generate programs based on verbal instructions.
We thank Mohammed Rizvi for his suggestions. We also thank the Robert Bosch Center for Cyber-Physical Systems for funding support.
- (2017) A parallel corpus of Python functions and documentation strings for automated code documentation and code generation. arXiv preprint arXiv:1707.02275.
- (2019) Learning physics-based manipulation in clutter: combining image-based generalization and look-ahead planning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6562–6569.
- (2016) Natural language communication with robots. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 751–761.
- (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
- (2020) Enabling robots to understand incomplete natural language instructions using commonsense reasoning. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 1963–1969.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2019) Learning to generate unambiguous spatial referring expressions for real-world environments. arXiv preprint arXiv:1904.07165.
- (2020) CodeBERT: a pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155.
- A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters 1 (2), pp. 661–667.
- (2020) Imitation learning for high precision peg-in-hole tasks. In 2020 6th International Conference on Control, Automation and Robotics (ICCAR), pp. 368–372.
- (2020) TAPAS: weakly supervised table parsing via pre-training. arXiv preprint arXiv:2004.02349.
- MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- (2020) Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511.
- (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
- CLEVR-Ref+: diagnosing visual reasoning with referring expressions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4185–4194.
- (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
- (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
- (2020) Grounding language in play. arXiv preprint arXiv:2005.07648.
- (2019) Scaling robot supervision to hundreds of hours with RoboTurk: robotic manipulation dataset through human reasoning and dexterity. arXiv preprint arXiv:1911.04052.
- (2018) RoboTurk: a crowdsourcing platform for robotic skill learning through imitation. arXiv preprint arXiv:1811.02790.
- From virtual demonstration to real-world manipulation using LSTM and MDN. Thirty-Second AAAI Conference on Artificial Intelligence.
- (2018) Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3758–3765.
- (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- (2020) Concept2Robot: learning manipulation concepts from instructions and human demonstrations. In Robotics: Science and Systems.
- (2018) Interactive visual grounding of referring expressions for human-robot interaction. arXiv preprint arXiv:1806.03831.
- (2019) One-shot object localization using learnt visual cues via siamese networks. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6700–6705.
- (2020) Multi-instance aware localization for end-to-end imitation learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- (2020) Teaching robots novel objects by pointing at them. In 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 1101–1106.
- (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557.
- (2018) Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
- (2018) Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.