This paper addresses the problem of spatial reasoning implicit in natural language instructions to the robot to move objects. Figure 1 illustrates a simple model that is representative of this problem. In this example, the objects are the playing cards. The text instruction is of the form “Line up card 3 squarely above card 6”. The robot needs to use this along with the visual input from the camera to infer the start co-ordinate from where the robot must pick up the object and the end co-ordinate where the object must be placed.
One approach is to use an end-to-end network that takes both the image from the camera and the text instruction and directly predicts the start and end co-ordinates. But such a network must implicitly learn to detect and localize objects, which can be difficult to do from a small dataset. Alternatively, the image from the camera feed can be processed by a separately trained object detector to identify and localize all the objects in the image. The positions of all the objects along with the natural language instruction are then used by a language network to predict the start and end co-ordinates. The approach in  represents the output of the object detector as a list of 2D co-ordinates indicating the positions of the objects in the scene (see Lang-FCNet in Fig. 2(i)). However, this representation has shortcomings that result in poor performance. Hence, we propose an alternative representation for the output of the object detector.
To see why representing the object positions as a list can be sub-optimal, consider the problem of finding “the second card in a row of cards”. If fully connected layers are used to process the coordinate-list output of the object detector, the network can overfit to the specific locations of the row of cards in the training set, since its structure is not inherently conducive to generalizing to different positions of the row on the table. To address this, we propose representing the output of the object detector as a binary 2D image in which each lit pixel corresponds to an object, and using a convolutional network to predict the start and end positions (see Lang-UNet in Fig. 2(ii)). With this, we expect improved generalization because the convolution operation is, by construction, translation-equivariant: a pattern of objects is processed in the same way wherever it appears in the image.
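The difference between the two representations can be illustrated with a minimal sketch. The grid size, the integer (row, column) positions, and the helper names `to_binary_grid` and `shift` are assumptions made for illustration only and are not taken from the paper.

```python
def to_binary_grid(points, height, width):
    """Rasterize (row, col) object positions onto a binary height x width grid."""
    grid = [[0] * width for _ in range(height)]
    for row, col in points:
        if 0 <= row < height and 0 <= col < width:
            grid[row][col] = 1  # a lit pixel marks one detected object
    return grid


def shift(points, d_row, d_col):
    """Translate every object position by the same offset."""
    return [(r + d_row, c + d_col) for r, c in points]


# A row of three cards, and the same row moved elsewhere on the table.
row_of_cards = [(1, 1), (1, 2), (1, 3)]
moved_row = shift(row_of_cards, 2, 3)

grid_a = to_binary_grid(row_of_cards, 6, 8)
grid_b = to_binary_grid(moved_row, 6, 8)

# In the list form every co-ordinate changes after the move; in the grid
# form the same local pattern [1, 1, 1] simply appears at another place.
for line in grid_b:
    print("".join("#" if v else "." for v in line))
```

A convolutional filter that responds to the local pattern of a row responds to it at either location, whereas a fully connected layer over the co-ordinate list sees two unrelated input vectors.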
Object representation in a 2D binary grid via a pre-processing network: We experiment with two different spatial representations and show that instead of representing the localized objects as a list of 2D co-ordinates and processing them with fully connected layers, the spatial reasoning can be improved by representing the detected objects on a 2D binary grid and using a convolutional U-Net to predict the pixels on the grid corresponding to the start and end positions.
Multi-head attention and visual grounding: We show that a recurrent network that generates attention for visual grounding generalizes better than a network without attention and can overcome biases in the training dataset.
II Related Work
Research on manipulating robots using natural language instructions has gained significant interest. The body of related work can be broadly categorized into end-to-end approaches and pipelined approaches. In the end-to-end approaches, the robotic agent interacts with the surrounding environment while executing natural language instructions and takes a sequence of actions to fulfil its goal. In contrast, the line of work that follows the pipelined approach breaks up the task into navigation planning and language grounding processes. Our work is similar to prior work that used neural models to localise the scene objects and to ground the spatial relations in unrestricted natural language instructions into a blocks world with complex goal configurations. While that work suffers from poor generalization, we use the attention mechanism to solve the problem in a natural way. Recent work in reinforcement learning has demonstrated the usefulness of representing the state information as pixels in a 2D image instead of a list of numbers, which is similar in spirit to this paper although the actual representation is different.
Earlier works on Human-Robot Interaction have focused on converting a language command with restricted vocabulary and simple actions into a structured form that an agent can easily understand and execute. Reinforcement learning based techniques have also been explored for the instruction-following task. However, approaches with a limited action space and little diversity in the language instructions have proven to be non-robust, especially when the instructions are generated by non-experts. Our work addresses these challenges and handles the underlying diversity in the language.
Vision and language grounding are the two important components of effective human-robot communication through natural language instructions in the context of the surrounding world view. Grounding visual inputs has proven to be essential to many vision tasks such as image captioning, visual question-answering, embodied question-answering and vision-language navigation. A large body of work has addressed grounding natural language instructions using a variety of techniques such as semantic and syntactic parsing and alignment models. Improving language understanding using human-robot dialog and commonsense reasoning, as well as the generation of unambiguous spatial referring expressions, have also been explored. These papers explore different aspects of grounding natural language in vision, whereas our work focuses on spatial reasoning from natural language instructions.
III Problem Statement
Given a natural language instruction with embedded spatial cues and an image of the world view, the goal is to understand the instruction in the context of the world view and to act in accordance with the spatial cues. For the pick-and-place task, the robot must move to the location where the desired object is present, pick it up, and then place it at the goal/target location. We present a few examples from the datasets in Sec. V-A:
Move block 5 from the top right of box 11 to above box 14 in the middle with a small space.
Place block 5 one and a half columns to the right of block 18.
Pick the first apple from row number one.
Many oranges are placed at random. Pick the biggest orange.
Even though the first two expressions above differ considerably in their language form, they refer to the same world view and instruct the robot to perform the same action. The model must be robust enough to discern the start block (block 5 in the first two examples), choose the correct target anchor (e.g. block 14 and not block 11, in the first example), recognize the notion of direction (e.g. right of block 18 or above block 14) and ground the distance information (e.g. one and a half columns to the right of block 18). The last two representative instructions test the model’s ability to understand abstract concepts (e.g. top row) and to reason about object size (e.g. biggest orange), ordinality (e.g. first apple) and cardinality (e.g. row number one).
IV Network Architecture
Our proposed language network Lang-UNet (Fig. 3) takes as input a natural language instruction and the object positions and sizes from the object detector, and predicts the start and end co-ordinates. The robot picks the object from the start location and places it at the end location.
The object positions are represented as a binary image with one channel per distinct object type: each object in the scene is represented as a pixel whose channel values form a one-hot vector corresponding to the type of the object. The sizes of the objects are represented in a separate image whose value is zero wherever there is no object.
The instruction text is tokenised with minimal pre-processing (lowercasing words, removing punctuation) into a sequence of tokens and fed into an embedding layer to obtain a vector representation for each token. Note that it is possible to use BERT to obtain embeddings for the tokens in the instruction, but we chose to learn embeddings from random initialization for easier comparison with a number of previous works, which highlights the benefit of our proposed representation of the object positions and sizes. The token embeddings are passed through a 2-layer bi-directional LSTM network. The encoded vector outputs are then passed through a 1-D convolutional network with softmax activation to obtain the attention energies for four attention heads, so that each token receives one attention value per attention head. The attention values of each head are used to compute one instruction embedding per head.
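The attention step can be sketched in a few lines. This is only an illustration under stated assumptions: the softmax is assumed to run over the tokens of the instruction, and each instruction embedding is assumed to be the attention-weighted sum of the BiLSTM outputs. The function names `softmax` and `attention_pool` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(encoded, energies):
    """Pool T token vectors into K instruction embeddings.

    encoded:  T x D list of encoded token vectors (assumed BiLSTM outputs).
    energies: T x K list of raw attention scores, one column per head.
    Returns a K x D list: one attention-weighted sum per head.
    """
    num_tokens, dim, num_heads = len(encoded), len(encoded[0]), len(energies[0])
    embeddings = []
    for k in range(num_heads):
        # Assumption: attention is normalized over the tokens for each head.
        weights = softmax([energies[t][k] for t in range(num_tokens)])
        pooled = [
            sum(weights[t] * encoded[t][d] for t in range(num_tokens))
            for d in range(dim)
        ]
        embeddings.append(pooled)
    return embeddings


# Two tokens, two heads: head 0 focuses on token 0, head 1 is uniform.
encoded = [[1.0, 0.0], [0.0, 1.0]]
energies = [[50.0, 0.0], [0.0, 0.0]]
print(attention_pool(encoded, energies))
```

With four heads, the same routine would yield four embeddings, matching the four attention heads described above.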
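The first two instruction embeddings are projected, using two separate fully connected layers, to vectors with one entry per distinct object type. Each pixel of the object-position image is correlated with the first projected vector to obtain a soft-selection map (and a second map is obtained likewise from the second projected vector). These two maps “soft-select” appropriate objects in the scene while suppressing the rest.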
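The soft-selection step can be sketched as a per-pixel dot product. The layout of the position image (a nested list with a one-hot vector per pixel), the object ordering, and the name `soft_select` are assumptions for illustration; the projected scores would come from the fully connected layers and are hard-coded here.

```python
def soft_select(position_image, scores):
    """Correlate every pixel's one-hot object vector with per-object scores.

    position_image: H x W x N nested list; an all-zero vector means no object.
    scores:         length-N list, one (hypothetical) score per object type.
    Returns an H x W map that is large where a high-scoring object sits.
    """
    return [
        [sum(p * s for p, s in zip(pixel, scores)) for pixel in row]
        for row in position_image
    ]


# Assumed object order: [apple, banana, orange].
position_image = [
    [[1, 0, 0], [0, 0, 0]],  # apple at (0, 0), empty at (0, 1)
    [[0, 1, 0], [0, 0, 1]],  # banana at (1, 0), orange at (1, 1)
]
scores = [0.9, 0.05, 0.05]  # the instruction mostly refers to the apple

selection_map = soft_select(position_image, scores)
print(selection_map)
```

Because the resulting map has a single channel regardless of how many object types exist, anything downstream of it sees only where the selected object is, not which object it is.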
The remaining two instruction embeddings indicate attributes such as spatial relationships referred to in the instruction. They are repeated at every pixel location and appended to the two soft-selection maps and the size image to form the input image, which is passed through the convolutional hourglass network (U-Net) as shown in Fig. 3. The start location and the end location are extracted from the two output maps of the U-Net, respectively, by passing each through a spatial-softmax layer.
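A spatial-softmax layer can be sketched as follows. This uses one common form, in which the output map is normalized with a softmax over all pixels and the predicted location is the expected pixel co-ordinate; whether the paper uses exactly this form is an assumption, and `spatial_softmax` is a hypothetical name.

```python
import math


def spatial_softmax(score_map):
    """Return the expected (x, y) pixel location under a softmax over all pixels.

    score_map: H x W nested list of raw scores (e.g. one U-Net output map).
    x indexes columns and y indexes rows.
    """
    peak = max(v for row in score_map for v in row)
    exps = [[math.exp(v - peak) for v in row] for row in score_map]
    total = sum(sum(row) for row in exps)
    x = sum(col * e for row in exps for col, e in enumerate(row)) / total
    y = sum(r * e for r, row in enumerate(exps) for e in row) / total
    return x, y


# A map with a strong peak at row 2, column 1.
score_map = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 30.0, 0.0],
]
print(spatial_softmax(score_map))
```

Since the expectation is a weighted average, the predicted location is continuous and differentiable, which allows training with a regression loss on the co-ordinates.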
Note that the U-Net structure has no notion of which object is at a particular position (notice that its input size is independent of the number of distinct objects). It is only aware that a particular object “selected” by the BiLSTM layers via one of the two soft-selection maps is present at a location (Eqn. 2). This ensures that the U-Net learns only spatial relationships and not anything specific to an object. So, if the network has learnt to find the position of “an apple to the left of the banana”, it will generalize to “an orange to the left of the banana”.
V Experimental Results
We first evaluate the language network separately on two different datasets. Subsequently, we discuss the performance of the entire pipeline on a real robot arm.
V-A Datasets for the Language Network
To evaluate the language network, we assume that the object positions and sizes are known. We have experimented with two datasets. We use the publicly available Blocks dataset. Additionally, we synthesize a diagnostic dataset to test our model’s performance on more diverse and complicated visual scenarios. We briefly explain the datasets as follows:
Blocks 2D: In the blocks dataset, each sample has a natural language instruction and the positions of all 20 blocks as the input and the labels are the position from which the block must be picked up (start position) and the location where the block must be placed (end position). A sample instruction: “Pick up block 9 and place it above block 8”. The corpus has a training / development / test distribution of 3712 / 699 / 705 instructions.
Synthetic Dataset: The Blocks dataset has a few limitations: (a) all the blocks are uniquely numbered and only one instance of each block is in the scene, (b) in most cases, the instructions are such that the goal location can be obtained by finding the appropriate anchor block and a relative offset direction from a predefined set, and (c) the sizes of the blocks are identical. To diagnose whether the proposed model is capable of reasoning about a variety of other spatial relationships, object attributes (e.g. size), abstract concepts (e.g. row or column) as well as scenes with multiple instances of each object, we build a synthetic dataset. We follow an approach similar to that proposed in  and generate roughly 42,000 unique instructions with varied scenes containing objects of sizes randomly chosen between 1.0 and 3.0, and divide them into a train / dev / test distribution of 29465 / 4216 / 8416. Each scene contains a maximum of 12 distinct objects and up to a total of 24 objects. Some of the representative templates used to generate instructions are as follows: (i) Pick the largest / smallest obj, (ii) Pick the leftmost / rightmost obj from the row of objs, (iii) Pick the obj_pos obj from top row, (iv) Pick the obj1 above / below / to the left of / to the right of obj2. For this dataset, we predict only the start location and ignore the end location.
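Template-based instruction generation of this kind can be sketched as below. Only the template wording follows the examples listed above; the object vocabulary, the sampling scheme, and the function name `generate_instruction` are hypothetical, and scene generation and label computation are omitted.

```python
import random

# Templates modelled on (i), (ii) and (iv) above.
TEMPLATES = [
    "Pick the {extreme} {obj}",
    "Pick the {side} {obj} from the row of {obj}s",
    "Pick the {obj1} {relation} {obj2}",
]
RELATIONS = ["above", "below", "to the left of", "to the right of"]
OBJECTS = ["apple", "banana", "orange"]  # hypothetical vocabulary


def generate_instruction(rng):
    """Fill one randomly chosen template with randomly chosen slot values."""
    obj1, obj2 = rng.sample(OBJECTS, 2)  # two different object types
    template = rng.choice(TEMPLATES)
    # str.format ignores keyword arguments that the template does not use.
    return template.format(
        extreme=rng.choice(["largest", "smallest"]),
        side=rng.choice(["leftmost", "rightmost"]),
        relation=rng.choice(RELATIONS),
        obj=obj1,
        obj1=obj1,
        obj2=obj2,
    )


rng = random.Random(0)
for _ in range(3):
    print(generate_instruction(rng))
```

In a full generator, each instruction would be paired with a sampled scene and the ground-truth start location computed from that scene.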
V-B Evaluation Metrics for the Language Network
We define two evaluation metrics to compare the baseline results quantitatively with ours.
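Mean Squared Error (MSE): It is the average over the squared distances between the gold and predicted locations. Define the predicted and gold locations for instruction $i$ as $\hat{p}_i$ and $p_i$ respectively, and let $N$ be the number of instructions. Then $\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \lVert \hat{p}_i - p_i \rVert_2^2$. The center of the simulated world is at  and restricted in the range of  in both $x$ and $y$ directions.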
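As a concrete illustration, the metric can be computed as in the following sketch, which assumes locations are given as (x, y) pairs; the function name is hypothetical.

```python
def mean_squared_error(predicted, gold):
    """Average squared Euclidean distance between predicted and gold locations."""
    if not predicted or len(predicted) != len(gold):
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for (px, py), (gx, gy) in zip(predicted, gold):
        total += (px - gx) ** 2 + (py - gy) ** 2
    return total / len(predicted)


predicted = [(0.0, 0.0), (1.0, 1.0)]
gold = [(0.0, 0.0), (1.0, 3.0)]
print(mean_squared_error(predicted, gold))  # (0 + 4) / 2 = 2.0
```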
Tolerable Accuracy (TA): In most real-world applications, it is acceptable if the predicted and target locations do not match exactly, provided the distance between them is within a tolerable (application-specific) range. To account for this, we propose a new metric, Tolerable Accuracy. A prediction is considered correct if the distance error (in both the $x$ and $y$ directions in the simulated world) is less than a tolerance value tol. We count the number of correct predictions out of the total number of instructions to evaluate TA. Mathematically, with $(\hat{x}_i, \hat{y}_i)$ and $(x_i, y_i)$ the predicted and gold locations for instruction $i$ and $N$ the number of instructions, $\mathrm{TA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\,|\hat{x}_i - x_i| < \mathrm{tol} \,\wedge\, |\hat{y}_i - y_i| < \mathrm{tol}\,\right]$. Here $\mathbb{1}[c]$ is an indicator function whose value is $1$ when $c$ is true and $0$ otherwise.
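A minimal sketch of this metric, assuming (x, y) pairs and reporting the result as a percentage, is shown below; the function name and the example tolerance are hypothetical.

```python
def tolerable_accuracy(predicted, gold, tol):
    """Percentage of predictions within tol of the gold location in both x and y."""
    if not predicted or len(predicted) != len(gold):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, gold)
        if abs(px - gx) < tol and abs(py - gy) < tol
    )
    return 100.0 * correct / len(predicted)


predicted = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (0.5, 0.5)]
gold = [(0.05, 0.0), (1.0, 1.5), (2.0, 2.0), (0.5, 0.45)]
# Only the second prediction misses (its y error is 0.5), so 3 of 4 count.
print(tolerable_accuracy(predicted, gold, tol=0.1))  # 75.0
```

Note that the check is applied per axis, so the tolerance region is a square around the gold location rather than a circle.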
V-C Baseline Algorithms for the Language Network
We compare our proposed model with the following baseline algorithms:
Center: This model assumes complete knowledge about the start location and places the block at the middle of the table.
Random: The Random baseline decides both the start and end locations to be random. The Center and Random are two simple baselines taken from .
LSTM: The word embeddings of the instruction obtained from an embedding layer are passed through a word-level multi-stage LSTM model. The output from the last layer of the LSTM is the instruction embedding, which is passed through an MLP to obtain the final outputs. This model does not use the image information and hence can identify object positions only through the presence of any bias in the instructions.
LSTM+CNN: This is an end-to-end approach that takes the image and the text instruction as input and directly predicts the start and end positions. As above, the instruction embedding is obtained from the last-layer hidden state of the LSTM and the image is encoded using CNN features. They are concatenated and fed into an MLP to predict the start and end locations.
LSTM+CNN+SA: The encodings for the image and the instruction are generated as above, and then soft spatial attention is employed to get the final representation. An MLP takes this as input and produces the required output.
RNN-NoAttn-NoGround: The model architecture has a single-layer RNN at its heart. It takes the instruction as input, predicts the start object and the anchor object, and chooses from 8 pre-defined offsets corresponding to the 8 adjacent positions (right, bottom-right, etc.). The start position prediction is simply the position of the predicted object to be picked up. The end position is obtained by adding the position of the predicted anchor object and the predicted offset. Note that this model is not grounded because the predictions of the start, anchor, and offset are invariant to the positions of the objects in the scene and depend only on the natural language instruction. Furthermore, it cannot distinguish between two or more instances of the same object in the scene.
LangNet-Attn-NoGround: We extend the model of Bisk et al. by introducing an attention mechanism over a multi-layer BiLSTM.
Lang-FCNet: This model differs from the proposed Lang-UNet model in that object positions are represented as a list of 2D points rather than as a binary image, and they are passed through fully connected layers rather than the convolutional hourglass network (U-Net).
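The way the ungrounded baselines turn their three discrete predictions into co-ordinates can be sketched as follows. The axis convention (x to the right, y upward), the offset step, and all names are assumptions for illustration; the classifier that would produce the start object, anchor object, and offset is omitted.

```python
# Unit directions for the 8 adjacent positions (assumed: x right, y up).
OFFSETS = {
    "right": (1, 0), "top-right": (1, 1), "top": (0, 1), "top-left": (-1, 1),
    "left": (-1, 0), "bottom-left": (-1, -1), "bottom": (0, -1),
    "bottom-right": (1, -1),
}


def predict_positions(positions, start_obj, anchor_obj, offset_name, step):
    """Map predicted (start object, anchor object, offset) to co-ordinates.

    positions: dict from object id to its (x, y) position in the scene.
    step:      hypothetical hard-coded offset magnitude, e.g. one block width.
    """
    start = positions[start_obj]
    ax, ay = positions[anchor_obj]
    dx, dy = OFFSETS[offset_name]
    end = (ax + dx * step, ay + dy * step)
    return start, end


positions = {4: (0.2, 0.3), 5: (-0.1, 0.0)}
# "Move block 4 above block 5" -> start object 4, anchor 5, offset "top".
print(predict_positions(positions, 4, 5, "top", step=0.1))
```

Because the object ids and the offset are chosen from the text alone, the lookup by id fails to disambiguate scenes with several instances of the same object, which is the limitation noted above.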
V-D Results for Language Network
TABLE I: Comparison of the proposed approach with the baselines (columns: MSE and TA (%), reported for each evaluation setting).
Better generalisation with Attention: The language network is trained on the Blocks dataset using the Adam optimizer with a mean absolute error loss, a learning rate of 1e-3, and a weight decay of 1e-9. A few sample predictions of the Lang-UNet model (BiLSTM model with attention) are visualized in Figs. 4 and 5. For one example, the attention weights for the different tokens of the instruction when predicting the start, anchor, and offsets are shown in Fig. 6. Table I compares the performance of the proposed approach with the baselines. We observe that the attention-based models (LSTM-Attn-NoGround and Lang-UNet) perform significantly better than the RNN-NoAttn-NoGround model, which has no attention component, especially for end co-ordinate prediction. Fig. 6 shows that the attention mechanism is able to attend to the correct offset and target block, which intuitively explains the improved performance for end co-ordinate prediction. Note that the accuracy of predicting the end location is worse than for the start location. This is because in most textual instructions, the start location is simple and unambiguous (“place block 4…” or “pick up block 3…”), whereas the target is more complex (last example in Fig. 4) and sometimes ambiguous (“the 14th block moved next to the 12th block”). It also suggests why attention is more important for predicting the end location than the start location in the case of the Blocks dataset.
Mitigating the effect of bias through Attention: We noticed that the Blocks dataset is biased, with some block numbers being associated more frequently with some offsets (such as “north of”) than with others. Because of this, the RNN-NoAttn-NoGround model overfits: it always predicts the same offset when certain block numbers are present in the instruction and ignores the actual content of the text. In contrast, the LSTM-Attn-NoGround model and the Lang-UNet are forced to attend to the offset token in the instruction and get it right. For example, all the models correctly predict the output for “Move block 4 above block 5”. But when the block number is changed, as in “Move block 11 above block 5”, only Lang-UNet and LSTM-Attn-NoGround make the correct prediction. To quantify this, we selected 20 simple examples such as the one above from the validation set. The Lang-UNet and RNN-NoAttn-NoGround models correctly predicted 19 and 19 examples respectively. But when we randomized the block numbers in those instructions, the numbers of correct predictions were 19 and 6 respectively. Attention was thus necessary to retain performance, which demonstrates its usefulness on such biased datasets.
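The randomization used in this check can be sketched as below. This is a hypothetical reconstruction: it assumes block numbers appear as digits in the instruction and that there are 20 blocks, as in the Blocks dataset; the function name is illustrative, and the corresponding update of the gold start and end positions is omitted.

```python
import random
import re


def randomize_block_numbers(instruction, rng, num_blocks=20):
    """Replace each distinct block number with a randomly drawn one.

    Distinct numbers stay distinct after replacement, and repeated mentions
    of the same block are replaced consistently. A drawn number may happen
    to coincide with the original one.
    """
    numbers = sorted(set(re.findall(r"\d+", instruction)), key=int)
    drawn = rng.sample(range(1, num_blocks + 1), len(numbers))
    mapping = {old: str(new) for old, new in zip(numbers, drawn)}
    return re.sub(r"\d+", lambda m: mapping[m.group()], instruction)


rng = random.Random(0)
print(randomize_block_numbers("Move block 4 above block 5", rng))
```

Since the offset word ("above") is left untouched, a model that reads the offset from the text should be unaffected, while a model that has memorized number-offset co-occurrences should degrade.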
Visual grounding is essential for diverse and complex data: We note that the ungrounded models, which use the natural language instruction alone and do not use object positions, perform reasonably on the Blocks dataset. However, they perform poorly on the synthetic dataset because it is not possible to predict the correct position for an instruction such as “Pick the apple to the left of the orange” without actually using the object positions. The proposed Lang-UNet performs well on both the Blocks and the synthetic dataset, but slightly underperforms RNN-NoAttn-NoGround on the Blocks dataset due to the quantization in representing the object positions. Moreover, the LSTM-Attn-NoGround and RNN-NoAttn-NoGround models use hard-coded offsets (in both X and Y directions) that are added to the position of the anchor object to predict the end position and are thus specialized to the Blocks dataset, whereas the proposed Lang-UNet does not use such hard-coded offsets.
Benefits of Grid representation over List: The Lang-FCNet takes the output from the localisation network as a list of 2D points, whereas the Lang-UNet considers a 2D binary grid representation as explained in Sec. IV. Fig. 2 depicts the differences between the output formats from the localisation network. On examples such as “Pick up the banana from the row of bananas”, Lang-UNet is successful in predicting the location of the banana because the convolutional layers in the U-Net help in recognizing “rows”, whereas Lang-FCNet performs poorly. From Table I, we infer that the performance improvement in Lang-UNet over Lang-FCNet, particularly for the synthetic dataset, is due to the binary grid representation because it provides the model a better way to understand the object positions and the relative spatial relationships amongst each other compared to a list of 2D coordinates. Our empirical evaluation in Table I also suggests the superiority of the pipelined approaches (Lang-FCNet and Lang-UNet) over the end-to-end models (LSTM+CNN and LSTM+CNN+SA).
V-E Demonstration on the Robot Arm
We demonstrate the complete pipeline using a Dobot Magician robot arm (Fig. 1). Playing cards are placed at random positions in front of the robot. The positions of the cards are obtained using an object detector that is fine-tuned to detect playing cards. Based on the instruction, the robot picks and places a card. Out of 15 trials, the robot successfully picks the right card in all the trials. In 14 cases, the card is placed within 1 cm of the target. In one case, the localization of the anchor is off by more than 1 cm. A video of the robot in operation is available at https://youtu.be/UMPpDt0mwIg.
In this paper, we have illustrated the advantages of a pipelined approach to manipulating objects based on natural language instructions. We propose having a separately trained object detector followed by a language network that is responsible for predicting the start and end positions to pick-and-place objects based on the natural language instruction. We show that representing the positions of the detected objects on a 2D binary grid and processing them with a convolutional hourglass network results in much better performance than representing them as a list of 2D co-ordinates and processing with fully connected layers. We also show that attention improves the generalization, especially when the training data is biased.
We would like to thank the Robert Bosch Center for CyberPhysical Systems for funding support.
-  (2018) Don’t just assume; look and answer: overcoming priors for visual question answering. In , pp. 4971–4980. Cited by: §II.
-  (2018) Vision-and-Language Navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II, §II.
-  Alignment-based compositional semantics for instruction following. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal. Cited by: §II.
-  (2017) Contextual awareness: understanding monologic natural language instructions for autonomous robots. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pp. 502–509. Cited by: §II.
-  Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics 1. Cited by: §II.
-  (2017) Accurately and efficiently interpreting human-robot instructions of varying granularities. Robotics: Science and Systems Foundation. Cited by: §II.
-  (2015) Neural machine translation by jointly learning to align and translate. ICLR. Cited by: §II.
-  Towards a dataset for human computer communication via grounded language acquisition. In Workshops at the Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §I, §II, §V-A.
-  (2016-06) Natural language communication with robots. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California. Cited by: §I, §I, §II, §V-C, §V-C, §V-C, TABLE I.
-  (2019) Learning to map natural language instructions to physical quadcopter control using simulated flight. Conference on Robot Learning (CoRL). Cited by: §II.
-  (2009) Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pp. 82–90. Cited by: §II, §II.
-  (2019) Enabling robots to understand incomplete natural language instructions using commonsense reasoning. arXiv preprint arXiv:1904.12907. Cited by: §II.
-  (2018) Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2054–2063. Cited by: §II.
-  (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §IV.
-  (2019) Learning to generate unambiguous spatial referring expressions for real-world environments. arXiv preprint arXiv:1904.07165. Cited by: §II.
-  (2009) What to do and how to do it: translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In 2009 IEEE International Conference on Robotics and Automation, pp. 4163–4168. Cited by: §II.
-  (2018) Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, pp. 3314–3325. Cited by: §II, §II.
-  (2016) Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4555–4564. Cited by: §II.
-  (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: §V-A, §V-C, §V-C, §V-C.
-  Human-robot communication and machine learning. Applied Artificial Intelligence 11 (7), pp. 719–746. Cited by: §II.
-  (2020) Reinforcement learning with augmented data. arXiv preprint arXiv:2004.14990. Cited by: §II.
-  (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Cited by: §V-E.
-  (2018) Neural baby talk. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7219–7228. Cited by: §II.
-  Self-monitoring navigation agent via auxiliary progress estimation. ICLR. Cited by: §II.
-  (2006) Walk the talk: connecting language, knowledge, and action in route instructions. Association for the Advancement of Artificial Intelligence (AAAI). Cited by: §II.
-  (2015) A review of verbal and non-verbal human–robot interactive communication. Robotics and Autonomous Systems 63, pp. 22–35. Cited by: §II.
-  (2016) Listen, attend, and walk: neural mapping of navigational instructions to action sequences. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §II.
-  (2018) Mapping instructions to actions in 3D environments with visual goal prediction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Cited by: §II, §II.
-  (2017-09) Mapping instructions and visual observations to actions with reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. Cited by: §II.
-  (2016) Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators. Robotics: Science and Systems Foundation. Cited by: §II.
-  (2017) Communication with robots using multilayer recurrent networks. In Proceedings of the First Workshop on Language Grounding for Robotics, pp. 44–48. Cited by: §II.
-  (2018) Interactive visual grounding of referring expressions for human-robot interaction. arXiv preprint arXiv:1806.03831. Cited by: §II.
-  (2011) Understanding natural language commands for robotic navigation and mobile manipulation. In Twenty-fifth AAAI conference on artificial intelligence, Cited by: §II.
-  (2019) Improving grounded natural language understanding through human-robot dialog. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6934–6941. Cited by: §II.
-  (2010) Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 806–814. Cited by: §II, §II.
-  (2018) Look before you leap: bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 37–53. Cited by: §II.