Learning from Implicit Information in Natural Language Instructions for Robotic Manipulations

04/30/2019 ∙ by Ozan Arkan Can, et al. ∙ Koç University 0

Human-robot interaction often occurs in the form of instructions given from a human to a robot. For a robot to successfully follow instructions, a common representation of the world and objects in it should be shared between humans and the robot so that the instructions can be grounded. Achieving this representation can be done via learning, where both the world representation and the language grounding are learned simultaneously. However, in robotics this can be a difficult task due to the cost and scarcity of data. In this paper, we tackle the problem by separately learning the world representation of the robot and the language grounding. While this approach can address the challenges in getting sufficient data, it may give rise to inconsistencies between both learned components. Therefore, we further propose Bayesian learning to resolve such inconsistencies between the natural language grounding and a robot's world representation by exploiting spatio-relational information that is implicitly present in instructions given by a human. Moreover, we demonstrate the feasibility of our approach on a scenario involving a robotic arm in the physical world.




1 Introduction

Consider yourself standing in your kitchen and having your robot assist you in preparing tonight’s meal. You then give it the instruction: ‘fetch the bowl next to the bread knife!’. For the robot to correctly perform your intended instruction, which is grounded in your world representation, it must correctly ground your natural language instruction into its own world representation.

This small scenario already introduces the two key components of language grounding in robotics: the construction of a world representation from sensor data and the grounding of natural language into the constructed representation. Ideally, these two components would be learned jointly Hu et al. (2017a); Johnson et al. (2017); Santoro et al. (2017); Hudson and Manning (2018); Perez et al. (2018). However, the scarcity of data makes this approach impractical: the millions of data points necessary for state-of-the-art joint computer vision and natural language processing simply do not exist for robotics. We opt, therefore, to learn the world representation component and the language grounding component separately.

One approach for constructing a world representation of a robot is through so-called perceptual anchoring. Perceptual anchoring handles the problem of creating and maintaining, over time, the correspondence between symbols in a constructed world model and perceptual data that refer to the same physical object Coradeschi and Saffiotti (2000). In this work, we use sensor driven bottom-up anchoring Loutfi et al. (2005), whereby anchors (symbolic representations of objects) can be created by perceptual observations derived directly from the input sensory data. When modeling a scene, based on visual sensor data, through object anchoring, noise and uncertainties will inevitably be present. This leads, for example, to a green ’apple’ object being incorrectly anchored as a ’pear’.

For the language grounding, we opt to perform the learning on synthetic data that simulates the world represented as anchors. This means that we do not ground the language using sensor data as signal but a symbolic representation of the world. During training these symbols are synthetic and simulated, and during the deployment of the language grounding these are anchors provided by an anchoring system. As the real world is inherently relational and as natural language instructions are often given in terms of spatial relations as well, the learned language grounder must also be able to ground spatial language such as ‘next to’.

As a result of learning the construction of a world model and the language grounding separately, contradictions arise between the world representations of a human and a robot. The supervision that an instruction would give to a robot is not present when learning the representation of the world of a robot. These inconsistencies then propagate through to inconsistencies between the instructions a human gives to a robot and the robot’s world model. To ensure that a robot is able to correctly carry out an instruction, such inconsistencies must be resolved and the world model of the robot be matched to the world model of the human.

This is not the first paper that tackles the problem of belief revision in robotics. However, prior work Tellex et al. (2013); Thomason et al. (2015); She and Chai (2017), with the notable exception of Mast et al. (2016), relied on explicit information transfer between humans and robots when inconsistencies arose in grounded language and the robot’s world representation. An example would be a robot asking clarification questions until it is clear what the human meant Tellex et al. (2013).

We propose an approach that probabilistically reasons over the grounding of an instruction and a robot's world representation in order to perform Bayesian learning, updating the world representation given the grounding. This is closely related to the work of Mast et al. (2016), who also deploy a Bayesian learning approach. The key difference, however, is that they do not learn the language component but ground a description of a scene by relying on a predefined language grounding model. We demonstrate the validity of our approach for reconciling instructions and world representations on a showcase scenario involving a camera, a robot arm and a natural language interface.

2 Preliminaries

The overarching objective of our system is to plan and execute robot manipulation actions based on natural language instructions. This presupposes, in the first place, that both the planner of the robot manipulator and the natural language grounder (cf. Section 2.2) share a semantically rich, object-centered model of the perceived environment, i.e., a semantic world model Elfring et al. (2013).

2.1 Visual Object Anchoring

In order to model a semantic object-centered representation of the external environment, we rely upon the notions and definitions found within the concept of perceptual anchoring Coradeschi and Saffiotti (2000). Following the approach for sensor-driven bottom-up acquisition of perceptual data described by Persson et al. (2019), the anchoring procedure is initially triggered by sensory input data provided by a Kinect2 RGB-D sensor. Each frame of input RGB-D data is subsequently processed by a perceptual system, which exploits both the visual and the depth information in order to: 1) detect and segment the subsets of data (referred to as percepts) that originate from single individual objects in the physical world, and 2) measure attribute values for each segmented percept, e.g., a position attribute measured as the geometrical center of an object, or a visual color attribute measured as a color histogram (in HSV color space).

The percept-symbol correspondence is, thereafter, established by a symbolic system, which handles the grounding of measured attribute values to corresponding predicate symbols through the use of predicate grounding relations, e.g., a certain peak in a color histogram, measured as a color attribute, is mapped to a corresponding predicate symbol 'red'. In addition, we use an object classification procedure to semantically categorize and label each perceived object. The convolutional neural network (CNN) architecture that we use for this purpose is based on the GoogLeNet model Szegedy et al. (2015), which we have trained and fine-tuned on object categories that can be expected in a kitchen domain.
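As a sketch of such a predicate grounding relation, the mapping from a measured hue histogram to a color symbol might look as follows. The bin boundaries and symbol set here are illustrative assumptions, not the system's actual grounding relations:

```python
import numpy as np

# Illustrative color predicates for coarse hue ranges (degrees in HSV space);
# the actual predicate grounding relations of the system are not specified here.
HUE_BINS = {
    "red": (0, 15), "orange": (15, 35), "yellow": (35, 65),
    "green": (65, 150), "blue": (150, 250), "purple": (250, 330),
}

def ground_color(hue_histogram):
    """Map the dominant peak of a 360-bin hue histogram to a color
    predicate symbol."""
    peak = int(np.argmax(hue_histogram))
    for symbol, (lo, hi) in HUE_BINS.items():
        if lo <= peak < hi:
            return symbol
    return "red"  # hue wraps around: 330-360 degrees is red again
```

A measured histogram peaking around 100 degrees would thus be grounded to the symbol 'green'.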

The extracted perceptual and symbolic information for each perceived object is then encapsulated in an internal data structure, called an anchor, indexed by time and identified by a unique identifier (e.g. 'mug-2', 'apple-4', etc.). The goal of an anchoring system is to manage these anchors based on the result of a matching function that compares the attribute values of an unknown candidate object against the attribute values of all previously maintained anchors. Anchors are then either created or maintained through two general functionalities:

  • Acquire – initiates a new anchor whenever a candidate object is received that does not match any existing anchor.

  • Re-acquire – extends the definition of a matching anchor from a previous time step to the current one. This functionality assures that the percepts pointed to by the anchor are the most recent perceptual (and consequently also symbolic) representation of the object.

However, comparing attribute values of anchored objects and percepts by some distance measure and deciding, based on the measure, whether an unknown object has previously been perceived or not is a non-trivial task. Nevertheless, since anchors are created or maintained through either one of the two principal functionalities acquire and re-acquire, it is evident that the desired outcome for the combined compared values is a binary output, i.e. should a percept be acquired or re-acquired. In previous work on anchoring Persson et al. (2019), we have therefore suggested that the problem of invoking a correct anchoring functionality is a problem that can be approximated through learning from examples and the use of classification algorithms. For this work, we follow the same approach.
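A minimal sketch of this acquire/re-acquire logic, with a toy distance-based matching function standing in for the learned classifier (function and attribute names are ours):

```python
import math
from dataclasses import dataclass
from itertools import count

_ids = count(1)  # running counter for fresh anchor identifiers

@dataclass
class Anchor:
    identifier: str   # unique symbol, e.g. 'mug-2'
    attributes: dict  # latest measured attribute values
    last_seen: float  # time of the most recent matching percept

def match_score(candidate, anchor):
    # Toy similarity based on position distance; the paper instead
    # *learns* this decision with a classifier (Persson et al., 2019).
    diffs = (a - b for a, b in zip(candidate["position"],
                                   anchor.attributes["position"]))
    return math.exp(-sum(d * d for d in diffs))

def process_percept(candidate, anchors, t, threshold=0.5):
    """Invoke acquire or re-acquire for one candidate object."""
    best = max(anchors, key=lambda a: match_score(candidate, a), default=None)
    if best is not None and match_score(candidate, best) >= threshold:
        best.attributes.update(candidate)  # re-acquire: refresh the anchor
        best.last_seen = t
        return best
    anchor = Anchor(f"object-{next(_ids)}", dict(candidate), t)  # acquire
    anchors.append(anchor)
    return anchor
```

A percept close to a maintained anchor refreshes it; a percept far from all anchors triggers the creation of a new one.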

2.2 Natural Language Grounding

Figure 1: Demonstration of the language grounding process for the instruction "pick up the apple to the right of the black mug". The anchoring system sends a snapshot of the anchors (1). A preprocessor then transforms the anchors into a grid representation on which the language grounding system operates (2). The parser parses the given instruction and generates a computation graph that specifies the execution order of the neural modules (3). Finally, the neural modules are executed according to the computation graph to produce the action (4).

In this study, we focus on understanding spatial language that includes pick-up- and place-related verbs, and referring expressions. An instruction refers to a target object using its representative features (e.g. color, shape, size). If a noun phrase does not resolve the ambiguity in the world, the instruction resolves it by specifying the target object through its position relative to other surrounding objects. This hierarchy first brings the attention to finding a unique object, then shifts the attention to the targeted object. Based on this idea, we model the language grounding process as controlling attention over the world representation, adapting the neural module networks approach proposed by Andreas et al. (2016a).

Our natural language grounder has three components: a preprocessor, an instruction parser and a program executor. Given specific anchor information (Figure 1 – № 1), the preprocessor transforms the anchor information into an intermediate representation in grid form (Figure 1 – № 2). The instruction parser produces a computational program by exploiting the syntactic representation (Figure 1 – № 3) of the instruction with a dependency parser111https://spacy.io/. The program executor runs (Figure 1 – № 4) the program on the intermediate representation to produce commands.


The anchoring framework maintains the object descriptions predicted from the raw visual input. To be able to ground the language onto those descriptions, we map the available information (object class, color, size and shape attributes) to a 4D grid representation. We represent each anchor as a multi-hot vector and assign this vector to the cell into which the real-world coordinates of the object fall.
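A hedged sketch of this preprocessing step, in which the grid resolution, cell size and vocabulary layout are illustrative assumptions:

```python
import numpy as np

# Illustrative sizes: 102 nouns + 26 adjectives as multi-hot features,
# a 10 x 10 x 2 spatial grid, and 10 cm cells (all assumptions).
N_FEATURES = 102 + 26
GRID = (10, 10, 2)

def anchors_to_grid(anchors, feature_index, cell_size=0.1):
    """Preprocessor: place each anchor's multi-hot symbol vector into
    the grid cell that its real-world position falls into."""
    grid = np.zeros(GRID + (N_FEATURES,), dtype=np.float32)
    for anchor in anchors:
        x, y, z = anchor["position"]
        i, j, k = int(x / cell_size), int(y / cell_size), int(z / cell_size)
        for symbol in anchor["symbols"]:  # e.g. ['mug', 'black', 'small']
            grid[i, j, k, feature_index[symbol]] = 1.0
    return grid
```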

Program Executor. The program generated by the parser is a collection of neural components that are linked to each other depending on the computation graph. The design of neural components reflects our intuition about the attention control. A Detect module is a convolutional neural network with a learnable filter that captures a noun or an adjective. This module creates an attention map over the input grid.


The Detect module operates on the original grid input tensor, whose first three dimensions are spatial and whose fourth dimension holds the feature vector of length d. The module convolves this tensor with a learnable filter, adds a bias, and produces the attention map via the convolution over the spatial dimensions.

Although a Detect module can capture the meaning of a noun phrase (e.g., red book), the model cannot generalize to unseen compound words. To overcome this, we design the And module to compose the outputs of incoming modules. This module multiplies its inputs element-wise in order to calculate the composition of words (e.g., the big red book). Since the incoming inputs are attention maps over the grid world, an And module produces a new attention map by taking the conjunction of its inputs. In the following equation, ⊙ denotes element-wise multiplication.
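The two modules can be sketched as follows, with Detect reduced to a per-cell dot product (a 1×1×1 convolution) with a word filter followed by a sigmoid; the filter weights shown are placeholders, not learned values:

```python
import numpy as np

def detect(grid, word_filter, bias=0.0):
    """Detect module: a per-cell dot product of the feature vector with
    a word filter (equivalent to a 1x1x1 convolution), squashed by a
    sigmoid into an attention map over grid cells."""
    logits = np.tensordot(grid, word_filter, axes=([3], [0])) + bias
    return 1.0 / (1.0 + np.exp(-logits))

def and_module(*attention_maps):
    """And module: element-wise product of attention maps, i.e. the
    conjunction of e.g. 'red' and 'book'."""
    out = attention_maps[0]
    for a in attention_maps[1:]:
        out = out * a
    return out
```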


The output of a subgraph for a noun phrase is an attention map that highlights the positions of the corresponding objects. A Shift module shifts this attention in the direction of the preposition that the module represents. This module is also a convolutional neural network, similar to a Detect module; however, it remaps the attention instead of capturing patterns in the grid world.


The Shift module operates on an incoming attention map over the spatial dimensions of the grid, convolving it with a learnable filter. We pad the input to be able to perform the shifting operation over the whole grid; the pad size is the same as the input size.
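The shifting operation can be sketched as a padded 2-D correlation (reduced to two spatial dimensions for brevity; the kernel here is hand-crafted, whereas the system learns it):

```python
import numpy as np

def shift(attention, kernel):
    """Shift module as a padded 2-D correlation: a kernel whose mass
    lies left of center moves attention one cell to the right, etc.
    Pad size equals the input size, so attention can be remapped
    across the whole grid."""
    h, w = attention.shape
    kh, kw = kernel.shape
    padded = np.pad(attention, ((h, h), (w, w)))
    out = np.zeros_like(attention)
    for i in range(h):
        for j in range(w):
            ci, cj = i + h, j + w  # window center in padded coordinates
            patch = padded[ci - kh // 2: ci + kh // 2 + 1,
                           cj - kw // 2: cj + kw // 2 + 1]
            out[i, j] = float((patch * kernel).sum())
    return out
```

For example, a 3×3 kernel with all its mass at the position left of center moves every active cell one step to the right.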

A Locate module takes an attention map and produces a probability distribution over cells, by applying a softmax classifier, for being the targeted object. We use the cell with the highest probability as the prediction. A Position module takes a source anchor, a preposition and a target anchor, and produces a real-world coordinate. It simply calculates the position, in the direction of the preposition from the target anchor, where the source anchor can fit.

Parser. We find the verbs in the instruction along with the subtrees attached to them. For each verb and its subtree, we search for the direct object of the verb. Then we build a subgraph for the direct object and its modifiers. Depending on the verb type, we build different subgraphs. If the verb is "pick up"-related, we look for the preposition that relates the given noun to another noun. If one is found, a subgraph is created for the preposition's object using the noun phrase that the object belongs to. Finally, the end point of the subgraph is combined with a Shift module. For each preposition object, we repeat the same process to handle chains of prepositional phrases.

Figure 2: A depiction of both used physical system setup (upper), as well as used software architecture (lower). The arrows represent the flow of data between the modules of the software architecture. Blue solid arrows and boxes illustrate the preliminary system (outlined in Section 2), while red dashed arrows and boxes illustrate the novel extension for reasoning about different symbolic label configurations (and hence resolving inconsistencies between language and perception), by using Bayesian learning (as presented in Section 4).

If the verb is "put"-related, we find the preposition that is linked to the verb and the object of that preposition. We build a subgraph that refers to the object of the preposition, similar to the "pick up" case. Finally, a Position module produces the coordinates at which to put the direct object, where the position is specified relative to the auxiliary objects.
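The graph construction for a "pick up" instruction can be sketched as follows, assuming the dependency parse has already been reduced to the direct object, its modifiers and an optional prepositional phrase (node names and tuple layout are illustrative; the real system walks a spaCy parse to extract these pieces):

```python
def noun_phrase_graph(noun, adjectives=()):
    """Detect nodes for the noun and each adjective, joined by And."""
    nodes = [("Detect", noun)] + [("Detect", adj) for adj in adjectives]
    return nodes[0] if len(nodes) == 1 else ("And", nodes)

def pick_up_graph(noun, adjectives=(), prep=None, referent=None):
    """Graph for the direct object; a prepositional phrase contributes
    a Shift over the referent's subgraph, conjoined with the object."""
    graph = noun_phrase_graph(noun, adjectives)
    if prep is not None:
        graph = ("And", [graph, ("Shift", prep,
                                 noun_phrase_graph(*referent))])
    return ("Locate", graph)
```

For "pick up the apple to the right of the black mug", this yields a Locate over the conjunction of Detect(apple) and a Shift(right-of) applied to the subgraph for "black mug".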

3 System Description

In the upper part of Figure 2, we illustrate our physical kitchen table setup, which consists of the following devices: 1) a Kinova Jaco light-weight manipulator Campeau-Lecours et al. (2019), 2) a Microsoft Kinect2 RGB-D sensor, and 3) a dedicated PC with an Intel Core i7-6700 processor and an Nvidia GeForce GTX 970 graphics card.

In addition, we have a modularized software architecture that utilizes the libraries and communication protocols available in the Robot Operating System (ROS)222http://www.ros.org/. Hence, each of the modules, illustrated in the lower part of Figure 2, consists of one or several individual subsystems (or ROS nodes). For example, the visual object anchoring module consists of the following subsystems: 1) a perceptual system, 2) a symbolic system, and 3) an anchoring system. For a seamless integration between software and hardware, we are further taking advantage of both the MoveIt! Motion Planning Framework333https://moveit.ros.org/, as well as the ROS-Kinect2 bridge developed by Wiedemeyer (2014 – 2015). The MoveIt! ”planning scene” of the action planner for the robot manipulator, as well as the grid world representation used by the language grounding system (cf. Section 2.2), are, subsequently, both populated by the same updating anchoring representations (cf. Section 2.1). Hence, the visual sensory input stream is indirectly mapped to both objects considered in the dialogue by the language grounder, as well as the objects upon which actions are executed.

4 Resolving Inconsistencies

Based purely on the perceptual input, the anchoring system produces a probability distribution over the possible labels for each anchor. We are now interested in the probability of a label for an anchor given a natural language instruction and the grounding of that instruction in the real world, i.e., the conditional probability of the label given the instruction and the grounding. We introduce, furthermore, the notion of a label configuration. This is easiest explained by an example: imagine having two anchors, each with two possible labels; then there are 2^2 = 4 possible label configurations. A label configuration is, hence, a label assignment to all the anchors present in the scene.

Now we need to transform this conditional probability into a function that is computable by the anchoring system and the language grounder. The first steps (Equations 4-6) are quite straightforward and follow basic probability calculus.


In Equation 6 we assume that the instruction and the grounding are conditionally independent of the label of an anchor given the label configuration. This can be seen in the following way. Imagine two anchors with two possible labels each. Given that we are in a specific label configuration, we immediately know which label each anchor has. This means that the probability of a label for an anchor is 1 if it matches the label in the configuration and 0 otherwise. This reasoning is independent of the grounding and the instruction.

We have now split up the labels (produced by the anchoring system) and the grounding into two factors, which can be calculated separately. The first one can be calculated as follows:


This is the product of the probabilities of the labels that constitute a label configuration, divided by the number of configurations. Assuming a uniform distribution over the label configurations (the division by the number of configurations) is equivalent to assuming that each possible label configuration is equally likely a priori. This means that we make no assumption about which class of objects occurs more regularly or which class of objects (of the 101 possible classes) occurs more often together with other classes of objects.

We now tackle the second factor in Equation 6. Equations 8-11 are again straightforward probability calculus. In Equation 12 we assume that the label configuration and the instruction are independent: their probabilities factorize. In Equation 13 the probabilities of the instruction cancel out, and we again assume a uniform distribution over the label configurations (cf. Equation 7). In Equation 14 we then have a numerator and denominator that are expressed in terms of the probability of the grounding given the instruction and a label configuration, which is exactly the function approximated by our neural language grounding system (cf. Section 2.2).


Plugging Equations 7 and 14 back into Equation 6 gives the learned probability of the label of an anchor given the instruction and the grounding of that instruction.
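A compact restatement of the factorization described above, with illustrative notation introduced only for this sketch (anchor label l_a, instruction I, grounding G, configuration c from the set of configurations C):

```latex
% Posterior over an anchor's label, marginalizing over configurations:
P(l_a \mid I, G) \;=\; \sum_{c \in \mathcal{C}} P(l_a \mid c)\, P(c \mid I, G),
\qquad
P(l_a \mid c) \;=\;
\begin{cases}
1 & \text{if } c \text{ assigns } l_a \text{ to anchor } a,\\
0 & \text{otherwise,}
\end{cases}

% with the configuration posterior factored into the anchoring prior
% and the grounder score:
P(c \mid I, G) \;\propto\; P(G \mid I, c)\, P(c),
\qquad
P(c) \;=\; \frac{1}{|\mathcal{C}|}\prod_{a} P\!\left(l_a^{c}\right).
```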


As mentioned in Section 2.1, the anchoring system encapsulates object categories, which means that it produces a categorical probability distribution over the 101 possible labels for each anchor. With only two anchors this already results in 101^2 = 10,201 different configurations. It is easy to see that computing the posterior (cf. Equation 15) suffers from this curse of dimensionality. Therefore, in the experiments we limited ourselves to the two labels with the highest probability per anchor. This gives 2^n possible configurations, with n being the number of anchors present.
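The resulting computation over the reduced configuration space can be sketched as follows. The `grounder` interface is an assumption standing in for the neural language grounder of Section 2.2, which scores a grounding given the instruction and a configuration:

```python
import itertools
import numpy as np

def label_posterior(anchor_label_probs, grounder, instruction):
    """Posterior over each anchor's label, marginalizing over all label
    configurations built from the top-2 labels per anchor.
    `grounder(instruction, labels)` returns the grounding score for a
    configuration; `anchor_label_probs` is one label->probability dict
    per anchor, as produced by the anchoring system."""
    top2 = [sorted(p.items(), key=lambda kv: -kv[1])[:2]
            for p in anchor_label_probs]
    posterior = [dict() for _ in top2]
    z = 0.0
    for config in itertools.product(*top2):        # 2^n configurations
        prior = float(np.prod([p for _, p in config]))
        score = prior * grounder(instruction, tuple(l for l, _ in config))
        z += score
        for a, (label, _) in enumerate(config):
            posterior[a][label] = posterior[a].get(label, 0.0) + score
    return [{label: s / z for label, s in p.items()} for p in posterior]
```

An instruction that grounds well only under the label 'mug' can thus overturn an anchor's top perceptual label, as in the showcase of Section 6.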

5 Experiments

5.1 Synthetic Data

The data-demanding nature of neural networks requires large amounts of data to generalize well. Artificial data generation is one way of producing such datasets Andreas et al. (2016b); Kuhnle and Copestake (2017); Johnson et al. (2016). We therefore designed a series of artificial learning tasks before applying the model to a real-world problem. In each task, we generate a random grid world that provides the complexity and ambiguity fitting the scenario. First, an object is placed on the grid world and decorated with random attributes as the target object. Then, depending on the scenario, an auxiliary object and distractors (objects that have attributes similar to those of the target object) are placed on the grid world. We also generate objects that are unrelated to the target (or auxiliary object) to introduce additional noise. We limit the total number of objects, and fix the number of distractors, in the experiments. Finally, we generate the ground-truth computation graph for composing the neural modules. We list the scenarios below in increasing order of difficulty (i.e., a combination of the ambiguity present in the grid world and the number of language components involved).

  1. Using the name of a targeted object in the instruction is enough to localize the targeted object.

  2. There is more than one object that shares a category with the targeted object. To solve the ambiguity, one or more discriminative adjectives are used.

  3. The same world configuration as the second one. To solve the ambiguity, the object is described with a prepositional phrase that utilizes a single referent object.

  4. The same world configuration as the third. Adjectives are used to describe a targeted object in addition to a prepositional phrase. In this case, adjectives are unnecessary, but the scenario measures whether additional components bring noise or not.

  5. All other objects that share a category with the targeted object also have the same set of features as the targeted object. Hence, the targeted object is only distinguishable by its position. To solve the ambiguity, the object is described with a prepositional phrase that utilizes a referent object along with the necessary adjectives.

  6. A scenario chosen at random from the above list.

5.2 Training

Figure 3: Learning curve of the neural modules.

To be able to measure the compositionality of the learned modules, we have two different settings for data generation. For training, we constrain 75% of the possible attributes for an object class, as well as the locations on the grid world at which an instance of that class can appear. During testing, we use unconstrained samples generated for the same scenario. This way, we can evaluate whether the model infers unseen word compositions, e.g. inferring red mug after seeing red book and black mug at training time. We follow a curriculum schema to train our modules. Starting from the first scenario described in Section 5.1, we train the model on a stream of constrained, randomly generated samples. We evaluate the model periodically on unconstrained samples generated for each period and continue training until the moving-average error on the test data falls under a threshold (1e-5 in our experiments). We then continue to train the model for the next scenario using the learned weights. We set the number of nouns, adjectives and prepositions to 102, 26, and 27, respectively, to match the anchoring system. We use Adam Kingma and Ba (2014) with default parameters for the optimization.
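The curriculum loop can be sketched as follows; the model, batch and error interfaces are placeholders for the actual training machinery:

```python
from collections import deque

def curriculum_train(model, scenarios, make_batch, eval_error,
                     threshold=1e-5, window=100):
    """Train scenario by scenario, advancing only once the moving-average
    error on unconstrained test samples drops below the threshold
    (1e-5 in our experiments); weights carry over between scenarios."""
    for scenario in scenarios:
        errors = deque(maxlen=window)
        while True:
            model.step(make_batch(scenario, constrained=True))   # train step
            errors.append(eval_error(model,
                                     make_batch(scenario, constrained=False)))
            if len(errors) == window and sum(errors) / window < threshold:
                break  # advance to the next, harder scenario
```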

Figure 3 presents the learning curve of the model. The third graph (yellow) demonstrates that learning prepositions requires more data than learning nouns (first graph) or adjectives (second graph). The reason for this behavior is twofold. First, the Shift modules have more weights to learn than the Detect modules. Second, while the Detect modules have a one-to-one mapping between input and output, the Shift modules have a many-to-many relation: there might be more than one active area in the input of a Shift module. Since it needs to remap highlighted areas on the grid to other areas, it needs to see examples occurring in different parts of the grid world in order to learn to ignore the position of the active area.

Figure 4: We give the robot the instruction "pick up the ball in front of the can". The robot executes the action and waits for further instructions. We then give the instruction to "drop it in front of the mug". The problem in this step is that there is no object classified as 'mug', i.e., none of the objects has 'mug' as its highest-probability label. We correct for this through probabilistic reasoning over not only the top label for each object but a number of top-ranked labels per object. This allows the anchoring system to correct its classification of an object based on what we as humans think an object is. Given the instruction, the anchoring system re-classifies the black object from 'pot' to 'mug'. The instruction is then successfully carried out. The recorded video can be found here: https://vimeo.com/302072685.

The remaining graphs show the effectiveness of our design in composing learned modules. Since we do not train any module from scratch, we can handle the composition of nouns, adjectives and prepositions effectively. Since scenario 4 is the first time we train all components together, its training requires more data than one would expect when compared with graphs 5 (orange) and 6 (cyan).

6 Showcase

We now proceed with a demonstration of the integrated system: we have a Kinect camera that observes the world, the anchoring representation that builds up a representation of the world based on the raw image data, the language grounder that takes as input a natural language instruction and a probabilistic reasoning component that resolves possible inconsistencies between the robot’s world representation and the instruction.

The physical setup is identical to the one depicted in Figure 2: the robot arm is mounted on the opposite side of the kitchen table from the Kinect camera. The natural language instruction is passed to the language grounder via an instruction prompt. In each of the four panels in Figure 4, the instruction prompt is seen at the bottom as a rectangular box. We further describe the scenario in the caption of Figure 4.

7 Related Work

Our work is related to two research domains: modular neural networks for language grounding, and human-robot interaction for handling ambiguities in one or more modalities. Andreas et al. (2016b, a) introduced neural module networks for visual question answering. Johnson et al. (2017); Hu et al. (2017a) developed policy-gradient-based approaches that learn to generate layouts instead of using a dependency-parser-based method. Hu et al. (2017b); Yu et al. (2018); Cirik et al. (2018) applied the modular neural network approach to the referring expression understanding task. Das et al. (2018) demonstrated the use of neural module networks for decision making in a simulated environment. To our knowledge, the present study is the first to use the neural module networks approach in a real-world robotic setting.

Learning from human interaction has been extensively studied. Lemaignan et al. (2011, 2012) developed a cognitive architecture that makes decisions by using symbolic information provided as (pre-defined) facts or extended via human-robot dialogues. Compared with our system, theirs neither operates on the sensory input nor deals with uncertainty in the world. Tellex et al. (2013) proposed a system that asks questions to disambiguate instructions. The robot determines the most ambiguous part of the command, based on a metric derived from entropy, and asks questions about it to reduce the uncertainty. They update the generalized grounding graph Kollar et al. (2013) with answers obtained from the user and use these to perform inference. In contrast, we fix the ambiguity present in the perceptual data. She and Chai (2017) proposed a system that learns to ask questions while learning verb semantics. They work in the Tell me Dave environment Misra et al. (2014), representing the environment as grounded state fluents (i.e. a weighted logic representation). In their work, language grounding is modeled as the difference between the before and after states of an action sequence. They modeled the interactive learning as an MDP and solved it with reinforcement learning.

Thomason et al. (2015) proposed a system that learns the meaning of natural language commands through human-robot dialog. They represent the meaning of instructions with a λ-calculus semantic representation. Their semantic parser starts with initial knowledge and learns through training examples generated by the human-robot conversations. Their dialog manager is a static policy that generates questions from a discrete set of action, patient, recipient tuples. Padmakumar et al. (2017) improved on this work with a learnable dialog manager, training both the dialog manager and the semantic parser with reinforcement learning. This approach was further extended in Thomason et al. (2019), where the authors combine the approaches of Thomason et al. (2015) and Thomason et al. (2017) to obtain a system that is capable of concept acquisition through clarification dialogues. Instead of asking questions, we implicitly fix the perception with the information hidden in instructions. A further difference to these works is that we learn the language component in a simulated offline step, whereas they deploy active online learning, starting from a limited initial vocabulary.

This is also related to the work of Perera and Allen, who present a system that tries to emulate child language learning strategies by describing scenes to a robot agent, which has to actively learn new concepts. The authors deploy probabilistic reasoning to manage erroneous sensor readings in the vision system. Apart from the active learning approach, there is also a conceptual difference: in our work, we do not consider discrepancies between the perceptual system (anchoring) and the language grounder as errors in the perceptual system, but simply as different models of the world.

(This view taps into the philosophical question of whether one can ever truly know the nature of an object, cf. the thing-in-itself Kant (1878), for which we omit a discussion.)

As mentioned in Section 1, the work closest to our approach is presented in Mast et al. (2016). The authors base their work on geometric conceptual spaces Gärdenfors (2004), which situates it in the sub-domain of top-down anchoring Coradeschi and Saffiotti (2000). The geometric conceptual spaces induce a probabilistic, model-based language grounder. This enables a robot to reason probabilistically over a description of a scene given by another agent, and to single out the object that is most likely being referred to. In contrast, we present an approach that performs Bayesian learning over a learned language grounding model and a bottom-up anchoring approach.

8 Conclusions and Future Work

We introduced the problem of belief revision in robotics based solely on the implicit information available in natural language, in the setting of sensor-driven bottom-up anchoring combined with a learned language grounding model. This is in contrast to prior works, which either rely on explicit information or are based on top-down anchoring. We proposed a Bayesian learning approach to solve the problem and demonstrated its validity in a real-world showcase involving computer vision, natural language grounding and robotic manipulation.

In future work we would like to perform a more quantitative analysis of our approach to which end it is imperative to circumvent the curse of dimensionality emerging in the Bayesian learning step (cf. Equation 15). It would also be interesting to investigate whether our approach is amenable to natural language other than instructions.

A main limitation of our current approach is the limited size of the predefined vocabulary. It would be more practical if a robot were able to extend its vocabulary through interaction with a human, i.e. through dialogue. A possible solution would be to learn a probabilistic model (which resolves inconsistencies between language and vision) that takes into account the possibility of currently unknown vocabulary occurring. Such an approach would still allow us to learn the anchoring of objects and the language grounding separately, while learning a much richer model for resolving inconsistencies than the one described in this work.


Acknowledgements

This work has been supported by the ReGROUND project (http://reground.cs.kuleuven.be), which is a CHIST-ERA project funded by the EU H2020 framework program, the Research Foundation - Flanders, the Swedish Research Council (Vetenskapsrådet), and the Scientific and Technological Research Council of Turkey (TUBITAK). The work is also supported by Vetenskapsrådet under grant number 2016-05321 and by TUBITAK under grants 114E628 and 215E201.