Pay attention! - Robustifying a Deep Visuomotor Policy through Task-Focused Attention

09/26/2018, by Pooya Abolghasemi et al., University of Central Florida

Several recent projects demonstrated the promise of end-to-end learned deep visuomotor policies for robot manipulator control. Despite impressive progress, these systems are known to be vulnerable to physical disturbances, such as accidental or adversarial bumps that make them drop the manipulated object. They also tend to be distracted by visual disturbances such as objects moving in the robot's field of view, even if the disturbance does not physically prevent the execution of the task. In this paper we propose a technique for augmenting a deep visuomotor policy trained through demonstrations with task-focused attention. The manipulation task is specified with a natural language text such as "move the red bowl to the left". This allows the attention component to concentrate on the current object that the robot needs to manipulate. We show that even in benign environments, the task focused attention allows the policy to consistently outperform a variant with no attention mechanism. More importantly, the new policy is significantly more robust: it regularly recovers from severe physical disturbances (such as bumps causing it to drop the object) from which the unmodified policy almost never recovers. In addition, we show that the proposed policy performs correctly in the presence of a wide class of visual disturbances, exhibiting a behavior reminiscent of human selective attention experiments.


I Introduction

Several recent projects demonstrated the possibility of end-to-end training of deep visuomotor policies that perform object manipulation tasks such as pick-and-place, push-to-location, stacking and pouring. These systems perform all the components of the task (vision processing, grasp and trajectory planning, and robot control) using a deep neural network trained end-to-end by variations of deep reinforcement learning and learning from demonstration. While most demonstrations have been made in unstructured but relatively benign environments, our own experiments and personal communication with other researchers have shown that end-to-end trained visuomotor policies are highly vulnerable to physical and visual disturbances. An example of a physical disturbance is the robot arm being bumped such that it drops the manipulated object. The desired behavior would be for the robot to immediately notice this, change its trajectory, pick up the dropped object and continue with the manipulation task. Instead, with an otherwise reliably performing policy, we noticed situations where the robot arm, having lost the object, continued to go empty-handed through the full trajectory of the manipulation, recovering either much later, or not at all. A visual disturbance might involve distracting mobile objects appearing in the field of view of the robot. Clearly, if the visual disturbance prevents the execution of the task, for instance by blocking the view of the manipulated object, it is acceptable for the robot to stop or even cancel the manipulation. There are, however, visual disturbances that should not prevent the execution of the task: for instance, hands waving in the visual field of the robot but not covering the manipulated object or the robot arm. We have found that in the case of end-to-end learned policies, even such visual disturbances cause the robot to behave erratically, possibly because the robot interprets the situation as a state never encountered before.

In engineered robot architectures such problems can be dealt with by developing explicit models of the possible disturbances, which might allow the robot to reason around the situation. In deep learning systems, one possible brute-force solution is to gather more training data containing physical and visual disturbance events. In this paper we propose task-focused attention (TFA) as a technique for increasing the robustness of an end-to-end learned robot manipulation policy to physical and visual disturbances without the need for additional training data. The contributions of the paper are as follows:

  • We describe a novel architecture for a visuomotor policy, trained end-to-end from demonstrations, which features a task-focused visual attention system. The attention system is guided by a natural language description of the task and focuses on the currently manipulated object.

  • We show that, under benign conditions, the new policy outperforms a closely related baseline policy without the attention model over pick-up and push tasks using a variety of objects.

  • We show that in the case of a severe physical disturbance, when an external intervention causes the robot to miss the grasp or drop the already grasped object, the new policy recovers in the majority of situations, while the baseline policy almost never recovers.

  • We show that the task-focused attention allows the policy to ignore a large class of visual disturbances that disrupt the baseline policy. We show experimentally that the system exhibits the “invisible gorilla” phenomenon from the classic selective attention test.

  • The task-focused attention system can be trained offline, does not require additional training data and does not significantly increase the computational cost of the training of the controller.

II Related Work

A deep visuomotor policy for robotic manipulation transforms an input video stream (possibly combined with other sensory input) into robot commands by means of a single deep neural network. Such a system was first demonstrated in [1] using guided policy search, a method that transforms policy search into supervised learning, with supervision provided by a trajectory-centric reinforcement learning method. In recent years several alternative approaches have been proposed using variations of both deep reinforcement learning and deep learning from demonstration (as well as combinations of these).

Deep reinforcement learning is a powerful paradigm which, in applications where exploration can be performed in a simulated environment allowing millions of trial runs, can train systems that perform at superhuman level [2] even when no human knowledge is used for bootstrapping [3]. Unfortunately, for training visuomotor policies controlling real robots, it is very difficult to perform reinforcement runs on these scales. Even the most extensive projects could only collect several orders of magnitude fewer experiments: for example, in [4] 14 robotic manipulators were used over a period of two months to gather 800,000 grasp attempts. Even this number of experimental tries is unrealistic in many practical settings.

Thus, many efforts focus on reducing the number of experimental runs necessary to train an end-to-end visuomotor controller. One obvious direction is to learn a better encoding of the input data, which can improve the learning rate. In [5], a set of visual features was extracted from the image to be used as the state representation for a reinforcement learning algorithm.

Another direction involves the use of learning from demonstration instead of (or in combination with) reinforcement learning. The demonstrations can be performed in real [6] or simulated [7, 8] environments. Meta-learning [9] and related approaches promise to drastically lower the amount of training data needed to learn a specific task from a class of related tasks (possibly down to a single task-specific demonstration). However, they still require a costly meta-learning phase.

An approach that is similar to ours in objective but different in implementation is described in [10]. Considering manipulation tasks, the authors implement two layers of attention. The first, a task-independent visual attention, identifies, semantically labels and localizes objects in the scene. This labeling relies on training on an external labeled dataset, thus in this respect the approach is not “end-to-end”. The second, a task-specific attention, is learned by selecting, from the objects segmented by the task-independent attention, those that contribute the most to the correct prediction of the demonstrated trajectories.

Another question concerns the way in which the task is specified to the robot. Specifying the task in the form of a human readable sentence is a natural choice [11], as creating such a command is very easy for a human user. In the general case, however, translating a command to a task is not yet feasible with an end-to-end learned controller. In this paper, we assume the existence of the command, but only as an additional input that helps the creation of the task-focused attention. Alternative ways of specifying the task are possible. A purely visual specification was proposed in [12], where the user identifies a pixel in the image and specifies where it should be moved. A technique of control based on visual images was also demonstrated in [13].

Another component of our work has its roots in recent work on visual attention networks. These networks often appear as components of larger networks solving problems like image captioning [14, 15], visual question answering [16, 17, 18] or visual expression localization [19]. Although the applications are different, the role of attention networks, i.e., focusing on information-rich parts of the visual input, remains the same. Our proposed attention mechanism is most similar to [16]. However, in our model we train the attention network with a word-selection objective: the network must select regions of a frame, conditioned on a textual input, such that the words of the input sentence can be regenerated from the visual features of the selected regions alone.

III Approach

III-A Generic visuomotor policies with task-independent vision

Fig. 1: Generic architecture of a visuomotor policy for manipulator control with the vision component being independent of the task. The red arrow and red highlighted text show our proposed change: the vision module and the primary latent encoding become dependent on the task.

Deep visuomotor policies for manipulator control are neural network architectures that take as input an observation composed of an image and possibly other sensor data, together with a task (or goal) specification, and output robot commands. The robot executes these commands, enacting a change in the external environment that creates a new observation, and the cycle repeats. Architecturally, most currently proposed systems follow variations of the generic model of Figure 1, which posits the existence of a primary latent encoding, the result of processing the input with a specialized vision network. This encoding, of dimensionality orders of magnitude smaller than the input, is then used by the motor network to generate the command. Practical considerations often require that the vision net goes through extensive pre-training. A frequently encountered approach is to transfer-learn from a general purpose vision module such as VGG-19 trained on ImageNet. An alternative that uses only the training data collected from the robot environment itself is to custom-train a variational autoencoder [20] for the specific setting. The motor net often, but not always, contains a recurrent neural network and is trained on a loss that favors the execution of the specified task. This training might take several forms. In the case of reinforcement learning, the task specification must provide a reward function. If the task is specified by demonstrations, the training might be executed in a supervised fashion using a behavioral cloning loss.
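
To make this generic data flow concrete, the following is a minimal sketch in PyTorch; the module names, dimensions and the single-step control loop are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VisionNet(nn.Module):
    """Task-independent vision module: image -> primary latent encoding."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64, latent_dim)

    def forward(self, image):
        return self.fc(self.conv(image))

class MotorNet(nn.Module):
    """Recurrent motor module: latent encoding sequence -> joint commands."""
    def __init__(self, latent_dim=64, n_joints=7):
        super().__init__()
        self.lstm = nn.LSTM(latent_dim, 128, batch_first=True)
        self.head = nn.Linear(128, n_joints)

    def forward(self, z_seq):
        h, _ = self.lstm(z_seq)          # (batch, time, 128)
        return self.head(h)              # (batch, time, n_joints)

# One control step: observe, encode, act.
vision, motor = VisionNet(), MotorNet()
image = torch.rand(1, 3, 128, 128)       # current camera frame
z = vision(image).unsqueeze(1)           # (1, 1, latent_dim)
command = motor(z)                       # next joint command
```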

While this generic architecture covers a wide range of possible models, these models share the property that the primary latent encoding does not depend on the current task. Another way to put it: the way the network sees the world does not depend on its current task.

III-B Pay attention! Making vision dependent on the task

The principal idea of this paper is that performance benefits can be obtained if we make the vision system pay attention to the current task. Humans are known to exhibit selective attention: when observing a scene with a particular task in mind, features of the scene relevant to the task are given particular attention, while other features are de-emphasized or even ignored. This was illustrated in the famous experiments of Chabris and Simons [21]. Most human observers, when told to count the number of times a basketball was passed among a group of people in white and black T-shirts, succeed in the counting, but fail to notice a participant in a gorilla suit who enters the scene. The gorilla is immediately noticed if the participants are told to look for it, or by people who are not assigned the counting task. The invisible gorilla experiment demonstrates the role of task-defined selective attention in human perception and is often cited as an example of the limitations of human perception. However, we can also see it as an efficient way to optimize the information processing in the brain by focusing on items relevant to the current task. Note that selective attention is not selective blindness: the human perception system still sees the whole scene, although it concentrates on task-relevant features. Features such as a blinking red light would still be clearly noticed.

Our objective is thus to create a system that implements a selective attention similar to that of human perception: we want the robot to focus on the objects of the scene that are relevant to the current manipulation task. We conjecture that using this task-focused attention (TFA) will lead to a better representation of the objects that are the subject of the attention, allowing for more precise grasping and manipulation.

The changes highlighted in Figure 1 describe the proposed approach: the task representation is an input to the vision net as well, making the primary latent encoding dependent on the task.

IV A Teacher Network for TFA

Fig. 2: Examples of task-focused attention: (left) red plate, (center) blue box and (right) blue ring. Top row: original visual input; bottom row: the attention map applied to the original input.

We are considering robot manipulation commands expressed in natural language such as:

(a) Push the red plate to the left.

(b) Push the blue box to the left.

(c) Pick up the red ring.

The goal of the TFA is to identify the parts of the visual input where objects relevant to the task appear, that is, to focus the attention on the red plate, blue box and blue ring respectively (see Figure 2).

A TFA system could be trained as a supervised learning model if we could create a sufficient amount of training data. However, this would require us to label an unrealistically large number of input images with attention blobs. Our approach is instead to generate our own labels by implementing a teacher network that provides training data for the controller. Our approach fits the established technique of student-teacher network training models [22, 23, 24], with the qualification that the attention teacher only teaches one particular aspect of the final controller.

In the remainder of this section, we describe the implementation of a teacher network which can label the TFA as in Figure 2.

We divide the visual field into a grid of N regions. The visual attention we aim to obtain is a vector of scores with one value for each region: the higher the score, the more attention is paid to that region. In general, our goal is to focus the attention on a small number of regions.

A “brute force” approach for obtaining the attention values would be as follows: understand the input text at a high level, identify the referenced objects, recognize and localize them in the visual field, and, finally, position the attention scores accordingly. This approach, however, would require a very complex network and extensive labeled data.

Our approach, in contrast, allows us to train the TFA without such labels. The principal idea is that the attention should be on those regions that allow us to reconstruct the input text based on those regions only. The overall architecture is described in Figure 3.

Fig. 3: Proposed visual attention network. The network uses the convolutional-layer output of the pre-trained VGG19 network [25] as the spatial visual features. The attention module combines the spatial and textual features and assigns one probability to each spatial region. To train the attention network, we first pool the visual features by the attention scores (weighted average), and then use an auxiliary word classifier to select the input-text words based on the pooled visual features.

The first step is to encode the text and image inputs.

Text input: Let the textual input be a sequence of words given as one-hot indicators x_1, …, x_n ∈ {0,1}^|D|, where D is the dictionary of the words in our dataset. One-hot vectors are inefficient and redundant representations, so a word-to-vector encoding is needed:

e_i = W_E x_i,     (1)

where W_E ∈ R^{d×|D|} and d is the length of the encoded word vectors. To encode a whole sentence, we feed the sequence of word vectors e_i to an LSTM and take its last hidden state s ∈ R^c as the text encoding, where c is the LSTM's cell size.
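
A minimal sketch of this text branch, assuming a learned embedding layer in the role of W_E and placeholder vocabulary and dimension sizes:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=100, embed_dim=64, cell_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # plays the role of W_E
        self.lstm = nn.LSTM(embed_dim, cell_size, batch_first=True)

    def forward(self, word_ids):                 # (batch, n_words) integer word ids
        vectors = self.embed(word_ids)           # (batch, n_words, embed_dim)
        _, (h_last, _) = self.lstm(vectors)
        return h_last[-1]                        # (batch, cell_size) sentence encoding s

sentence = torch.tensor([[4, 17, 9, 2, 31]])     # e.g. "push the red plate left" as ids
s = TextEncoder()(sentence)                      # s: (1, 128)
```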

Visual input: To obtain the visual encoding, we divide the visual input into spatial regions and individually process them to extract visual features using a VGG19 network. The resulting spatial visual features form a matrix V ∈ R^{N×d_v}, where N is the number of spatial regions and d_v is the feature length of each region.
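
The spatial feature extraction can be sketched as follows; the 224x224 input, the resulting 7x7 grid of regions and the use of the torchvision VGG19 trunk (randomly initialized here so the example stays self-contained; in practice the pre-trained ImageNet weights would be loaded) are assumptions:

```python
import torch
from torchvision.models import vgg19

backbone = vgg19().features.eval()               # convolutional layers only
frame = torch.rand(1, 3, 224, 224)               # one camera frame
with torch.no_grad():
    fmap = backbone(frame)                       # (1, 512, 7, 7)
V = fmap.flatten(2).transpose(1, 2)              # (1, 49, 512): N regions x d_v features
```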

We combine the textual and visual encodings through a technique similar to [16]. We learn a mapping of both information types into a common space and combine them through an element-wise summation:

C = W_V V ⊕ W_S s,     (2)

where W_V and W_S are mapping matrices and ⊕ denotes element-wise summation. C is the combination matrix of the textual and visual inputs. Note that the mapped text encoding W_S s is a vector while the mapped visual features W_V V form a matrix; we augment the vector by repeating it N times before the summation.

To compute the final attention map, the model must give a high score to only a few spatial regions:

α = softmax(w_c^T C),     (3)

where w_c is a trainable weight vector that assigns a score to each region. The resulting α is the vector containing the attention scores of all regions. We use a softmax non-linearity to push the network to attend to a small number of regions.
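
A sketch of this attention module under the above definitions; the tanh after the element-wise sum, the layer sizes and the module name are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskFocusedAttention(nn.Module):
    def __init__(self, visual_dim=512, text_dim=128, joint_dim=256):
        super().__init__()
        self.map_v = nn.Linear(visual_dim, joint_dim)   # role of W_V
        self.map_s = nn.Linear(text_dim, joint_dim)     # role of W_S
        self.score = nn.Linear(joint_dim, 1)            # role of w_c

    def forward(self, V, s):          # V: (batch, N, visual_dim), s: (batch, text_dim)
        # Broadcast the sentence vector over the N regions (element-wise summation).
        C = torch.tanh(self.map_v(V) + self.map_s(s).unsqueeze(1))
        alpha = F.softmax(self.score(C).squeeze(-1), dim=-1)   # (batch, N) region scores
        return alpha

alpha = TaskFocusedAttention()(torch.rand(1, 49, 512), torch.rand(1, 128))
```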

Our attention network must be trained without any spatial annotation. The attention values α are not an output to be learned in a supervised fashion, but a latent variable dependent on the input text (see Figure 3). The main idea that allows us to train the attention network is that the spatial features, pooled according to this latent variable,

v̂ = Σ_i α_i V_i,     (4)

should allow us to reconstruct the set of words of the input text. Basically, given a frame and a sentence, we force the network to select a few regions of the input frame and to reconstruct the input text based on the selected regions alone. As a result, the only way the network can reconstruct the original input text is by selecting the relevant regions of the frame.

The predicted word set is obtained as

ŷ = f(v̂),     (5)

where f is a multi-layer perceptron and ŷ contains the scores of the predicted set of words. We optimize the cross-entropy loss between ŷ and the set of words of the input sentence.
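
A sketch of this word-selection training signal, assuming a multi-label (bag-of-words) cross-entropy and placeholder dictionary and feature sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_pool(V, alpha):
    # V: (batch, N, feat), alpha: (batch, N) -> attention-weighted average over regions
    return (alpha.unsqueeze(-1) * V).sum(dim=1)          # (batch, feat)

word_classifier = nn.Sequential(                         # the MLP f
    nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 100)  # 100 = dictionary size
)

V = torch.rand(2, 49, 512)
alpha = torch.softmax(torch.rand(2, 49), dim=-1)
bag_of_words = torch.zeros(2, 100)                        # 1 where a word occurs in the sentence
bag_of_words[:, [4, 17, 9]] = 1.0

v_hat = attention_pool(V, alpha)                          # pooled spatial features
logits = word_classifier(v_hat)                           # predicted word scores
loss = F.binary_cross_entropy_with_logits(logits, bag_of_words)
loss.backward()  # in the full model, gradients reach the attention only through the pooling
```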

V The vision and motor networks

Fig. 4: The proposed visuo-motor architecture.

Our architecture follows the generic architecture for the visuomotor policy in Figure 1. It has a vision net that extracts a primary latent encoding and a motor net that transforms it into actions, which in our case are joint angle commands. However, our architecture contains several specific design decisions aimed at taking advantage of the availability of the text description of the current task and of the TFA.

V-A The Vision Net

The objective of the vision network is to create a compact primary latent encoding that captures the important aspects of the current task. An ongoing problem is that the encoding needs to work within a limited dimensionality budget. Intuitively, general purpose visual features extracted from the image would waste space by encoding aspects of the image that are not relevant to the task. On the other hand, focusing only on the attention field might ignore parts of the image that are important for the task: in Figure 2-right, for instance, the robot arm itself is not visible.

Our proposed architecture for the vision net is contained in the yellow highlighted area of Figure 4 and incorporates several techniques that allow it to learn a representation that efficiently encodes the parts of the input relevant to the current task. The overall architecture follows the idea of a VAE-GAN [26]: it is composed of an encoder, a generator and a discriminator. The primary latent encoding is extracted from the output of the encoder; the rest of the components are only used during training or for visualization and debugging purposes (as in Section VI-C). The ability to train the encoder to create a suitable latent encoding depends on the appropriate choice of inputs, loss functions and adversarial samples, as discussed below.

The vision net receives a raw input image I and a representation of the object of interest in the form of one-hot vectors encoding the shape and color of the object (e.g., o_s = plate and o_c = red). Both the encoder and the generator receive the object representation in addition to their usual inputs (a technique building on [27]):

z = Enc(I, o_s, o_c),     (6)
Ĩ, Ã = Gen(z, o_s, o_c).     (7)

Notice that a novel feature of the architecture is that the generator creates not only an approximation Ĩ of the input image but also an approximation Ã of the task-focused attention.
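
A sketch of this conditioning, with hypothetical encoder and generator architectures and illustrative class counts; the reparameterization and other VAE details are omitted:

```python
import torch
import torch.nn as nn

N_SHAPES, N_COLORS, LATENT = 6, 4, 128

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 32, 4, 2), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(32 + N_SHAPES + N_COLORS, LATENT)

    def forward(self, image, shape_1hot, color_1hot):
        feat = self.conv(image)
        return self.fc(torch.cat([feat, shape_1hot, color_1hot], dim=1))

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(LATENT + N_SHAPES + N_COLORS, 64 * 8 * 8)
        self.deconv_img = nn.ConvTranspose2d(64, 3, 4, 2, 1)   # reconstructed frame
        self.deconv_att = nn.ConvTranspose2d(64, 1, 4, 2, 1)   # reconstructed attention map

    def forward(self, z, shape_1hot, color_1hot):
        h = self.fc(torch.cat([z, shape_1hot, color_1hot], dim=1)).view(-1, 64, 8, 8)
        return torch.sigmoid(self.deconv_img(h)), torch.sigmoid(self.deconv_att(h))

img = torch.rand(1, 3, 64, 64)
shape = torch.eye(N_SHAPES)[[2]]     # e.g. "plate"
color = torch.eye(N_COLORS)[[0]]     # e.g. "red"
z = Encoder()(img, shape, color)
recon_img, recon_att = Generator()(z, shape, color)
```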

Unlike traditional GAN discriminators, the discriminator employed in our architecture performs a more complex classification [28]. It takes as input an image and attention-map pair and classifies the shape and color of the object of interest, as well as whether the image and attention map are real or fake; the output thus has one dimension per shape class, one per color class, and one for the real/fake decision.
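
A sketch of such a multi-headed discriminator; the shared convolutional trunk, the class counts and the separate real/fake head are assumptions about one way to realize the described output:

```python
import torch
import torch.nn as nn

N_SHAPES, N_COLORS = 6, 4

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(          # image (3 ch) + attention map (1 ch)
            nn.Conv2d(4, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.shape_head = nn.Linear(128, N_SHAPES)
        self.color_head = nn.Linear(128, N_COLORS)
        self.real_head = nn.Linear(128, 1)

    def forward(self, image, attention):
        feat = self.trunk(torch.cat([image, attention], dim=1))
        return (self.shape_head(feat), self.color_head(feat),
                self.real_head(feat), feat)   # feat reused later for feature matching

shape_logits, color_logits, real_logit, feat = Discriminator()(
    torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64))
```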

If the discriminator receives real images and attention maps, it needs to classify them according to the type of the object contained in them:

(8)

If D receives the raw and masked images generated by the generator G (from the encoder's latent representation), it should classify them as fake:

(9)
(10)

Finally, if D receives raw and masked images from G where the latent representation comes from a random sample rather than from the encoder, these should also be classified as fake:

(11)

In these expressions, the softmax function is applied to the outputs of the discriminator to turn them into class probabilities.

The overall loss of the discriminator is thus the sum of these terms.

The training of GANs is notoriously unstable. A possible technique to improve stability is feature matching [29]: forcing the generator to produce images that match the statistics of the real data. Here we use the features extracted by the last convolution layer of the discriminator for this purpose. The discriminator will try to extract the features with the most discriminative power, and if the generator can match those features, this helps improve the results [28]:

(12)
(13)
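
A generic sketch of a feature-matching term of this kind, computed on the discriminator's last-layer features; matching batch means is one common formulation of [29], not necessarily the paper's exact loss:

```python
import torch

def feature_matching_loss(feat_real, feat_fake):
    # feat_*: (batch, d) features from the discriminator's last convolution layer
    return (feat_real.mean(dim=0) - feat_fake.mean(dim=0)).pow(2).mean()

loss_fm = feature_matching_loss(torch.rand(8, 128), torch.rand(8, 128))
```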

V-B The Motor Net

The motor net of our architecture (the lower gray box in Figure 4) contains both recurrent and stochastic components. It takes as input the primary latent encoding, which is processed through a 3-layer LSTM network with skip connections [30]. The output of the LSTM is fed into a mixture density network (MDN) [31]. The MDN provides the Gaussian kernel parameters μ_k and σ_k, and the mixing coefficients π_k, which are passed through a softmax activation layer. The 7-dimensional vector describing the next joint angle command is sampled from this mixture of Gaussians. The motor loss is calculated according to the MDN negative log-likelihood formula over the supervised data obtained from the demonstrations (behavioral cloning loss):

L_mdn = − log ( Σ_k π_k N(u | μ_k, σ_k²) ),     (14)

where u is the demonstrated joint command.
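
A sketch of the LSTM-plus-MDN motor net and the negative log-likelihood of Eq. (14); the skip connections are omitted, and all sizes as well as the diagonal-Gaussian parameterization are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_JOINTS, N_MIX, LATENT = 7, 5, 128

class MotorNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(LATENT, 256, num_layers=3, batch_first=True)
        self.mdn = nn.Linear(256, N_MIX * (2 * N_JOINTS + 1))   # mu, sigma, pi per kernel

    def forward(self, z_seq):                        # (batch, time, LATENT)
        h, _ = self.lstm(z_seq)
        p = self.mdn(h)
        mu, log_sigma, pi_logit = torch.split(
            p, [N_MIX * N_JOINTS, N_MIX * N_JOINTS, N_MIX], dim=-1)
        mu = mu.view(*h.shape[:2], N_MIX, N_JOINTS)
        sigma = log_sigma.view(*h.shape[:2], N_MIX, N_JOINTS).exp()
        pi = F.softmax(pi_logit, dim=-1)             # mixing coefficients
        return mu, sigma, pi

def mdn_nll(mu, sigma, pi, target):
    # target: (batch, time, N_JOINTS) demonstrated joint commands
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(target.unsqueeze(2)).sum(-1)        # (batch, time, N_MIX)
    return -torch.logsumexp(log_prob + pi.clamp_min(1e-8).log(), dim=-1).mean()

mu, sigma, pi = MotorNet()(torch.rand(2, 10, LATENT))
loss = mdn_nll(mu, sigma, pi, torch.rand(2, 10, N_JOINTS))
```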

V-C Overall loss functions and training

In addition to the loss functions discussed above, we also apply a KL-divergence loss and a cycle-consistency loss [32] on the latent representation, and a reconstruction loss on the generated outputs:

(15)
(16)

The overall objective of the network is to minimize the following combined loss function:

(17)
(18)

Using this loss function, the training was performed end-to-end on both the vision and motor losses. Due to GPU memory constraints, a separate fine-tuning of the motor network was necessary after the vision net weights were learned.
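
A sketch of how such a combined objective and two-stage schedule could be organized; the loss names, unit weights and the Adam optimizer are assumptions, and vision_net / motor_net are hypothetical handles:

```python
import torch

def total_loss(losses, weights=None):
    """Weighted sum of the individual loss terms (all scalar tensors)."""
    weights = weights or {name: 1.0 for name in losses}
    return sum(weights[name] * value for name, value in losses.items())

# Example with dummy scalars standing in for the real loss terms.
dummy = {name: torch.tensor(1.0, requires_grad=True)
         for name in ("gan", "fm", "kl", "cycle", "rec", "mdn")}
total_loss(dummy).backward()

# Stage 2 fine-tuning (hypothetical handles): freeze the vision net and
# update only the motor net with its behavioral-cloning loss.
# for p in vision_net.parameters():
#     p.requires_grad_(False)
# optimizer = torch.optim.Adam(motor_net.parameters(), lr=1e-4)
```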

VI Experiments

Fig. 5: An execution of the pushing task with the sentence “Push the red bowl from right to left”. Top row: original input image, middle row: reconstructed full image, bottom row: reconstructed TFA. Notice that visual disturbances such as the hand and the gorilla do not appear in the reconstructed image.

We collected demonstrations for the tasks of picking up and pushing objects (Figure 6) using an inexpensive Lynxmotion-AL5D robot. We controlled the robot using a PlayStation controller. For each task and object combination we collected 150 demonstrations, recorded at a frequency of 10 Hz. The training data consisted of the joint commands plus the visual input recorded by a PlayStation Eye camera mounted over the work area. The data thus collected was used to train both the vision and the motor networks. Note that this robot does not have proprioception: any collision or manipulation error needs to be detected solely from the visual input.

Fig. 6: Objects used in the picking up task (left) and pushing task (right)

VI-A Performance under benign conditions

Object                 w/o TFA [6]   with TFA

Pick-up task success rate
Red Bowl                   70%          70%
White Towel                50%          80%
Blue Ring                  30%          60%
Black Dumbbell             40%          50%
White Plate                60%          80%
Red Bubble-Wrap            10%          40%
Overall                    43%          63.3%

Pushing task success rate
Red Bowl                  100%         100%
White Plate                10%          30%
Blue Box                   10%          60%
Black-White QR-box         40%          60%
Overall                    40%          62.5%

Rate of recovery after physical disturbance
Red Bowl                   10%          70%
White Towel                10%          80%
Blue Ring                   0%          60%
Black Dumbbell              0%          60%
White Plate                 0%          40%
Red Bubble-Wrap             0%          40%
Overall                     3%          58%

TABLE I: Experimental results

The first set of experiments studies the performance of the visuomotor controller under benign conditions, that is, situations where the robot is given a command and is left alone to perform the task in an undisturbed environment. To compare our approach against a baseline, we have reimplemented and trained the network described in [6], which can be used in the same experimental setup but does not feature task-focused attention or adversarial training. Note, however, that the success rates are not directly comparable with those reported in that paper, due to the more complex objects used here and the absence of multi-task training.

The first two sections of Table I compare the performance of the two approaches for the pick-up and push tasks, for the individual objects, calculated over 10 experiments each. We note that the results depend very much on the type of the object being manipulated. The success rate for pushing the red bowl was 100% for both approaches, as this was a relatively deep bowl, which the robot learned to push by inserting its manipulator inside the bowl. On the other hand, the success rate for pushing the white plate was quite low, this being a slippery, round, low object that requires precise positioning of the robot arm. Overall, however, the proposed architecture using task-focused attention at least matched, and usually outperformed, the earlier approach on all objects.

VI-B Recovery after physical disturbance

In the second series of experiments, we investigated the controller's ability to recover from a physical disturbance. These experiments were done using the pick-up task. When the robot was about to pick up the object, we disturbed it by (a) pushing the object away just as the robot was about to pick it up, or (b) forcefully taking the object away from the robot after a successful grasp. In these situations we count the trial as a success if the robot notices the disturbance and recovers by redoing the grasp. We remind the reader that due to the limitations of the Lynxmotion-AL5D robot, the only way the robot can detect the disturbance is through its visual system.

The third section of Table I shows the experimental results. We notice that the results here are very different. In the absence of TFA the recovery rate is close to zero. In most cases, after losing the object, the robot tried to execute the manipulation without noticing that it was no longer grasping the object. With the help of TFA, however, the robot almost always noticed the disturbance, turned back and tried to redo the grasp. This phenomenon is illustrated in our supplementary video (https://www.youtube.com/watch?v=xdvNF_R_EkI). Averaged over all the objects, the recovery rate was only 3% for the baseline policy, while it was 58% for the policy with the TFA.

VI-C Ignoring visual disturbances: the disappearing gorilla

End-to-end trained visuomotor policies are vulnerable to the appearance of objects that were not seen during training. Even if they do not interfere with the physical operation of the robot, such objects end up being represented in the primary latent encoding and thus push the state representation out of the subspace in which the learning took place. In these situations the policy either emits random commands or (at best) stops the robot.

The proposed architecture allows us to ignore many of the possible visual disturbances, as long as they do not interfere with the robot arm or the manipulated object that is the current subject of the task-focused attention. Experiments comparing the architecture to one without TFA confirm that this is indeed the case. Space limitations allow us only a qualitative example in this paper.

One way to study whether the policy ignores a visual disturbance is to reconnect the generator at test time as well, and study the reconstituted images (which are a good representation of the information content of the primary latent encoding). Figure 5 shows the input image, the reconstructed image and the reconstructed attention. While the robot was executing the task of pushing the red bowl to the left, we added disturbances such as waving a hand or inserting a cutout gorilla figure into the visual field of the robot.

Notice that in the reconstituted image the hand and the gorilla disappear, while the robot itself and the object of interest are reconstructed accurately. As these disturbing visual objects are ignored by the encoding, the task execution proceeds without disturbance. While we must be careful about making claims about the biological plausibility of the details of our architecture, we note that the overall effect implements a behavior similar to the selective attention experiments of Chabris and Simons [21], purely as a side effect of an architecture designed for a completely different goal.

VII Conclusion

In this paper we described a technique for augmenting a deep visuomotor policy learned from demonstration with a task-focused attention model. The attention is guided by a natural language description of the task – it effectively tells the policy to “Pay Attention!” to the task and object at hand. Our experiments show that under benign situations, the resulting policy consistently outperforms a related baseline policy. More importantly, paying attention has significant robustness benefits. In severe adversarial situations, where a bump or human intervention forces the robot to miss the grasp or drop the object, the proposed policy recovers quickly in the majority of cases, while the baseline policy almost never recovers. In the case of visual disturbances, such as moving foreign objects in the visual field of the robot, the new policy is able to ignore these disturbances, which in the baseline policy often trigger erratic behavior.

Future work includes attention systems that can simultaneously focus on multiple objects, shift from object to object according to the requirements of the task, and work in severe clutter.

Acknowledgments: This work was supported in part by the National Science Foundation under grant numbers IIS-1409823 and IIS-1741431. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References