PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks

by Jiankai Sun, et al.

In this work, we study the problem of how to leverage instructional videos to facilitate the understanding of human decision-making processes, focusing on training a model with the ability to plan a goal-directed procedure from real-world videos. Learning structured and plannable state and action spaces directly from unstructured videos is the key technical challenge of our task. There are two problems: first, the appearance gap between the training and validation datasets could be large for unstructured videos; second, these gaps lead to decision errors that compound over the steps. We address these limitations with Planning Transformer (PlaTe), which has the advantage of circumventing the compounding prediction errors that occur with single-step models during long model-based rollouts. Our method simultaneously learns the latent state and action information of assigned tasks and the representations of the decision-making process from human demonstrations. Experiments conducted on real-world instructional videos and an interactive environment show that our method can achieve a better performance in reaching the indicated goal than previous algorithms. We also validated the possibility of applying procedural tasks on a UR-5 platform.




I Introduction

Intelligent reasoning in embodied environments requires that an agent have explicit representations of the parts or aspects of its environment that it reasons about [beetz2016ai]. As a generic reasoning application, action planning and learning is a crucial skill for cognitive robotics. Planning, in the traditional AI sense, means deliberating about a course of actions for an agent to take to achieve a given set of goals. The desired plan is a set of actions whose execution transforms the initial situation into a goal situation (goal situations need not be unique). Normally, the actions in a plan cannot be executed in an arbitrary sequence but have to obey an ordering that ensures all preconditions of each action hold at the time of its execution. In practice, planning faces two challenges. First, typically not all of the needed information is available: planning is meant for real environments in which many parameters are unknown or unknowable. Second, even if everything needed for complete planning were known, the computation would very likely be too intensive to keep pace with the real world. Thus, many planning systems and their underlying planning algorithms accept the restrictive assumptions of information completeness, determinism, instantaneousness, and idleness [beetz2016ai, Sun_NSPS_CoRL20].

Fig. 1: PlaTe Overview. Given the start and goal visual observations, the encoder extracts features for the planning trajectory. The Transformer-based procedure planning model is responsible for learning plannable latent representations and actions.

Procedure planning in instructional videos [chang2020procedure] (as shown in Figure 1) aims to make goal-conditioned decisions by planning a sequence of high-level actions that brings the agent from the current observation to the goal. Planning in instructional videos is a meaningful task because the ability to plan effectively is crucial for building an instruction-following agent. Although learning from instructional videos is natural to humans, it is challenging for an AI system because it requires understanding the human behaviors in the videos, with a focus on actions and intentions. How to learn structured and plannable state and action spaces directly from unstructured videos is the key technical challenge of our task. Other challenges include: 1) Learning to make accurate predictions from high-dimensional observations is still difficult [NIPS2015_a1afc58c], especially for visually complex long-horizon tasks. 2) The appearance gap between the training and validation datasets can be large for unstructured videos, so the agent needs to generalize. 3) These gaps lead to decision errors that compound over the steps.

It is crucial for autonomous agents to plan for complex tasks in everyday settings from visual observations [chang2020procedure]. Although reinforcement learning provides a powerful and general framework for decision making and control, its application in practice is often hindered by the need for extensive feature and reward engineering [Huang_DeepDecision_CoRL2020]. Moreover, deep RL algorithms are often sensitive to factors such as reward sparsity and magnitude, making well-performing reward functions particularly difficult to engineer. In many real-world applications, specifying a proper reward function is difficult.

In this paper, we propose a new framework for procedure planning from visual observations. We address these limitations with a new formulation of procedure planning and novel algorithms for modeling human behavior through a Transformer-based planning network, PlaTe. Our method simultaneously learns the high-level action planning of assigned tasks and the representations of the decision-making process from human demonstrations.

We summarize our contributions as follows:

  • We propose a novel method, the Planning Transformer network (PlaTe), for the task of procedure planning in instructional videos, which enjoys the advantage of long-term planning.

  • We integrate Beam Search into PlaTe to avoid large search discrepancies and the performance degradation they cause.

  • Experimental results show that our framework outperforms the baselines in the procedure planning task on both a real-world dataset and an interactive environment. We also validated the possibility of applying procedural tasks on a real UR-5 platform.

II Related Work

Self-Attention and Transformer.

Transformer-based architectures eschew recurrence in neural networks and instead rely entirely on self-attention mechanisms to draw global dependencies between inputs and outputs. Self-attention is particularly suitable for procedure planning, which can be seen as a sequence modeling task. Compared with Recurrent Neural Networks (RNNs), long short-term memory (LSTM) [10.1162/neco.1997.9.8.1735], and gated recurrent neural networks [chung2014empirical], the advantages of self-attention include avoiding compression of the whole past into a fixed-size hidden state, lower total computational complexity per layer, and more parallelizable computation. In this paper, motivated by the computational efficiency and scalability of Transformers, we explore Transformer-based architectures for procedure planning.

Learning to Plan from Pixels. Another related line of work learns dynamics models for model-based RL [hafner2019learning, 9361118]. Recent works have shown that deep networks can learn to plan directly from pixel observations in domains such as table-top manipulation [NEURIPS2018_08aac6ac, srinivas2018universal], navigation in VizDoom [pathak2017curiosity], and locomotion in joint space [ehsani2018let]. Universal Planning Networks (UPN) [srinivas2018universal] assume the action space to be differentiable and use a gradient descent planner to learn representations from expert demonstrations. Prior work [chang2020procedure] proposes the conjugate dynamics model to expedite latent space learning but suffers from compounding error. Without using explicit action supervision, Causal InfoGAN [NEURIPS2018_08aac6ac] extracts state representations by learning salient features. In contrast, our method builds a Transformer-based model that is amenable to long-horizon planning, operates directly on real-world videos, and handles the semantics of actions with sequential learning.

Learning from Instructional Videos. Interest in understanding human behaviors by analyzing instructional videos has increased dramatically in recent years [zhou2018towards, zhukov2019cross]. Event discovery tasks such as action recognition and temporal action segmentation [huang2016connectionist, chang2019d3tw, Pan_2019_CVPR_Workshops], state understanding [alayrac2017joint], and video summarization and captioning [sun2019videobert, zhou2018towards] study the recognition of human actions in video sequences. Other works [wang2019progressive] perform egocentric action anticipation, modeling the relationships between past events, future events, and incomplete observations. Action label prediction [sener2019zero, Farha_2019_ICCV] addresses the problem of anticipating all activities within a time horizon. However, the correct answer is often not unique due to the large uncertainty in human actions. Unlike previous works that predict "what is happening" or "what is going to happen", we focus on understanding decision-making processes given the start observation and the visual goal.

Fig. 2: PlaTe Framework. Given the start and goal visual observations, encoder outputs the latent representations . We set . The latent representation and predicted action are inferred using transformer-based action prediction model and state prediction model . During training, the ground-truth action and state are given. During inference, we use Beam Search to enhance the trained . The right part shows the Transformer architecture we used for planning model.

III Our Method: PlaTe

III-A Problem Setup

We consider a setup similar to [chang2020procedure]: we are given a start visual observation and a visual goal that indicates the desired outcome of a particular task. During training, we have access to observation-action pairs collected by an expert attempting to reach the goals. At test time, only the start visual observation and the visual goal are given. Our objective is to plan a sequence of actions that brings the underlying state of the start observation to that of the goal. $T$ is the planning horizon, i.e., the number of task-level action steps the model is allowed to take. Figure 1 shows a goal-oriented plannable example in which the intermediate steps of a complex task are planned.

Our key insight is that the compounding error can be reduced by jointly learning the state and action representation with Transformer-based architecture. As shown in the overall architecture in Figure 2, the procedure planning problem is formulated as


In the following sections, we first discuss how to encode the latent semantic representation. Then, we will introduce how to solve the long-term procedure planning task with transformer-based architecture. Lastly, we will discuss how to apply learned representation to solve the procedure planning by integrating Beam Search.

III-B Latent Semantic Representation

First, we use the state encoder that encodes the visual observation to a latent semantic representation.

The remaining question is how to learn a planning model that reconstructs the action sequence and the corresponding latent state representations. We assume the underlying process in Figure 1 is a fully observable goal-conditioned Markov Decision Process $(\mathcal{S}, \mathcal{A}, \mathcal{T})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $\mathcal{T}$ is the unknown transition probability distribution. The action prediction model is conditioned on the current state, the previous action, and the goal state; the state prediction model is conditioned on the previous state, the goal state, and the previous action. Together they plan the sequence of actions and hidden states that serves as a path from the initial state to the goal state. In this way, we are able to factorize the planning model as:


where we use the convention that , .
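A minimal sketch of how the two models described above could be alternated to produce a plan, under stated assumptions: the function names, the `null_action` placeholder standing in for "no previous action" at the first step, and the toy 1-D models are all hypothetical and only illustrate the conditioning structure (action from current state, previous action, and goal; next state from previous state, action, and goal).

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Action = int


def rollout(
    x_start: State,
    x_goal: State,
    horizon: int,
    action_model: Callable[[State, Action, State], Action],
    state_model: Callable[[State, Action, State], State],
    null_action: Action = -1,  # assumed placeholder for "no previous action"
) -> Tuple[List[Action], List[State]]:
    """Alternate the action and state models for `horizon` steps."""
    actions: List[Action] = []
    states: List[State] = [x_start]
    prev_action = null_action
    for _ in range(horizon):
        # Action conditioned on (current state, previous action, goal state).
        a_t = action_model(states[-1], prev_action, x_goal)
        # Next state conditioned on (previous state, chosen action, goal state).
        x_next = state_model(states[-1], a_t, x_goal)
        actions.append(a_t)
        states.append(x_next)
        prev_action = a_t
    return actions, states


# Toy 1-D models (hypothetical): action 1 moves right, action 0 moves left.
def toy_action_model(x: State, prev_a: Action, goal: State) -> Action:
    return 1 if goal[0] > x[0] else 0


def toy_state_model(x: State, a: Action, goal: State) -> State:
    return [x[0] + (1.0 if a == 1 else -1.0)]


plan, latent_path = rollout([0.0], [3.0], 3, toy_action_model, toy_state_model)
```

In this toy setting the rollout reaches the goal in three steps; with learned models, each step's prediction feeds the next, which is where compounding error can enter.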

However, there are still several vital difficulties: I) compounding error, II) generalization capability from the training set to the validation set on the real-world dataset. We extend the Transformer framework to tackle these problems.

III-C Transition Transformer

We propose a Transformer-based network architecture that can learn the action-state correlation and generate planning sequences. An overview of this architecture is shown in Figure 2. We introduce some critical design choices that assist in learning cross-modal correspondence and, more importantly, improve the accuracy of the generated planning sequences. These choices include the cross-modal transformer architecture, the attention mechanism for each transformer (causal attention [radford2018improving] vs. full attention [devlin2018bert]), and the supervision scheme. Our design choices are explained in detail below.

We introduce two cross-modal transformers: the action transformer, which learns the correspondence between the previous action feature and the state feature and generates the action prediction; and the state transformer, which learns the correspondence between the state feature and the action feature and generates the future state prediction. Attention is the core of the transformer network [NIPS2017_3f5ee243, li2021learn]. We set up four different settings: causal attention [radford2018improving], and full attention [devlin2018bert] with future-$N$ supervision. Specifically, the output of the attention layer, the context vector $C$, is computed using the query vector $Q$ and the key-value pair $K$, $V$ from the input with a mask $M$ via

$$C = \mathrm{softmax}\left(\frac{Q K^{\top} + M}{\sqrt{D}}\right) V,$$

where $D$ is the number of channels in the attention layer. The look-ahead mask $M$ is a triangular matrix.
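A minimal pure-Python sketch of masked scaled dot-product attention, shown to make the difference between the causal (look-ahead) mask and full attention concrete. It assumes the standard formulation in which masked positions receive a score of negative infinity before the softmax; the function names and toy inputs are hypothetical, and the feed-forward layers and multi-head structure of a real Transformer are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def look_ahead_mask(n: int) -> Matrix:
    """Triangular mask: 0 where key index <= query index, -inf elsewhere."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]


def full_mask(n: int) -> Matrix:
    """Full attention: every query may attend to every key."""
    return [[0.0] * n for _ in range(n)]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(q: Matrix, k: Matrix, v: Matrix, mask: Matrix) -> Matrix:
    """Context vectors softmax((Q K^T + M) / sqrt(D)) V, computed row by row."""
    d = len(q[0])
    out: Matrix = []
    for i, q_i in enumerate(q):
        scores = [
            (sum(a * b for a, b in zip(q_i, k_j)) + mask[i][j]) / math.sqrt(d)
            for j, k_j in enumerate(k)
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal_out = masked_attention(tokens, tokens, tokens, look_ahead_mask(3))
full_out = masked_attention(tokens, tokens, tokens, full_mask(3))
```

With the look-ahead mask, the first position can attend only to itself, so its context vector equals its own value vector; under full attention it also mixes in later positions.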

III-D Beam Search in Procedure Planning

Given an action transformer model $p_\theta$ parameterized by $\theta$ and an input $x$, which contains the information of the current state, the previous action step, and the goal state, the procedure planning task consists of finding an action sequence $a^{\star}$ such that $a^{\star} = \arg\max_{a \in \mathcal{Y}} \log p_\theta(a \mid x)$, where $\mathcal{Y}$ is the set of all sequences. $a = (a_1, \dots, a_N)$ can be regarded as a sequence of tokens from a vocabulary $\mathcal{V}$, where $N$ is the length of the sequence $a$. Then $\log p_\theta(a \mid x)$ can be factored as

$$\log p_\theta(a \mid x) = \sum_{t=1}^{N} \log p_\theta(a_t \mid x, a_{<t}).$$

In the context of searching for a procedure plan, a search discrepancy means extending a partial sequence with a token that is not the most probable one. More formally, a sequence $a$ is considered to have a search discrepancy at time step $t$ if

$$a_t \neq \arg\max_{a' \in \mathcal{V}} \log p_\theta(a' \mid x, a_{<t}).$$

The discrepancy gap is the difference in log-probability between the most likely token and the chosen token [meister2020best]. At time step $t$, the discrepancy gap is

$$\max_{a' \in \mathcal{V}} \log p_\theta(a' \mid x, a_{<t}) - \log p_\theta(a_t \mid x, a_{<t}).$$

To protect long-horizon procedure planning from large search discrepancies, we introduce discrepancy-constrained Beam Search during the inference phase of procedure planning. Given a threshold on the discrepancy gap, we modify the set of candidate action sequences at each step to include only the top one-token extensions in each beam. The action log-probability output by the action prediction model is used as the score function. The inference algorithm is shown as Algorithm 1. In this way, we can mitigate the performance degradation caused by search discrepancies.

Fig. 3: Attention Mechanism Comparison. Causal models are often supervised to predict the immediate next future for each input tensor, while full-attention models predict the $N$ future time steps from the last timestamp.

III-E Learning

As shown in Figure 2, we have three main components to optimize: state encoder , action prediction model , and state prediction model . We refer to the expert trajectory as and predicted trajectory as state-action pairs visited by the current planning model.

We optimize by descending the gradient in Equation 7.


where is the cross-entropy loss. In training, -step sequence is output once. In testing, single-step inference is made with Beam Search.

Algorithm 1 PlaTe: Planning Inference Phase

Input: sequence , maximum hypothesis length , beam nodes buffer , buffer size , scoring function , maximum beam size , planning model
Output: searched sequence

for do
     for  do
         if  then
         end if
     end for
end for
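A minimal sketch of a beam search with a discrepancy-gap constraint, in the spirit of the inference phase described above. It is an illustration under stated assumptions, not the authors' Algorithm 1: the pruning rule (drop any one-token extension whose gap to the best extension of the same prefix exceeds `max_gap`), the function names, and the toy probability table are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Prefix = Tuple[int, ...]


def constrained_beam_search(
    step_log_probs: Callable[[Prefix], Sequence[float]],
    horizon: int,
    beam_width: int,
    max_gap: float,
) -> Prefix:
    """Return the best-scoring action sequence of length `horizon`."""
    beams: List[Tuple[float, Prefix]] = [(0.0, ())]
    for _ in range(horizon):
        candidates: List[Tuple[float, Prefix]] = []
        for score, prefix in beams:
            log_p = step_log_probs(prefix)
            best = max(log_p)
            for action, lp in enumerate(log_p):
                # Discrepancy gap: best one-token log-prob minus the chosen one.
                if best - lp <= max_gap:
                    candidates.append((score + lp, prefix + (action,)))
        # Keep the `beam_width` highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]


# Toy two-action model (hypothetical): greedy picks action 0 first, but the
# most probable full sequence starts with action 1.
_TABLE = {(): [0.6, 0.4], (0,): [0.5, 0.5], (1,): [0.9, 0.1]}


def toy_model(prefix: Prefix) -> List[float]:
    return [math.log(p) for p in _TABLE[prefix]]


greedy = constrained_beam_search(toy_model, 2, 1, math.inf)
beam = constrained_beam_search(toy_model, 2, 2, math.inf)
tight = constrained_beam_search(toy_model, 2, 2, 0.1)
```

In the toy example, a beam width of 1 (greedy) commits to action 0 first, an unconstrained beam of width 2 recovers the more probable sequence starting with action 1, and a tight gap threshold prunes that alternative at the first step. The threshold therefore trades off exploration against staying close to the locally most probable tokens.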

IV Experiments

In our experiments, we aim to answer the following questions: (1) Is PlaTe efficient and scalable for procedure planning tasks? (2) Can PlaTe learn to plan in an interactive environment? (3) Can PlaTe learn procedure planning that is robust in the real world? To answer Question 1, we evaluate PlaTe on CrossTask, a real-world offline instructional video dataset, and show that procedure planning with our algorithm performs better than previous methods. To answer Question 2, we compare PlaTe with baselines on ActioNet and find that our method can perform procedure planning in an interactive setup while outperforming the baselines. To answer Question 3, we evaluate our method on a real-world UR-5 robot arm platform. We also perform an ablation study on attention mechanism design, input type, and action sequence length. It is worth noting that, compared with the baselines, our method has an advantage in long-term procedure planning due to the Transformer-based architecture and discrepancy-constrained Beam Search.

Fig. 4: Procedure planning qualitative results on CrossTask for Make Pancakes. The top row describes the correct action sequence required to "make pancakes". To examine our method’s robustness, we vary the start and goal observations. The results show that our approach plans robustly across different stages of the video.
                              Prediction Length                         Prediction Length
Method                        Success Rate (%)  Accuracy (%)  mIoU (%)  Success Rate (%)  Accuracy (%)  mIoU (%)
Random                        0.01              0.94          1.66      0.01              0.83          1.66
RB [sun2019videobert]         8.05              23.30         32.06     3.95              22.22         36.97
RL [janner2019trust]          8.25              24.20         33.25     4.16              23.29         38.63
WLTDO [ehsani2018let]         1.87              21.64         31.70     0.77              17.92         26.43
UAAA [Farha_2019_ICCV]        2.15              20.21         30.87     0.98              19.86         27.09
UPN [srinivas2018universal]   2.89              24.39         31.56     1.19              21.59         27.85
DDN [chang2020procedure]      12.18             31.29         47.48     5.97              27.10         48.46
PlaTe (Ours)
TABLE I: CrossTask Results. Our model significantly outperforms baselines with improvement in terms of success rate.

IV-A Experimental Setup

Datasets. We evaluate PlaTe on an instructional video dataset CrossTask [zhukov2019cross] and an interactive dataset ActioNet [ActioNet], which is based on AI2-THOR [ai2thor]. For real-world UR-5 experiments, we collect a UR-5 Reaching Dataset which consists of 100 trajectories (2150 first-person-view RGB image and corresponding action pairs) as a training set and evaluate on a real UR-5 platform. We perform an ablation study on CrossTask dataset.

Baselines. We compare to the following methods:

- Random Policy. Random Policy selects an action uniformly at random from the full set of candidate actions and serves as the empirical lower bound on performance.

- Retrieval-based (RB) [sun2019videobert]. Inspired by Sun et al. [sun2019videobert], the procedure planning problem can be approached from a more static view: RB finds the nearest neighbor of the start and goal visual observation pair by querying the training set and then directly outputs the actions in between.

- Reinforcement Learning (RL) [janner2019trust]. We adapt Model-based Policy Optimization (MBPO) [janner2019trust], one of the most popular model-based RL (MBRL) algorithms, as our offline RL baseline. We first learn the latent space and then use the L2 distance in the latent space as the reward for the RL algorithm: the farther a state is from the final goal in the latent space, the lower the reward. Without the Transformer architecture, MBRL does not perform as well as our model.

- WLTDO [ehsani2018let]. WLTDO plans using a recurrent model for egocentric videos. Given two non-consecutive observations, WLTDO predicts the intermediate action sequence. We add a softmax layer to the original model to output discrete actions.

- UAAA [Farha_2019_ICCV]. UAAA is a two-step approach to infer the action labels in the observed frames with RNN-HMM architecture, and then to predict the future action labels using an auto-regressive model. UAAA is modified to condition on both the start visual observation frame and the visual goal frame as our baseline.

- Universal Planning Networks (UPN) [srinivas2018universal]. Aiming to learn a plannable representation using supervision, UPN assumes a continuous and differentiable action space to enable gradient-based planning by minimizing the supervised imitation loss. We adapt UPN by adding a softmax layer to output discrete actions.

- Dual Dynamics Networks (DDN) [chang2020procedure]. As the first work to propose the procedure planning task in instructional videos, DDN learns the conjugate dynamics model and forward dynamics model, and explicitly leverages the structured priors to perform sample-based planning.
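For the RL baseline above, the latent-distance reward can be sketched as follows. This is a hedged illustration: the text only states that the reward decreases with the L2 distance to the goal in latent space, so using the negative distance directly (with no scaling or offset) and the name `latent_reward` are assumptions.

```python
import math
from typing import Sequence


def latent_reward(state: Sequence[float], goal: Sequence[float]) -> float:
    """Negative Euclidean (L2) distance between a latent state and the goal."""
    return -math.sqrt(sum((s - g) ** 2 for s, g in zip(state, goal)))


far = latent_reward([0.0, 0.0], [3.0, 4.0])
near = latent_reward([3.0, 3.0], [3.0, 4.0])
```

States closer to the goal receive a higher (less negative) reward, and the reward is zero exactly at the goal.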

Metrics. We use the following metrics:

- Success Rate. Success rate measures the correctness of the entire action sequence. For the offline CrossTask dataset, the planned action sequence is considered successful only if every planned action matches the ground truth. For the interactive ActioNet and UR5-Reaching datasets, a planned sequence of actions is considered successful only if it reaches the goal.

- Top-1 Accuracy. Top-1 accuracy measures the correctness of the action at each time step. Accuracy is a relaxation of the success rate because it does not require the whole sequence to match the ground truth. Accuracy is averaged over individual actions to balance the effect of repeating actions.

- mIoU. Mean Intersection over Union (mIoU) is the least strict metric and is used to capture cases where the model outputs the required actions but fails to discern their order. As in Chang et al. [chang2020procedure], we compute the IoU between the set of planned actions and the set of ground-truth actions.
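The three offline metrics above can be sketched as follows. This is a minimal illustration, assuming actions are integer labels, that predicted and ground-truth sequences have equal length, and that mIoU is the per-sequence set IoU averaged over sequences; the function names are hypothetical.

```python
from typing import Sequence

Plans = Sequence[Sequence[int]]


def success_rate(pred: Plans, gt: Plans) -> float:
    """Percentage of sequences in which every action matches the ground truth."""
    hits = sum(1 for p, g in zip(pred, gt) if list(p) == list(g))
    return 100.0 * hits / len(gt)


def accuracy(pred: Plans, gt: Plans) -> float:
    """Percentage of individual time steps whose action matches the ground truth."""
    correct = sum(pa == ga for p, g in zip(pred, gt) for pa, ga in zip(p, g))
    total = sum(len(g) for g in gt)
    return 100.0 * correct / total


def mean_iou(pred: Plans, gt: Plans) -> float:
    """Order-insensitive overlap between planned and ground-truth action sets."""
    ious = []
    for p, g in zip(pred, gt):
        p_set, g_set = set(p), set(g)
        ious.append(len(p_set & g_set) / len(p_set | g_set))
    return 100.0 * sum(ious) / len(ious)


predicted = [[1, 2, 3], [4, 6, 5]]
ground_truth = [[1, 2, 3], [4, 5, 6]]
sr = success_rate(predicted, ground_truth)
acc = accuracy(predicted, ground_truth)
miou = mean_iou(predicted, ground_truth)
```

In the toy example the second plan contains the right actions in the wrong order, so it counts fully toward mIoU, partially toward accuracy, and not at all toward success rate, which shows why the metrics are ordered from strictest to least strict.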

Implementation Details. We use the Transformer architecture [NIPS2017_3f5ee243] as the transition model with self-attention layers and heads. The transition model is two-headed: one head for action prediction and the other for state prediction. The state encoder in our model consists of two fully-connected layers with [64, 32] units and Leaky-ReLU as the non-linearity. During training, all models are optimized by Adam [kingma2014adam] with the starting learning rate of . We train our model for 200 epochs with a batch size of 256 on a single GTX 1080 Ti GPU.
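A minimal pure-Python sketch of a state encoder with the layer sizes mentioned above (two fully-connected layers with 64 and 32 units and Leaky-ReLU). Several details are assumptions made for illustration: the random uniform initialization, the negative slope of 0.01, applying the non-linearity after both layers, and the toy input dimension; a real implementation would use a deep learning framework and trained weights.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight rows, bias)


def leaky_relu(v: Vector, slope: float = 0.01) -> Vector:
    return [x if x > 0 else slope * x for x in v]


def linear(v: Vector, weights: List[Vector], bias: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def init_layer(rng: random.Random, n_in: int, n_out: int) -> Layer:
    scale = 1.0 / n_in ** 0.5  # assumed initialization range
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_encoder(input_dim: int, seed: int = 0) -> Callable[[Vector], Vector]:
    """Build an untrained encoder: input_dim -> 64 -> 32."""
    rng = random.Random(seed)
    layers = [init_layer(rng, input_dim, 64), init_layer(rng, 64, 32)]

    def encode(observation: Vector) -> Vector:
        h = observation
        for weights, bias in layers:
            h = leaky_relu(linear(h, weights, bias))
        return h

    return encode


encode = make_encoder(input_dim=8)  # toy input size, not the dataset's feature size
latent = encode([0.5] * 8)
```

The encoder maps an observation feature vector to a 32-dimensional latent state, which is the representation the transition model operates on.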

                            T=3
Type                        Success Rate (%)  Accuracy (%)  mIoU (%)
beam width                  14.18             35.92         52.36
beam width                  15.33             37.67         58.08
visual start & lang. goal   16.05             36.46         60.51
lang. start & lang. goal    14.05             34.21         57.19
Causal-Attn.                15.23             34.34         58.15
Full-Attn. Future-          17.84             38.91         63.02
PlaTe (Ours)
TABLE II: Ablation Study on CrossTask.

IV-B Evaluating Procedure Planning on CrossTask

First, we choose a real-world instructional video dataset, CrossTask [zhukov2019cross], to conduct our experiments. CrossTask comprises videos ( hours). Each video depicts one of the primary long-horizon tasks, for example, Make Pancakes, Add Oil to Your Car, or Make Lemonade. To test the trained agent’s generalization capability, we randomly divide the videos in each task into splits for training and testing. The number of procedure steps varies across tasks. For example, the simplest task is "Jack Up a Car", which contains only 3 steps. Complex tasks such as "Grill Steak", "Make Bread and Butter Pickles", "Pickle Cucumbers", and "Change a Tire" require 11 steps to finish. Each video has dense temporal boundaries and action labels that describe the person’s actions in the video. Each video can be regarded as a sequence of images, indexed by frame, that has been annotated with a sequence of action labels, where each action has a start frame index and an end frame index. Following the setup of [chang2020procedure], we choose frames around the beginning of the captions as the current observation, the caption description as the semantic meaning of the action, and images near the end as the next observation. Here, $\delta$ controls the duration of each observation, and we set for all data we have used in our paper. Our state space is the pre-computed features provided in CrossTask: each second of the video is encoded into a -dimensional feature vector, which is a concatenation of the I3D [8099985], Resnet-152 [he2016deep], and audio VGG features [7952132]. The action space is constructed by enumerating all combinations of predicates and objects, which provides action labels and is shared across all tasks. Our method is suitable for modeling longer trajectories, but we restrict the experiments to horizon lengths to maintain a consistent comparison with state-of-the-art methods.
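A minimal sketch of how (observation, action, next observation) triples could be assembled from per-second features and annotated action segments, as described above. The symmetric window of `delta` seconds on each side of the segment boundaries, the clipping at the video ends, the function names, and the toy data are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Feature = List[float]
Segment = Tuple[str, int, int]  # (action label, start second, end second)
Transition = Tuple[List[Feature], str, List[Feature]]


def window(features: Sequence[Feature], center: int, delta: int) -> List[Feature]:
    """Features within `delta` seconds of `center`, clipped to the video length."""
    lo = max(0, center - delta)
    hi = min(len(features), center + delta + 1)
    return list(features[lo:hi])


def build_transitions(
    features: Sequence[Feature], segments: Sequence[Segment], delta: int
) -> List[Transition]:
    """One (observation, action, next observation) triple per annotated segment."""
    return [
        (window(features, start, delta), label, window(features, end, delta))
        for label, start, end in segments
    ]


# Toy video: 10 seconds of 1-D features and two annotated steps (hypothetical).
features = [[float(i)] for i in range(10)]
segments = [("pour", 2, 5), ("stir", 6, 9)]
transitions = build_transitions(features, segments, delta=1)
```

Each triple pairs the frames around the start of a step with its action label and the frames around its end, so a larger `delta` gives each observation more temporal context at the cost of overlap between neighboring steps.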

Recall that in procedure planning, given the start and goal observations, the agent needs to output a valid procedure to reach the specified goal. As shown in Table I, because the action space of instructional videos is not continuous, the gradient-based planner of UPN does not work well. By introducing Beam Search, PlaTe achieves better performance in terms of success rate, accuracy, and mIoU. By designing a model with Transformer-based components, we show that our model outperforms all the baseline approaches on real-world videos.

In Figure 4, we visualize some examples of the predicted procedure planning results on CrossTask, where the task is Make Pancakes. Our model is able to predict a sequence of actions in the correct order. The most challenging steps in Make Pancakes are "add flour" and "add sugar", where the visual differences are small and the correct action can only be inferred from context and sequence relationships.

In Figure 5, we further show our model’s performance as the planning horizon increases. Our model consistently outperforms the RB, UPN, and DDN baselines on the success rate metric because PlaTe benefits from the Transformer-based architecture and discrepancy-constrained Beam Search when searching for the sequence of actions that reaches the goal.

Fig. 5: PlaTe consistently outperforms baselines as the horizon of planning increases. It can be seen that PlaTe is better at long-term planning.

Fig. 6: Procedure planning qualitative results on ActioNet for Wash Dishes. The top row describes the correct action sequence required to wash dishes. To examine our model’s robustness, we vary the start and goal observations.
Method                        ActioNet        UR5
                              Prediction Length
Random                        0.01    0.01    0.01    0.01
RB [sun2019videobert]         7.62    3.67    32      26
RL [chang2020procedure]       7.41    3.55    50      44
WLTDO [ehsani2018let]         1.61    0.69    42      38
UAAA [Farha_2019_ICCV]        1.99    0.79    44      40
UPN [srinivas2018universal]   2.27    1.02    44      38
DDN [chang2020procedure]      10.06   4.06    52      46
PlaTe (Ours)
TABLE III: Success Rate (%) of ActioNet and UR5.

IV-C Ablation Study on CrossTask

We also conduct experiments on CrossTask dataset with variations of our model:

  • Beam width . We investigate the impact of beam width on Beam Search.

  • Causal Attention vs. Full Attention with Future- supervision.

  • Visual Input vs. Lang. Input. Instead of using visual observations, we replace the start and goal visual observations with language, respectively, and see what the impact is.

As shown in Table II, our full model uses Beam Search, visual start and goal observations, and Full-Attention with Future-$N$ supervision. Removing Beam Search (beam width 1) significantly hurts the overall performance. With Beam Search, our model can keep several of the most probable alternatives as candidates at each time step. With a beam width of 1, the learned policy focuses excessively on single-step action modeling, which is not desirable for long-term planning. However, our results also show that a larger beam width leads to increasingly large early discrepancies. We also note that increasing $N$ in Future-$N$ supervision slightly improves performance. After replacing the visual observations with language descriptions, performance drops, because the text descriptions contain less context information than the visual observations.

We define the compounding error as the mean squared error (MSE) between the encoder output and the transition model prediction over the planning steps. We compare PlaTe with a Fully Connected (FC) layer against PlaTe with the Transformer architecture. The quantitative results are reported in Figure 7.

Fig. 7: Compounding Error (CE) Comparison.
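The compounding-error measurement can be sketched as a per-step MSE between the encoded ground-truth states and the states predicted by the transition model. This is a minimal illustration: the function name and the toy trajectories are hypothetical, and how the per-step errors are aggregated for the figure is not specified here.

```python
from typing import List, Sequence


def per_step_mse(
    encoded: Sequence[Sequence[float]], predicted: Sequence[Sequence[float]]
) -> List[float]:
    """MSE between encoder outputs and predicted latent states, one value per step."""
    errors = []
    for x_enc, x_pred in zip(encoded, predicted):
        squared = sum((a - b) ** 2 for a, b in zip(x_enc, x_pred))
        errors.append(squared / len(x_enc))
    return errors


# Toy trajectories (hypothetical): the prediction drifts further at each step.
encoded_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
predicted_states = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
errors = per_step_mse(encoded_states, predicted_states)
```

An error curve that grows with the step index, as in the toy example, is the signature of compounding error; a flatter curve indicates that the model's predictions stay close to the encoded states over the rollout.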

IV-D Evaluating Procedure Planning on ActioNet

To further illustrate the effectiveness of our method, we experiment on a second dataset ActioNet [ActioNet]. ActioNet is an interactive end-to-end platform for data collection and augmentation of the task-based dataset in 3D environment. Comprising over hierarchical task structures and videos, this dataset contains narrated instruction videos across different scenes to give over video. There are a total of 34 action candidates. The narrated instruction videos describe diverse tasks such as Close the shower curtain. Compared with CrossTask, this dataset has more tasks, and the average trajectory length is longer.

As illustrated in Table III, both WLTDO and UAAA perform similarly to UPN, which can be seen as an RNN goal-conditioned policy trained directly with imitation objectives. Our model combines the strengths of the Transformer-based architecture and low-discrepancy Beam Search, enabling it to predict actions from long-term videos and to outperform all the baseline approaches on all metrics.

In Figure 6, we visualize some procedure planning examples on ActioNet, where the task is Wash Dishes. Our full model is able to predict a sequence of intermediate actions in the correct order. The most challenging steps in Wash Dishes are "Put Object" and "Pickup Object". These two actions can be distinguished only by taking the temporal information into account.

Fig. 8: Procedure Planning qualitative results on UR5.

IV-E Evaluating Procedure Planning on Real Robot

Previous procedure planning research has rarely reported experimental results on real robots. There remains a gap between offline training and real-world applications. To validate the possibility of applying procedural tasks in a real environment, we conduct experiments on "Reaching a block" (cf. Fig. 8) using a Universal Robots UR5 system. Although this task is easy in simulation, it can be difficult for a real robot [gu2017deep, Chen_RED_ICRA20]. Our agent learns to achieve tasks by imitating expert demonstrations. An RGB-D camera is installed in front of the manipulator, and the observation space includes the raw RGB images at the current position and the goal position. The action space includes None, Up, Down, Left, Right, Forward, and Backward. UR5 Reacher consists of episodes of interactions, where each episode is time steps long. The fingertip of UR5 Reacher is confined within a 3-dimensional 0.7m × 0.5m × 0.4m boundary. The robot is also constrained within a joint-angular boundary to avoid self-collision.

In the Reaching task, the robot is required to start at the current observation and then move to a goal observation. Since the controller of the robot is imperfect, we consider a reach to be successful if the robot reaches within cm of the block. The UR5 robotic arm is controlled by human volunteers to reach the target, which generates 100 offline expert demonstration trajectories. All methods are evaluated on a real UR5 robotic arm for 50 episodes. As shown in Table III, our approach outperforms the other methods. Even though the baseline strategies succeed on the offline dataset, real robotic reaching lags far behind human performance and remains unsolved in the field of robot learning.

V Conclusion

To conclude, we propose a cross-modal Transformer-based architecture to address the procedure planning problem, which can capture long-term time dependencies. Moreover, we propose to enhance the Transformer-based planner with Beam Search. Finally, we evaluate our method on a real-world instructional video dataset and an interactive environment. The results indicate that our method can learn a meaningful action sequence for planning and recover the human decision-making process. We also validated the possibility of applying procedural tasks on a real UR-5 platform.