Log In Sign Up

Stage Conscious Attention Network (SCAN) : A Demonstration-Conditioned Policy for Few-Shot Imitation

by   Jia-Fong Yeh, et al.

In few-shot imitation learning (FSIL), using behavioral cloning (BC) to solve unseen tasks with few expert demonstrations becomes a popular research direction. The following capabilities are essential in robotics applications: (1) Behaving in compound tasks that contain multiple stages. (2) Retrieving knowledge from few length-variant and misalignment demonstrations. (3) Learning from a different expert. No previous work can achieve these abilities at the same time. In this work, we conduct FSIL problem under the union of above settings and introduce a novel stage conscious attention network (SCAN) to retrieve knowledge from few demonstrations simultaneously. SCAN uses an attention module to identify each stage in length-variant demonstrations. Moreover, it is designed under demonstration-conditioned policy that learns the relationship between experts and agents. Experiment results show that SCAN can learn from different experts without fine-tuning and outperform baselines in complicated compound tasks with explainable visualization.


page 1

page 4

page 6

page 7

page 10

page 12

page 13

page 14


Humans can learn to perform unseen compound tasks from different experts with few nonidentical demonstrations. The number of researches on few-shot imitation learning (FSIL) increases rapidly to verify whether the machine also has this ability. The above challenging settings of FSIL problem are illustrated in Figure 1, and most of existing works only solve them partially. To overcome the complexity of environment and concomitant complicated training process, behavioral cloning (BC) is leveraged to mimics the experts. Previous works finn17a; Duan2017; Yu2018; Yu2019; dasari2020transformers; Bonardi2020 only support one demonstration at once, which limits the capability of models. We argue that retrieving knowledge from few demonstrations simultaneously can break through the limitations and lead to better performance when conducting FSIL under these challenging settings.

Figure 1: Schema of few-shot imitation learning (FSIL). In our FSIL problem, models need to solve the task in novel environment that is unseen during training. Few demonstrations are given to let models imitate. There are three challenges in our FSIL setting, (1) We conduct FSIL on compound tasks which contain multiple stages. (2) Demonstrations are length-variant. Each stage may locate at different timestamps. (3) Models need to learn the behavior from a different type of expert. None of previous works can solve these challenges concurrently. Moreover, learning from length-variant sequences is non-trivial, making our FSIL a challenging yet practical problem.

Meta-learning based methods finn17a; Yu2018; Yu2019 learn a meta-policy that takes state from current playout and outputs an action via BC. Before testing, the meta-trained policy uses expert demonstrations to adapt meta-parameters. Then, they update the policy several times based on the number of demonstrations. The main drawbacks of these methods are twofold: (1) fine-tuning is required, and (2) they need a learnable loss Yu2019 to map expert to agent explicitly.

To tackle these drawbacks, demonstration-conditioned (DC) based methods Duan2017; James2018; Bonardi2020; Shao2020ObjectDO; dasari2020transformers; dance21 apply a policy which predict actions conditioned on both states and demonstrations . Fine-tuning is optional for DC methods, because they are expected to behave by observing demonstrations. They implicitly learn the mapping since they have the information in concurrently when generating , even if experts are different from the agent. The primary objective of DC policies is to encode few demonstrations into a representative embedding. However, it is difficult to encode length-variant demonstrations. Hence, most DC works Duan2017; Shao2020ObjectDO; dasari2020transformers only contains one demonstration in and claim themselves one-shot imitation methods.

To handle demonstrations with variant lengths, some DC works apply task-embed techniques. The authors James2018; Bonardi2020 concatenate the first and last frames of each demonstration and average their features to generate task-embedding (alias sentence in their paper). Nevertheless, this method does not consider the importance of temporal transitions that are essential for policy learning and cannot provide efficient information in the case of long demonstrations given. Therefore, dance21 dance21 introduces a transformer-based network that considers both temporal and cross-demonstration information using attention mechanisms. They generate a task embedding by averaging the attention output of few demonstrations at each timestamp. The premise of this method is that frames at the same timestamp of each demonstration have similar knowledge. However, due to different initial states, the operating time of each stage could easily vary and result in temporal misalignment. Therefore, the model might be confused by the frame mixed with two distinct stages and degrade the performance.

We aim to design an attention mechanism that can identify important frames at different timestamp. Meanwhile, the attention mechanism should detect stages in compound tasks. A compound task containing multiple stages often appears in robotics problems. When solving compound tasks in FSIL problem, the policy needs to learn both perception and path planning. This makes solving compound tasks challenging. Yu2019 Yu2019 leverages an additional phase predictor to split the compound tasks, and then the policy only needs to adapt to each stage. The disadvantage is that the number of stages needs to be known in advance.

In this work, we propose a novel stage conscious attention network (SCAN). SCAN takes both demonstrations and playouts as input to learn the mapping due to the characteristic of DC method. Furthermore, SCAN applies novel stage conscious attention to let each playout frame has its attention score to each frame in the demonstrations. The frame features of demonstrations are weighted by attention scores to produce the same shape contexts. We then average them to generate the informative task-embedding. With stage conscious attention, SCAN can retrieves knowledge from few demonstrations simultaneously. Experimental results show that SCAN has a significant performance improvement compared to baselines with explainable visualizations. The overall contributions are summarized as follows:

  • Our work is the first DC method that solves FSIL problem under the settings of compound task, length-variant demonstrations, and learning from a different expert.

  • The novel stage conscious attention detects important frames of misalignment stages and is robust to length-variant demonstrations.

  • Extensive experiment results express proposed SCAN is powerful, and explainable visualization also proves the effectiveness of novel stage conscious attention.

Related Work

Few-shot Learning.

Few-shot learning (FSL) has become popular since collecting a huge amount of labeled data is difficult in most research problems. The objective of FSL is to infer the unlabeled data (query set) by leveraging few labeled data (support set). FSL is first studied on image classification. There are many influential and well-known metric-learning-based FSL methods, such as Matching network Vinyals16MN, Prototypical network Snell17PN, and Relation Network Sung18RN. They try to learn the relationship between support and query sets rather than inferring the unlabeled data directly. Moreover, optimization-based method Finn17MAML seeks a meta-parameter set that can quickly adapt to unseen tasks. Nowadays, FSL has been extended to many research fields. For instance, image segmentation SSegment20, object detection FSOD21, and imitation learning FSIL2020. Our work aims to develop a method that can learn unseen compound tasks with few-shot length-variant demonstrations from a different expert.

Few-shot Imitation Using RL/IRL.

Reinforcement learning (RL) methods assume the reward function of environments is known. But, it is difficult to design reward functions that give precise feedback to policies in real-world problems. Therefore, inverse RL (IRL) Ng99policyinvariance infers a reward function from few expert demonstrations. Then, IRL policy can be trained by interacting with the environment under the inferred reward function. In addition, modern IRL methods Ho16; reddy2020sqil are usually GAN-like goodfellow2014generative. They assign a high reward to the states from demonstrations but a low reward to the states from collected agent samples. Since these methods only use rewards of states to train their model, they can handle the demonstrations without actions, which differs from BC. However, both RL and IRL methods need to interact with the target environment to train the policy. In other words, the learned reward function (IRL) or the well-trained policy (RL) do not be applicable to novel environments. In our FSIL problem, policies can only use demonstrations to fine-tune but cannot interact with the novel environment before performance evaluation. Thus, most RL and IRL methods are not available in our work.

A recent work dance21, named demonstration-conditioned RL (DCRL), overcomes the limitation. DCRL requires interactions with environments in training. But, it solves FSIL tasks without fine-tuning in testing. Because DCRL is a DC policy method, it needs expert demonstrations from training environments to achieve fine-tuning-free. They store the tuples of (playout history, rewards, demonstrations) into the replay buffer to train the policy. The policy is a transformer-based architecture that its encoder generates task embeddings using cross-demonstration attention and the decoder predicts the actions. Moreover, the concept of ”demonstration conditioned” echoes our motivations. But, designing all reward functions in training environments is quite time-consuming. Thus, we only compare SCAN with BC-based methods.

Compound Task.

A compound task consists of multi-stage subtasks. The main challenge of compound tasks is that there is no signal (label) in demonstrations to identify each subtask when testing. Nevertheless, the policies for distinct subtasks are pretty different. Hence, it becomes impractical to use a single policy to solve compound tasks.

An intuitive solution is to link the relationship between subtask and primitive. The term primitive, which comes from the robotics field FLASH2005660; MANSCHITZ201597, represents single elementary movements and is widely used in compound task problems. To be specific, each subtask may corresponds to a single robot motion (primitive), such as pushing and grasping. Therefore, previous RL zeng2018learning; marzari2021hierarchical and IL works Yu2019; lee2019composing; pmlr-v119-lee20f assume these primitives are known before testing since the end effector (gripper) are only capable for these primitives. Then, they develop a hierarchical structure that trains policies for each primitive separately and use a high-level control network to decide which policy should be executed with the current observation. Phase predictors Yu2019 or task transition models lee2019composing; pmlr-v119-lee20f are proposed to identify whether a primitive is done. Unfortunately, most of these components cannot be trained from scratch with actor networks due to the hierarchical architecture. Instead of recognizing primitives with an additional model, SCAN detects which primitive the current observation locates by an attention module. Unlike above methods, SCAN does not need to know the primitives beforehand, and all its components can be trained end-to-end.

Few-Shot Imitation Learning (FSIL)

To emphasize, we treat FSIL problem as imitation learning under FSL setting. In previous works, a policy is (trained or) tested over several tasks that are essentially the same but with different objects in the environments. Thus, we use the base environments and novel environments in problem statement. Notations are listed in Table 1.

Problem Statement

A few-shot imitation learning (FSIL) problem is given by a base environment set and a novel environment set , where . Each environment in or contains a set of expert demonstrations (support set) and a set of agent playouts (query set) . In addition, samples in or are in the same environment with distinct initial/end states. Moreover, the expert could be humans or other robot arms compared to the agent in playouts.

A policy with parameter is meta-trained in from base environments and meta-tested in from novel environments . The policy needs to generate an action when receiving a current state from the playout . Then, the success rates of playout executions in all are used as performance evaluation. Thus, the playout samples in only contain the initial state , and the following states are provided according to the action taken by the policy. At last, the objective of FSIL problem is to maximize the performance expectation of the policy where the expectation is taken over . Furthermore, the policy can only fine-tune its parameters using the demonstrations in (if needed), which means no interaction with novel environments are allowed before performance evaluation.

Figure 2: Architecture of stage conscious attention network (SCAN).

Given a set of length-variant demonstrations (images only) and current playout data (images, end-effector (EE) cropped images, EE vector), SCAN uses a shared-weight

to extract feature embeddings (FEs) from demonstrations and from current playout. Meanwhile, another generates FEs of EE cropped images . Then, are passed into a bidirectional LSTM (BiLSTM, encoder) to generate that contains time information. In addition, the attention maps are the results of matrix multiplications of and a learnable matrix and each . Afterwards, attention maps matrix multiplies by each to get contexts. Until now, length-various demonstrations are projected to same shape contexts. The task embedding (mean of contexts) contains the knowledge that each playout state focuses on. We named this process as
stage conscious attention (SCA). At last, , , ee vector and task-embeddings are fed into ActorNet (decoder) for predicting actions.
Figure 3: Architecture of ActorNet.

In ActorNet, an action head concatenates four input embeddings and adds a 1D positional encoding. Following by the several dense layers,the action head computes positions and probabilities of open/close control. In the meantime, an inverse dynamics model concatenates time

and and as inputs and predict the actions using a similar architecture of action head. The inverse dynamics model helps the action head know the precise actions that cause changes between frames.
(a) pick & place (PP)
(b) pick & place & push (PPP)
Figure 4: Environments of PP and PPP task . PP and PPP task are compound tasks that contains two and three stages, respectively. Objects are unseen during training.
notation meaning
frame, crop frame
sequence of frames
sequence of crop frames
sequence of vector
sequence of state
sequence of action
set of
set of
: base, : novel
Table 1: Defined notations


We introduce the stage conscious attention networks (SCAN) in details. SCAN needs following inputs: expert demonstrations (without actions) and playout states that contains frames , end-effector (EE) cropped frames , and EE vector (position and open amount). These inputs are widely used in FSIL works. Moreover, we apply inverse kinematics to compute joint parameters of the agent. Therefore, SCAN only needs to compute two outputs for each playout state: target positions (x, y, z in continuous space) and the probabilities for EE open/closing. SCAN is composed of three main components, including visual heads, stage conscious attention, and an ActorNet. We describe these components in following paragraphs, and the overall architecture of SCAN is shown in Figure 2.

Visual Head.

The objective of visual head is to retrieve meaningful embeddings of RGB-D frames either from or . We leverage two resnet18 to extract RGB images and depth images separately, inspired by Shao2020ObjectDO. Moreover, since SCAN does not have object detection components, we insert a modified self-attention module at the head and tail of resnet18. The self-attention module is originally proposed in zhang2019selfattention

, we make the output dimension of the query-layer and key-layer the same as input dimensions and add the channel-wise softmax layer behind them. Detailed architecture is shown in supplementary.

Then, given a 4D frame inputs (sequence length for or for , channel , height , width ), the visual head splits it into RGB and depth images and use the corresponding resnet18 to generate feature embeddings (FEs). These two FEs are concatenated and passed over a dense layer to obtain the output with shape ( or , 128). As shown in Figure 2, a shared-weights extracts from and each from . Another extracts from .

Stage Conscious Attention (SCA).

We assume that each frames in each demonstration (support sample) has a different importance to each frame in playout (query sample). Unlike the cross-demonstration attention (temporal-awareness attention) in dance21, SCAN leverages a stage conscious attention that lets each frame of current playout has its own interpretation of each demonstration. The overall structure is shown in Fig. 2.

We feed each into a bidirectional LSTM (BiLSTM) to obtain that contains temporal information. Then, the general form of global attention motivated by luong2015effective is applied to generate an attention map between and FEs of playout . To be specific, is the result of matrix multiplied of and a learnable matrix and . The makes sure the has a same latent size of , which is useful when dealing with two different shape matrix. Then, a layer makes each row (attention scores at each frame in the demonstration given by each frame in current playout) of attention map are summed to 1. Next, the context is the result of matrix multiplication of and each . At last, the task-embedding is the mean of contexts .


ActorNet predicts actions in continuous space rather than discrete spaces. It allows the agent to perform more precise actions. The overall picture of ActorNet is shown in Figure 3. In ActorNet, there are two components, action head and inverse dynamics model. The action head concatenates four inputs (, , EE vector , ) and add 1D positional encoding to provide auxiliary time information. Then the action head predicts target positions and probability of open/close for each state in current playout (history), we denote the output as .

The is a vector of pairs for the time state in the playout history. The

is a vector which contains means of three uni-variate Gaussian distributions that generate positions at time

. We use an additional learnable vector

for the standard deviation of the distributions. Besides, the

is the probability of open/close control.

Afterwards, the inverse dynamic model concatenates time , and time , as input , and use the similar architecture of action head to predict actions for each state in playout history (except for the latest frame). The inverse dynamic model aims to help the action head know how actions cause frame changes.

To train SCAN, we use negative-log-likelihood (NLL) as loss functions.

calculates the NLL loss for the output of the predicted positions. Besides, could be or . And the is the vector of labeled actions.

We use the following loss for open/close control. The is true probability of open the gripper. And, indicates action head or inverse dynamics model .

The total loss is the weighted sum of all losses, and , are the hyper-parameters.


The goal of our experiments is to verify following assumptions: (1) the novel stage conscious attention has the ability to locate each primitive (stage) in length-variant demonstrations and highlight important frames for each playout frame. (2) SCAN can learn a relationship mapping between different types of experts and agent. (3) Based on above assumptions, SCAN can retrieve knowledge from few demonstrations simultaneously and get a boosted performance rather than separately handling each demonstration.

Experiment Settings.

We have two main experiments. (1) we evaluate all methods on two compound tasks, pick & place (PP) and pick & place & push (PPP), as shown in Figure 4. In PP task (2 stages), the agent needs to pick the cube and place it in the target bowl. Another bowl serves as the disruptor. Next, in PPP task (3 stages), the agent needs to push the sky-blue cup off the table after placing the cube. Note that the target bowl might be the front one or rear one. This means the directions of the trajectory are quite different, which makes our experiment challenging. Moreover, we follow the evaluation protocol in FSL problem. There are 56 novel environments composed of unseen objects. For each environment, we let methods play 20 times with different initial scenes and calculate the success rate. The average of successful rate and standard deviation in all novel environments are provided in Table 2. This experiment aims to evaluate whether SCAN can locate important frames and achieve dominating performance. (2) We design a extremely length-biased case to observe the robustness of methods when encountering sub-optimal demonstrations (still complete the task but with trivial moves) that have not been processed. To be clear, all methods are trained with optimal demonstrations. Then, we give few sub-optimal demonstrations in testing to analyze the relation between attention mechanism and performance changes, as shown in Figure 7 and 8. For all experiments, we build environments in CoppeliaSim and use the pyrep james2019pyrep toolkit to communicate with environments. We use Panda arm as the agent, and experts may be Panda arm or UR5.

Type Models Fine-tune PP task PPP task
1-shot 5-shot 1-shot 5-shot
same BC 01.70% 03.31% 03.30% 4.04%
meta-BC 12.86% 12.74% 28.84% 09.82% 05.54% 09.76% 28.57% 12.67%
TaskEmb James2018 35.09% 34.99% 34.11% 34.50% 17.77% 18.47% 18.39% 19.66%
54.20% 25.74% 83.39% 13.47% 47.23% 18.85% 58.39% 18.49%
TANet (ours) 28.75% 14.24% 40.75% 12.37% 46.43% 21.69% 48.13% 18.79%
53.93% 20.54% 64.82% 15.84% 52.05% 21.04% 68.57% 14.16%
SCAN (ours) 67.05% 21.31% 75.45% 17.66% 46.34% 13.18% 47.86% 13.69%
64.64% 21.50% 85.00% 10.69% 55.80% 20.24% 58.48% 22.56%
differ meta-BC 06.52% 09.95% 05.80% 08.17% 00.00% 00.00% 00.00% 00.00%
TaskEmb James2018 18.04% 09.81% 18.66% 09.61% 12.32% 14.70% 12.23% 15.12%
TANet (ours) 42.50% 14.82% 47.95% 14.63% 24.02% 26.30% 25.54% 27.20%
SCAN (ours) 60.89% 14.70% 65.27% 13.74% 31.52% 10.60% 32.41% 11.38%
Table 2: Success rate on compound tasks. The average of success rate and standard deviation in all novel environments are provided. The type column represents whether the expert and the agent are the same or not. In the same expert setting, model can fine-tune when expert actions are provided. From the table, we have three main findings. (1) the 5-shot performance usually outperforms the 1-shot performance. (2) fine-tuning is helpful in most cases, but it might let model overfit on the demonstration in one-shot setting. (3) SCAN has the best adaptation ability and performance except for the case of same expert in PPP task.

Compared Baselines.

Baselines are introduced below. All methods use our visual head and ActorNet for fair comparisons. Only the parts that handle few demonstrations are implemented. (1) BC: a conventional BC model takes states as input, no task-embedding generated. (2) meta-BC: a conventional BC is trained via MAML Finn17MAML in same expert setting and trained via DAML Yu2018 in different expert setting. (3) TaskEmb: a DC method James2018 averages concatenated embeddings of first and last frames as task-embeddings.(4) TANet: our implemented DC method that averages the output of cross-demonstration attention (at each timestamp) and apply the global attention to get the task-embeddings. The key idea of TANet is similar to the method in dance21, however, it is hard to build a transformer-based model with our visual head and ActorNet. Therefore, we design the TANet to evaluate the effectiveness of cross-demonstration attention.

Performance Comparison

We analyze the performance results of experiment 1 in Table 2. The success rate and standard deviation is the average of 56 novel environments. We have several observations from the results. (1) Except for TaskEmb, methods achieve better performance in 5-shot setting rather than 1-shot setting. TaskEmb only uses frame features when generating task-embedding, and there is no other conversion process. Therefore, it is susceptible to frame features from novel environments. Without fine-tuning, TaskEmb performance of 1-shot and 5-shot is not much different. (2) The performance of DC methods has been dramatically improved after fine-tuning. But SCAN has slightly worse performance in PP task under the 1-shot setting. We infer that using only one demonstration to fine-tune may let models overfit. Therefore, the generalization of models is reduced. (3) SCAN has the best adaptability in most cases, regardless of whether the expert is the same as the agent. We claim that the proposed SCA learns the mapping from demonstration to playout. And, the learned mapping can provide enough information for SCAN to behave in a novel environment even without fine-tuning.

Furthermore, we also observe an interesting phenomenon. TANet has trouble in PP task, even with fine-tuning. Because the time misalignment between demonstrations in PP tasks is serious, averaging information at each timestamp like TANet causes the task to be incomprehensible. Our SCAN identifies the location of critical frames in demonstrations. Therefore, SCAN can learn and behave smoothly in PP task. However, TANet outperforms SCAN in the PPP task under the same expert setting. We infer two possible reasons regarding this phenomenon: (1) Although demonstrations of PPP task have longer lengths and more steps, their time misalignment are not severe. (2) Bowls are at the front of the agent, and many bowls have similar colors to the agent in novel environments, which interferes with SCAN that needs to pay attention to each frame.

(a) PP task (2 primitives)
(b) PPP task (3 primitives)
Figure 5: Attention maps of SCAN on compound tasks. Each row and column represent the frame from the playout and the demonstration. Besides, a lighter cell has a higher score, and we mark each primitive with the same color. The attention results show that SCAN can focus on corresponding frames (beginning of same stage) in the demonstration when executing each stage (in both tasks).
(a) PP task (2 primitives)
(b) PPP task (3 primitives)
Figure 6: t-SNE of SCAN contexts. The X, Y axes are the projection of t-SNE, and the T-axis represents the timestamp. A context at time (datapoint at ) is generated by time playout frame and a demonstration. Besides, contexts from different demonstrations are tagged with distinct marker. We illustrate how the relationship between contexts changes over time. Since initial states of demonstrations are various, the contexts are diverse. Surprisingly, contexts aggregate after primitives are executed, which implies that the contexts indeed contain the task-related knowledge.

Effectiveness of Stage Conscious Attention (SCA)

Attention Result in Compound Tasks.

To verify whether SCA can detect crucial frames and generate informative task-embedding, Figure 5 and Figure 6 visualize attention maps and contexts generated by SCA in the compound tasks of experiment 1. To emphasize, all visualizations are under the setting of 5-shot and different experts. In Figure 5, attention scores of the stage where the agent is in progress focus on the same stage in demonstrations, whether the compound task has 2 or 3 stages. Notably, SCA locate critical frames in novel environments without fine-tuning. Furthermore, we do not use any hard restriction or loss function to guide SCA. It learns the ability proactively. On the other hand, Figure 6 illustrates t-SNE vandermaaten08a results of contexts and task-embeddings in the playout of Figure 5. The contexts come from different demonstration are tagged with distinct marker. In addition, we highlight stage locations in t-SNE results. At the beginning, generated contexts and task-embeddings are diverse since the initial states of demonstrations are various. Specially, contexts aggregate when each stage is over. The rest of contexts are generated during moving, such as moving to target cubes or bowls. It is impressive that contexts have these cluster attributes.

Robustness to Sub-optimal Demonstrations.

Figure 7 and Figure 8 demonstrate the result of experiment 2. We choose one from 56 novel environments as the target environment and generate the sub-optimal demonstrations. Experts still completed the task in sub-optimal demonstrations but with trivial movements. In other words, the demonstration set is extremely length-biased, which all methods have never encountered during training. Then, we run all methods 20 times for the performance evaluation. Attention maps of SCAN and TANet baseline in this experiment are shown in Figure 7. It is the comparison with zero or three sub-optimal demonstrations in the demonstration set. Both methods work well when there are no sub-optimal demonstrations (bottom). However, TANet cannot handle the case that there are three sub-optimal demonstrations (top). Because the time misalignment is severe in the case, TANet is hard to retrieve knowledge by averaging at each timestamp of demonstrations, which can be reflected in success rates of Figure 8.

In Figure 8, TANet performs poorly when the number of sub-optimal demonstrations increases. Our SCAN is also affected by sub-optimal demonstrations, but it did not cause such a big reduction in performance. Moreover, TaskEmb only focuses on first and last frames of demonstrations, and thus, sub-optimal demonstrations would not affect its performance. However, first and last frames cannot provide efficient information when solving compound tasks, TaskEmb has a lower performance compared to other methods.

Figure 7: Attention in extremely length-biased case. Methods are trained with optimal demonstrations, but the demonstration set contains sub-optimal demonstrations (with trivial moves) in testing. When there are all optimal demonstrations, both TANet and SCAN can find important frames. However, the attention result of TANet becomes a mess when there are 3 sub-optimal demonstrations. Because TANet averages attention by each timestamp, the information are mixed and hard to retrieve. By contrast, SCAN generates attention to each demonstration individually and then averages the results. Since unimportant frames has been filtered out, SCAN can still extract important information.
Figure 8: Model robustness in extremely length-biased case. We analyze the relation between success rate and the number of sub-optimal demonstrations. TANet cannot handle sub-optimal demonstrations since they average attention by timestamp. Its performance dramatically decreases while the number of sub-optimal demonstrations increases. In the contrary, our SCAN is more robust to the demonstrations with extremely different length. TaskEmb only considers the first and last frame, therefore, it does not matter in this case.


In this work, we conduct the FSIL problem under three challenging settings, including compound tasks, few length-variant demonstrations and learning from a different expert. Meanwhile, we found that most of works can only handle one demonstration at once or need external loss to learn from different experts. Hence, we propose a novel SCAN method that can retrieve knowledge from few demonstrations simultaneously and behave in novel environments without fine-tuning. Our stage conscious attention locates critical frames for each playout frame to alleviate the demonstration misalignment problem. Explainable visualization and outstanding performance illustrates the effectiveness of SCAN.


This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 110-2634-F-002-051 and Qualcomm Technologies, Inc. We are grateful to the National Center for High-performance Computing.