Video2Skill: Adapting Events in Demonstration Videos to Skills in an Environment using Cyclic MDP Homomorphisms

09/08/2021 ∙ by Sumedh A. Sontakke, et al. ∙ University of Southern California

Humans excel at learning long-horizon tasks from demonstrations augmented with textual commentary, as evidenced by the burgeoning popularity of tutorial videos online. Intuitively, this capability can be separated into two distinct subtasks: first, dividing a long-horizon demonstration sequence into semantically meaningful events; second, adapting such events into meaningful behaviors in one's own environment. Here, we present Video2Skill (V2S), which attempts to extend this capability to artificial agents by allowing a robot arm to learn from human cooking videos. We first use sequence-to-sequence auto-encoder-style architectures to learn a temporal latent space for events in long-horizon demonstrations. We then transfer these representations to the robotic target domain, using a small amount of offline and unrelated interaction data (sequences of state-action pairs of the robot arm controlled by an expert) to adapt these events into actionable representations, i.e., skills. Through experiments, we demonstrate that our approach results in self-supervised analogy learning, where the agent learns to draw analogies between motions in human demonstration data and behaviors in the robotic environment. We also demonstrate the efficacy of our approach on model learning, showing how Video2Skill utilizes prior knowledge from human demonstrations to outperform traditional model learning of long-horizon dynamics. Finally, we demonstrate the utility of our approach for non-tabula-rasa decision-making, i.e., utilizing video demonstrations for zero-shot skill generation.




1 Introduction and Related Work

Offline reinforcement learning has been of substantial interest to the community, with continued efforts to teach RL agents to perform tasks simply from a corpus of expert demonstration data (levine2020offline; kidambi2020morel; agarwal2020optimistic). While offline RL holds promise, it is challenging because it requires making counterfactual queries under distributional shift (i.e., the agent cannot explore the effects of hypothetical action sequences not present in the training data; levine2020offline). Additionally, current offline RL formulations make two constraining assumptions. First, they require domain coincidence, i.e., that the state and action spaces of the demonstrations and of the downstream agent being trained coincide. This can be restrictive, especially in domains such as robotics, where expert demonstrations in the same domain may not be available. Consider, for example, attempting to teach a robot to perform medical surgery. In such a scenario, current offline RL formulations would fail, as they would require a dataset of demonstrations from an expert robot performing surgery. However, generating such an expert agent using vanilla RL in a safety-critical application would be disastrous due to exploration: a classic chicken-or-egg conundrum. Instead, it is more likely that demonstrations from a human expert would be available, and thus we need new methods that can exploit them.

Second, offline RL assumes task coincidence, i.e., it attempts to train agents to perform the same tasks as those in the demonstration dataset; e.g., a demonstration of a robotic manipulation task, say grasping, will result in a policy that enables an agent to grasp. Combined, these assumptions mean that offline RL assumes that the MDP in which the expert demonstrates its behavior and the MDP in which the downstream agent is trained to behave are the same.

Hence, these assumptions make offline RL difficult to apply to practical robotic scenarios. In this work, we attempt to relax these assumptions. We utilize the large corpus of online human video tutorials of complex long-horizon tasks to teach a robotic agent to perform semantically meaningful behaviors in its own environment. Inspired by zhu2017unpaired, our work attempts to learn adaptable short-horizon motion representations, called events, using domain randomization on human demonstrations. We then utilize a small amount of environment-specific demonstration data to adapt this latent space for domain-specific behavior.

We improve upon the state-of-the-art in the following ways:

Unsupervised Event Representation Learning: Learning temporal representations from demonstrations (chen2019towards, boggust2019grounding, tosi2020distilled) (e.g., skill learning, event detection, etc.) typically requires large datasets of demonstrations, with expensive human annotations for the timestamps corresponding to each event. V2S, by contrast, learns event representations without temporal supervision, i.e., it divides long-horizon trajectories into semantically meaningful subsequences without access to any temporal annotations that split these trajectories.

Domain and Task Invariance: V2S abstracts events from demonstrations of a variety of cooking tasks. Additionally, these videos originate from a number of sources, varying in camera angles, instructional styles, etc. Thus, through domain randomization, our architectures generate domain-invariant event representations. Unsupervised skill learning typically makes more restrictive assumptions (Shankar2020Discovering, eysenbach2018diversity, sharma2019dynamics), requiring that demonstration data originate from a single domain with the same state and action spaces.

Offline and Reward-free Skill Learning: Unsupervised skill discovery (eysenbach2018diversity; sharma2019dynamics; xu2018neural; huang2019neural) also typically requires costly interactions with an environment to discover skill sequences. Such assumptions can be infeasible in domains such as healthcare, where active exploration may be not only impossible but potentially dangerous. eysenbach2018diversity learn a large number of low-level sequences of actions by enforcing that the corpus of acquired skills is diverse. Similarly, sharma2019dynamics attempt to learn skills such that, under a skill, subsequent transitions are almost deterministic in a given environment. V2S first discovers event representations from freely available human demonstration data, and subsequently adapts them to learn environment-specific skills.

Long Horizon Learning from Demonstration: Long-horizon tasks remain the bane of decision-making algorithms, especially in the offline-learning scheme, due to an aggregation of sub-optimal behaviors over a horizon (nicolescu2003natural). Imitation learning (esmaili1995behavioural, atkeson1997robot, schaal1997learning, pastor2009learning, peters2013towards, niekum2012learning) has shown how agents can learn simple tasks from demonstrations. More recently, schmeckpeper2019learning show that agents can learn to maximize external reward using a large corpus of observation data, i.e., trajectories of states, and a relatively smaller corpus of interaction data, i.e., trajectories of state-action pairs. However, such approaches are restricted to short horizons, while V2S is able to generate skills for long-horizon tasks like cooking.

Multi-modal World Models: V2S learns representations for events occurring in free-flowing tutorial videos, utilizing both the textual and visual inputs that are available in typical human demonstrations. Once adapted to a specific environment, it describes the model of that environment. We show that these World Models outperform typical model learning methods (higuera2018synthesizing; nagabandi2018neural; lakshminarayanan2016simple; chua2018deep) on multi-step prediction.

Incorporating Prior Knowledge into Decision Making: We propose an adaptation-based method to incorporate prior knowledge into decision making, both for model-based and model-free RL. We pre-train a Backbone network on real-world cooking data and subsequently learn environment-specific adapter functions to model the dynamics of a kitchen environment. We show how the pre-training aids efficient dynamics learning and yields semantically meaningful representations.

2 Methods

Tutorial videos contain informative demonstrations of complex real-world tasks. These consist of humans acting as expert agents in an abstract Markov Decision Process (MDP). While we do not have access to the state and action spaces of such an abstract MDP, we do, however, have access to proxies for them through the video frames and textual commentary in the tutorial video, i.e.,

state : action :: video frame : commentary.

These video demonstrations consist of several events which are described in words and viewed through short sequences of video frames. We utilize such real-world human demonstration data to learn environment-agnostic event representations. We do this using domain randomization, by training a multi-modal temporal auto-encoder-style architecture (called the Backbone network) on human cooking demonstrations consisting of a variety of cooking recipes and tasks. Additionally, our data comes from many sources, with a variety of camera angles, lighting, etc. The temporal autoencoder thus generates domain- and task-independent embeddings for sequences of videos and words. Thus, given a human demonstration of, say, poaching eggs, our architecture can isolate semantically meaningful subsequences, like cracking an egg, pouring water, etc. These event representations encode both a sequence of observations in the domain of the human cooking videos and the associated "action sequences" in the form of textual tutorial commentary.

We then utilize a small amount of demonstration data in the environment of a real robotic agent. This data consists of sequences of states and actions of an expert robot demonstrating related tasks in the environment; for example, the robot arm opening cabinet doors, etc., but not cooking. We then force the representations of these robotic demonstrations to lie in the same space as those of the human cooking demonstrations. This is achieved using a pair of MDP homomorphisms which map the robotic MDP to the human abstract MDP and vice versa. The homomorphisms are learnt in a cyclical manner, thereby requiring no supervision during training. Using the homomorphisms, we can later translate cooking events (e.g., cracking an egg) into the target robotic space and vice versa, resulting in zero-shot skill generation. Thus, state : action : skill :: video frame : commentary : event.

2.1 Event Representation Learning from Demonstration Videos

Figure 1: Backbone Network. V2S contains a backbone temporal autoencoder which learns a semantically-meaningful embedding space, encoding events that occur in natural free-flowing tutorial videos, without explicit temporal supervision for event start and end timestamps. Section 2 contains details of components.

Intuitively, we define an event as a short sequence of states which may occur repeatedly across several demonstration trajectories. Events have an upper limit on their length in time steps. They can be obtained both from a sequence of demonstration images (video data) and from the associated textual description. Given an event representation, an associated sequence (of words or images) can be obtained using a decoder:


where the decoded variable may correspond to the flattened embedding of words or images, and the decoder output is a Gaussian distribution (assumed prior) with parameters generated by the neural network. Thus, the resulting joint model mapped over trajectories factorizes as:


The encoding and decoding functions and the transition function are approximated by sequence-to-sequence models (in this case transformers; vaswani2017attention).
Encoding: An input sequence of video frames is downsampled to 200 frames. The visual encoder generates a sequence of event representations from these frames. Similarly, textual events are generated from the commentary using seq2seq transformer models.
Decoding: We decode in a cross-modal manner, where the events abstracted from the visual domain are used to regenerate the textual description and vice versa. In what follows, prime notation refers to a regenerated value. Thus, the visual events are used to regenerate words using the textual decoder, and the textual events are used to regenerate the demonstration frame embeddings using the visual decoder.
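The downsampling step in the encoding stage can be sketched as follows. This is a minimal illustration that assumes uniform temporal sampling, which the text does not specify; the function names and the 8-dimensional stand-in features are hypothetical.

```python
import numpy as np

def downsample_indices(num_frames: int, target: int = 200) -> np.ndarray:
    """Return `target` frame indices spread evenly over a clip.

    Uniform spacing is an assumption made for illustration only.
    """
    if num_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    # linspace covers the first and last frame; rounding yields valid integer indices.
    return np.round(np.linspace(0, num_frames - 1, target)).astype(int)

def downsample(frames: np.ndarray, target: int = 200) -> np.ndarray:
    """Select `target` frames (rows) from a (num_frames, feature_dim) array."""
    return frames[downsample_indices(len(frames), target)]

# Hypothetical clip: 4500 frames, each already reduced to an 8-dimensional feature.
clip = np.random.default_rng(0).normal(size=(4500, 8))
short_clip = downsample(clip)
indices = downsample_indices(len(clip))
```

Clips shorter than the target length would repeat frames under this scheme; padding is an equally plausible alternative.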

Learning Objective: We emphasize that we do not require supervision for temporal segmentation, i.e., we do not require annotations which demarcate the beginning and end of an event, either in language or in the space of video frame timestamps. Our approach uses several loss terms between network outputs to achieve our objective. Soft-DTW (cuturi2017soft) is used to compute the match between two sequences of varying length. It is calculated between several pairs of sequences to generate the pre-training loss:


We posit that this loss function provides the inductive bias necessary for learning the event latent space. One term ensures reconstruction of demonstration frames from the textual events, another ensures the generation of the textual description from visual events, and a further term aligns the textual and visual event spaces.
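For orientation, the soft-DTW recursion of cuturi2017soft and one way the three terms described above could be combined are sketched below. The symbols ($v$, $w$ for frame and word embeddings, primes for regenerated values, $e^{v}$, $e^{w}$ for visual and textual events) and the unweighted sum are assumptions made for illustration, not necessarily the exact objective used here.

```latex
\begin{align}
\operatorname{min}^{\gamma}(a_1, \dots, a_k) &= -\gamma \log \sum_{i=1}^{k} e^{-a_i / \gamma}, \\
r_{i,j} &= \delta(x_i, y_j) + \operatorname{min}^{\gamma}\bigl(r_{i-1,j-1},\; r_{i-1,j},\; r_{i,j-1}\bigr),
\qquad \text{soft-DTW}_{\gamma}(x, y) = r_{n,m}, \\
\mathcal{L}_{\text{pre}} &\approx
\underbrace{\text{soft-DTW}_{\gamma}(v, v')}_{\text{frames from textual events}}
+ \underbrace{\text{soft-DTW}_{\gamma}(w, w')}_{\text{words from visual events}}
+ \underbrace{\text{soft-DTW}_{\gamma}(e^{v}, e^{w})}_{\text{event alignment}}.
\end{align}
```

Here $\delta$ is a pointwise cost (for example squared Euclidean distance) and $n$, $m$ are the lengths of the two sequences.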

2.2 Skill Learning using Cyclical Homomorphisms

Figure 2: Distillation. We freeze the weights of the Backbone network and learn MDP homomorphisms from the MDP of the robotic kitchen domain to the abstract MDP of human demonstrations, and vice versa, in a self-supervised manner. The latent space simultaneously contains event representations from human cooking videos and skills from the robotic domain.

After pre-training on cooking videos of human demonstrations using Eq. 8, the weights of the visual and textual encoders and decoders are frozen. Subsequently, offline demonstration data (i.e., sequences of states and actions) in the robotic domain is used. This data consists of demonstrations by an expert robot performing tasks in the environment. In our case, the robot demonstration data shows how to open a microwave oven, open cabinets, turn on the light, etc. Adapter functions are then learnt which map demonstration trajectories of states and actions from a trained robot performing these tasks in the environment onto the same space of video and word embeddings as used for human cooking.

2.2.1 Skills and Environment Dynamics

As in keele1968movement, we define a skill as a sequence of actions that may be executed in and of itself, without sensory feedback. Additionally, when a skill is applied to an environment, it results in a sequence of transitions that uniquely identifies the skill. For example, a skill which lifts an object in an environment is identified by both the sequence of actions applied by the agent and the resultant sequence of transitions in the environment. Thus, a latent vector contextualizes both the policy and the subsequent model:


Thus, the models over trajectories of states and actions in demonstration data factorize as:
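One reading of this factorization, consistent with the definition of a skill as an open-loop action sequence and with a Markov state model, is sketched below. The symbols ($z$ for the latent skill vector, $\pi$ for the policy, $\mathcal{T}$ for the model) are assumed notation and may differ from the original equations.

```latex
\begin{align}
p(a_{1:T} \mid z) &= \prod_{t=1}^{T} \pi\bigl(a_t \mid a_{<t},\, z\bigr), \\
p(s_{2:T} \mid s_1, z) &= \prod_{t=1}^{T-1} \mathcal{T}\bigl(s_{t+1} \mid s_t,\, z\bigr).
\end{align}
```

Under this reading, the action model conditions on the action history and the skill, while the state model needs only the current state and the skill.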


2.2.2 Cyclical Homomorphisms

Figure 3: Cyclical Homomorphisms. During pre-training, V2S learns an embedding space which encodes events occurring in an abstract MDP in which the human demonstrator behaves. This is done using video frames and textual commentary which serve as proxies for states and actions. Subsequently, a pair of homomorphisms map the robotic MDP into and out of the abstract MDP. These are learnt using a reconstruction loss.

Upon inspection, one can see that the models in Eq. 2 and in Eq. 5 share the same structure, i.e., knowledge of the latent representation and of the history of a sequence determines the transition (with the exception of the state models in Eq. 5, where, due to the Markov assumption, the current state is sufficient rather than the full history). Consider a pair of MDPs: first, the MDP of the robotic domain, with its own state space and action space; and second, the abstract MDP, whose state and action spaces are obtained from the spaces of video frames and words through unknown mappings from videos and text to state and action representations (in what follows, knowing their exact form is not necessary). We exploit the shared structure between these MDPs to learn a pair of MDP homomorphisms: one from the robotic MDP to the abstract MDP, and one in the reverse direction. As defined in ravindran2004approximate:

Definition 1 (MDP Homomorphisms)

A deterministic MDP homomorphism from one MDP to another is a tuple of functions, with:

  • the state embedding function, and

  • the action embedding function

such that the following identities hold:
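In the usual deterministic formulation, the identities require the embedding to commute with the dynamics and to preserve reward. The notation below is assumed for illustration: $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R)$ is the source MDP, bars mark the target MDP, $\sigma$ is the state embedding, and $\alpha_s$ is the state-dependent action embedding.

```latex
\begin{align}
\sigma : \mathcal{S} \to \bar{\mathcal{S}}, \qquad \alpha_s &: \mathcal{A} \to \bar{\mathcal{A}}, \\
\sigma\bigl(T(s, a)\bigr) &= \bar{T}\bigl(\sigma(s),\, \alpha_s(a)\bigr)
\qquad \forall\, s \in \mathcal{S},\ a \in \mathcal{A}, \\
\bar{R}\bigl(\sigma(s),\, \alpha_s(a)\bigr) &= R(s, a)
\qquad \forall\, s \in \mathcal{S},\ a \in \mathcal{A}.
\end{align}
```

Since the setting here is reward-free, the transition identity is the one of practical interest.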


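Thus, we learn a pair of MDP homomorphisms, one in each direction, such that the discrepancy between the original MDP and its image under the round trip through both homomorphisms is minimized, as measured by a suitable distance metric in the space of MDPs.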

2.2.3 Cyclical Homomorphic Objective

We learn each of the homomorphisms by learning the state and action embedding functions separately. This is done by freezing the weights of the encoders and decoders in the backbone network pre-trained on the human demonstrations. Subsequently, the state and action embedding functions of the forward and inverse homomorphisms are learnt such that the forward homomorphism maps sequences of states and actions from demonstrations in the kitchen environment into the spaces of video frames and words, respectively. Next, the pre-trained encoders generate the latent skill vectors for these trajectories. The skill vectors are fed into the pre-trained decoders to regenerate the sequences in the spaces of words and video frames. Finally, these sequences are fed back into the inverse MDP homomorphism to regenerate the original input sequence (Fig. 3).
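The cycle described above can be illustrated with a toy example. This is a sketch under strong simplifying assumptions: the frozen backbone is replaced by a fixed linear encode-then-decode map, the homomorphisms are linear matrices `F` and `G`, only states (not actions) are shown, and all sizes and names are hypothetical. It is not the architecture used here, which relies on transformers.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FRAME_DIM, LATENT_DIM = 3, 8, 4  # hypothetical sizes

# Frozen stand-in backbone: encode frames to a latent, then decode back.
W_enc = rng.normal(size=(FRAME_DIM, LATENT_DIM))
W_dec = np.linalg.pinv(W_enc)
P = W_enc @ W_dec  # frozen encode-then-decode map in frame space (never updated)

states = rng.normal(size=(64, STATE_DIM))           # toy robot states
F = 0.1 * rng.normal(size=(STATE_DIM, FRAME_DIM))   # forward homomorphism (learnt)
G = 0.1 * rng.normal(size=(FRAME_DIM, STATE_DIM))   # inverse homomorphism (learnt)

def cycle_loss(F, G):
    """Mean squared error between states and their round-trip reconstruction."""
    recon = states @ F @ P @ G
    return float(np.mean((recon - states) ** 2))

initial_loss = cycle_loss(F, G)
lr = 0.1
for _ in range(1000):
    hidden = states @ F @ P                          # states -> frame space -> backbone
    grad_out = 2.0 * (hidden @ G - states) / states.size
    grad_G = hidden.T @ grad_out                     # gradient w.r.t. inverse map
    grad_F = states.T @ grad_out @ (P @ G).T         # gradient w.r.t. forward map
    F -= lr * grad_F
    G -= lr * grad_G
final_loss = cycle_loss(F, G)
```

Only `F` and `G` receive gradient updates, mirroring the frozen backbone; the reconstruction loss is the sole training signal, so no paired supervision across domains is needed.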


3 Experiments

Our architecture is versatile, in that it results in the simultaneous learning of a latent conditioned dynamics model of the environment (green arrows in Fig. 2) and a latent conditioned policy or skill network (yellow arrows in Fig. 2). Through experiments, we present three main thrusts - representation learning, dynamics learning and skill learning. We study the utility of our approach in learning adaptable event representations. We show that our work results in unsupervised analogy learning of motion sequences. Subsequently, we study the ability of our latent conditioned model to utilize prior knowledge to quickly learn a long-horizon model of its environment, outperforming several state-of-the-art sophisticated baselines. Finally, we study the ability of V2S to generate skills simply from human demonstration. We show that our agent performs complex motion behaviors in the robot kitchen environment akin to stirring, grasping, pouring, etc., which were demonstrated by humans in the cooking videos.

We train our backbone network on the YouCook2 dataset (ZhXuCoCVPR18), which comprises instructional videos for 89 unique recipes (22 videos per recipe) and contains labels that separate the long-horizon demonstration trajectories into events, with explicit timestamps for the beginning and end of each event along with the associated commentary. Subsequently, we train the cyclical homomorphic objective on demonstrations from the D4RL dataset (fu2020d4rl) of the Franka Kitchen environment (gupta2019relay). The goal in the Franka Kitchen environment is to interact with various objects to reach a desired state configuration. The agent can move a kettle, flip a light switch, open and close a microwave and cabinet doors, or slide another cabinet door. The desired goal configuration for all 3 tasks is to complete 4 subtasks: open the microwave, move the kettle, flip the light switch, and slide open the cabinet door.

3.1 Long Horizon Dynamics Learning

Dynamics learning, especially in an offline manner, is a challenging endeavour. Learning dynamics with expressive models such as neural networks is difficult because of uncertainty stemming from insufficient data (epistemic uncertainty) and from the inherent stochasticity of an environment (aleatoric uncertainty). Further, long-horizon dynamics modelling has long remained the bane of model-based reinforcement learning systems. Without an adequate model to rely upon while planning over a long horizon, model-based RL systems, although interpretable and simple, have fallen behind recent advances in model-free RL. The root cause of many failures of long-horizon planning is error aggregation, i.e., sub-optimal predictions at each time step during inference result in a trajectory of states that drifts increasingly far from the ground-truth transitions of the environment over time.

Here, instead of inferring a long-horizon trajectory of states in an auto-regressive manner by passing actions into the model one by one, we propose to feed a whole sequence of actions into V2S. V2S decodes a sequence of skill representations from such a sequence of actions and subsequently generates the expected trajectory of resultant states conditioned on a starting state.
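The error aggregation that motivates this design can be seen in a toy auto-regressive rollout. The scalar dynamics and the coefficients below are hypothetical and chosen only to make the effect visible; the sketch illustrates the failure mode of step-by-step prediction, not V2S itself.

```python
def rollout(coeff: float, actions, s0: float = 0.0):
    """Roll a scalar linear model s' = coeff * s + a forward one step at a time."""
    states, s = [], s0
    for a in actions:
        s = coeff * s + a   # each prediction is fed back in as the next input
        states.append(s)
    return states

actions = [1.0] * 180                   # 180-step action sequence
true_traj = rollout(0.90, actions)      # hypothetical ground-truth dynamics
model_traj = rollout(0.92, actions)     # learnt model with a small coefficient error

errors = [abs(m - t) for m, t in zip(model_traj, true_traj)]
for k in (2, 5, 180):
    print(f"{k:>3}-step absolute error: {errors[k - 1]:.3f}")
```

A coefficient error of 0.02 is barely visible after two steps but dominates the prediction by the end of the sequence, which is why the evaluation below reports 2-step, 5-step, and full-sequence errors separately.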

We compare our approach to four popular model/dynamics learning approaches currently used as state-of-the-art. As defined in chua2018deep:
Probabilistic Neural Network (PNN): A probabilistic NN is a network whose output neurons simply parameterize a probability distribution function, capturing aleatoric uncertainty. We use the negative log prediction probability as our loss function, i.e., $\mathrm{loss}_P(\theta) = -\sum_{n=1}^{N} \log \tilde{f}_\theta(s_{n+1} \mid s_n, a_n)$, and choose the output distribution to be Gaussian with a diagonal covariance matrix.

Deterministic Neural Network (DNN): A deterministic NN is a special case of a probabilistic network that outputs delta distributions centered around point predictions denoted by $\tilde{f}_\theta(s_t, a_t)$. It is trained using the mean squared error (MSE). MSE can be interpreted as $\mathrm{loss}_P(\theta)$ with a Gaussian model of fixed unit variance, but this variance cannot be used in practice for uncertainty propagation.

Ensembles - PE and DE: As in chua2018deep, we consider ensembles of $B$-many bootstrap models, using $\theta_b$ to refer to the parameters of the $b$-th model $\tilde{f}_{\theta_b}$. Ensembles can consist of probabilistic NNs (PE) or deterministic NNs (DE), both with effective probability distributions given by $\tilde{f}_\theta = \frac{1}{B} \sum_{b=1}^{B} \tilde{f}_{\theta_b}$.
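The baseline losses can be made concrete with a small numerical sketch. It implements the Gaussian negative log-likelihood with diagonal covariance, checks that it reduces to half the squared error plus a constant at unit variance, and evaluates an ensemble as a uniform mixture of its members. The function names and random test data are hypothetical, and the network that would produce `mean` and `var` is omitted.

```python
import numpy as np

def gaussian_log_prob(mean, var, target):
    """Log-density of `target` under N(mean, diag(var)); one value per sample."""
    mean, var, target = (np.asarray(x, dtype=float) for x in (mean, var, target))
    per_dim = -0.5 * (np.log(2.0 * np.pi * var) + (target - mean) ** 2 / var)
    return per_dim.sum(axis=-1)

def pnn_loss(mean, var, target):
    """Negative log prediction probability, summed over the batch."""
    return float(-gaussian_log_prob(mean, var, target).sum())

def ensemble_log_prob(means, variances, target):
    """Log-density under a uniform mixture of B Gaussian members.

    means, variances: arrays of shape (B, N, D); target: shape (N, D).
    """
    log_p = gaussian_log_prob(means, variances, np.asarray(target)[None])  # (B, N)
    m = log_p.max(axis=0)                       # stabilise the log-mean-exp
    return m + np.log(np.mean(np.exp(log_p - m), axis=0))

rng = np.random.default_rng(0)
mean = rng.normal(size=(5, 3))      # stand-in predicted next states
target = rng.normal(size=(5, 3))    # stand-in observed next states
unit_var = np.ones_like(mean)

nll = pnn_loss(mean, unit_var, target)
sse = float(((target - mean) ** 2).sum())
constant = 0.5 * mean.size * np.log(2.0 * np.pi)
```

With unit variance, `nll` equals `0.5 * sse + constant`, which is the sense in which MSE corresponds to a fixed-variance Gaussian model.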

V2S is pre-trained on the YouCook2 (Zhou_2018_CVPR) demonstrations. Subsequently, we provide each of the baselines and V2S with a dataset of demonstrations in the robot kitchen environment. Each model is trained for 1, 5, and 10 epochs and subsequently evaluated on unseen data for 2-step, 5-step, and 180-step (full-sequence) next-state prediction. We repeat training over 10 random seeds and report the standard error across seeds.

We find that our model is robust to long-horizon error aggregation. In Table 1, we compare how quickly each model adapts to the dynamics of the kitchen environment when the V2S backbone network is pre-trained on cooking videos. We find that our model outperforms the four baselines, learning the dynamics of the environment faster and maintaining performance over longer horizons.

Method 2-Step 5-Step Full Sequence
PNN: higuera2018synthesizing
 1 Epoch
 5 Epoch
 10 Epoch
DNN: nagabandi2018neural
 1 Epoch
 5 Epoch
 10 Epoch
DE: lakshminarayanan2016simple
 1 Epoch
 5 Epoch
 10 Epoch
PE: chua2018deep
 1 Epoch
 5 Epoch
 10 Epoch
V2S (ours)
 1 Epoch
 5 Epoch
 10 Epoch
Table 1: Long Horizon Dynamics Learning. We study the ability of V2S to learn long-horizon models of an environment in an offline manner. We pre-train the Backbone network and then train on the cyclical homomorphic objective for 1, 5, and 10 epochs. We find that V2S performs up to 10 times better over long-horizon sequences (lower RMSE is better).

3.2 Unsupervised Analogy Learning

V2S is trained on two separate datasets: YouCook2 and D4RL-FrankaKitchen. During pre-training, the agent learns environment-agnostic event representations encoding the associated sequences of video frames and textual commentary from YouCook2. During homomorphism learning, it learns to map sequences of states and actions in the robotic environment into the same latent space using the D4RL-FrankaKitchen dataset. Thus, we obtain a shared latent space which contains both event representations from cooking videos and skill representations from the kitchen environment. In Fig. 4, we show a t-SNE projection of this latent space (van2008visualizing). We then explore the overlapping latent vectors in the plot and decode them to visualize the analogies discovered by the architecture. We find that V2S models motion programs successfully across domains without any supervision. The model learns to pick up on analogies between a spreading motion in the cooking videos and a horizontal sliding motion in the kitchen demonstration sequences. In other instances, it discovers analogies between circular hinge motions in the kitchen environment and circular stirring motions in the cooking videos. We emphasize that no supervision was provided to map the individual domains to one another. The cyclical pair of MDP homomorphisms resulted in unsupervised analogy discovery.

Figure 4: Unsupervised Analogy Learning. Using the cyclical homomorphisms, we embed events from the human cooking demonstration videos and skills from the robotic kitchen environment into the same latent space. We explore the representation learning capacity by finding overlapping regions of the latent space and exploring their semantic meaning. V2S produces semantically meaningful analogies.

3.3 Zero-shot Skill Generation

Figure 5: Qualitative Evaluation of Generated Skills. V2S generates several significant semantically meaningful skills merely from human demonstrations. These motions are learnt in a reward free manner and can be used for complex tasks. Click here to view gifs of discovered skills.
Figure 6: Quantitative Evaluation of Generated Skills. We study the time-warped sequence distance (lower is better) between demonstrations from an expert agent in the kitchen environment and the skills generated by V2S. We find that V2S generates skills closer to expert trajectories than eysenbach2018diversity, a state-of-the-art unsupervised skill learning approach.

As both video demonstration event representations and skill vectors are embedded in a shared latent space, we explore the ability of our architecture to generate useful and semantically meaningful skills in a zero-shot manner. To do this, we sample event representation vectors and pass them as input to the textual decoders and subsequently to the action embedding function of the inverse homomorphism. The resultant actions are then applied to the environment to visualize how knowledge acquired from the cooking videos can be used to learn long-horizon action sequences that are semantically meaningful.

We find that the model is able to generate complex skills that were never seen in the robotic demonstration data, but were demonstrated by humans in the cooking video data. For example, our model produces a robotic stirring motion, both clockwise and counter-clockwise. Other skills include motion sequences that could be used for grasping, pouring, etc., if the robot were given extra artifacts like cups, water, etc. This link shows discovered skills from human demonstrations.

3.4 Quantitative Skill Assessment

In Fig. 6, we study the utility of the skills generated by V2S in being able to effectively manipulate objects and successfully complete tasks in the environment. To this end, we decode each of the skills generated by our architecture into control signals and find its smallest sequence discrepancy with respect to the demonstration data. This discrepancy is computed using the soft Dynamic Time Warping (soft-DTW) loss proposed by cuturi2017soft, which allows us to compare two sequences of different lengths. We compare the quality of our skills to those generated by DIAYN (eysenbach2018diversity). We find that our skills are up to three times closer to demonstration trajectories than those generated by DIAYN. We repeat experiments over 6 random seeds.
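The discrepancy computation can be sketched as below: a plain NumPy implementation of the soft-DTW recursion with a squared Euclidean cost, followed by a minimum over the demonstration set. The function names, the toy sequences, and the smoothing value `gamma` are hypothetical choices for illustration; the exact settings used in the experiments are not stated.

```python
import numpy as np

def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    values = np.asarray(values, dtype=float)
    m = values.min()
    return m - gamma * np.log(np.sum(np.exp(-(values - m) / gamma)))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between sequences x (n, d) and y (m, d)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)  # pairwise costs
    n, m = cost.shape
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]], gamma
            )
    return float(r[n, m])

def nearest_demo_discrepancy(skill, demos, gamma=1.0):
    """Smallest soft-DTW discrepancy between a skill and any demonstration."""
    return min(soft_dtw(skill, demo, gamma) for demo in demos)

skill = [[0.0], [1.0], [2.0]]                     # toy 3-step control sequence
demos = [[[0.0], [2.0]], [[10.0], [12.0]]]        # toy demonstrations of length 2
score = nearest_demo_discrepancy(skill, demos, gamma=0.01)
```

As `gamma` approaches zero the value approaches classical DTW; larger values smooth over alternative alignments and can make the discrepancy negative.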

4 Conclusion

We propose a reward-free approach to skill learning which utilizes prior knowledge to aid decision-making in a complex environment. We show that our architecture results in powerful long-horizon models and semantically meaningful skills, and that it uses human demonstration data to aid both. A drawback of our architecture is its size and training time (several GPU-months of training means significant energy expenditure); work towards leaner models will be beneficial. Additionally, there is still a gap between demonstration and generated skills (Fig. 6). Work towards bridging this gap is necessary.


  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work? See Conclusion.

    3. Did you discuss any potential negative societal impacts of your work? See conclusion.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Code link in supplementary material

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets? In code link in Supplementary Material.

    3. Did you include any new assets either in the supplemental material or as a URL? Supplementary Material has code. Sec 3.3 has link to Gifs.

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Implementation Details

We down-sample the video frames of each trajectory to 200 frames and encode each frame with ResNet-32 (he2016deep; pretrained on the MSCOCO dataset) into a fixed-dimensional embedding. Comments are encoded using BERT-base pre-trained embeddings. Each of the encoder and decoder modules consists of a Transformer (vaswani2017attention) encoder with 8 hidden layers and 8-head attention, which takes a positionally encoded sequence as input. Its output is then passed through a Transformer decoder with 8 hidden layers to generate the latent variables.

We cap the number of events discovered per trajectory at 16. This choice is based on the YouCook2 dataset statistics, where the minimum number of segments was 5 and the maximum was 16. We train the network with the Adam optimizer for 100 epochs with a batch size of 128 in all our experiments. We use 16 Nvidia A100 GPUs to train the backbone network. For the robotic data, we sample sequences of 180 states and 179 actions as input, with the batch size fixed to 32. Training the backbone takes on the order of days, with additional days needed to train the various adapted versions.
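The settings stated in this appendix can be collected into a configuration sketch. The class and field names are hypothetical; values that are not recoverable from the text (embedding sizes, latent size, learning rate) are deliberately left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BackboneConfig:
    """Pre-training settings for the backbone, as stated in the appendix."""
    frames_per_trajectory: int = 200
    transformer_layers: int = 8
    attention_heads: int = 8
    max_events: int = 16
    optimizer: str = "adam"
    epochs: int = 100
    batch_size: int = 128
    num_gpus: int = 16
    frame_embedding_dim: Optional[int] = None   # not stated in the text
    word_embedding_dim: Optional[int] = None    # not stated in the text
    latent_dim: Optional[int] = None            # not stated in the text
    learning_rate: Optional[float] = None       # not stated in the text

@dataclass(frozen=True)
class AdapterConfig:
    """Settings for training the homomorphisms on robotic data."""
    num_states: int = 180
    num_actions: int = 179   # one action between each pair of consecutive states
    batch_size: int = 32

backbone_cfg = BackboneConfig()
adapter_cfg = AdapterConfig()
```

A trajectory of 180 states is paired with 179 actions because each action connects two consecutive states.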