Unsupervised Action Segmentation for Instructional Videos

by AJ Piergiovanni, et al.

In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions. We present an unsupervised approach that learns the atomic actions of structured human tasks from a variety of instructional videos, based on a sequential stochastic autoregressive model for temporal segmentation of videos. The model learns to represent and discover the sequential relationships between the different atomic actions of a task, and provides automatic and unsupervised self-labeling.





1 Introduction

Instructional videos cover a wide range of tasks: cooking, furniture assembly, repairs, etc. The availability of online instructional videos for almost any task provides a valuable resource for learning, especially in the case of learning robotic tasks. So far, the primary focus of activity recognition has been on supervised classification or detection of discrete actions in videos. However, instructional videos are rarely annotated with atomic action-level instructions. In this work, we propose a method to learn to segment instructional videos into atomic actions in an unsupervised way, i.e., without any annotations. To do this, we take advantage of the structure in instructional videos: they comprise complex actions which inherently consist of smaller atomic actions with a predictable order. While the temporal structure of activities in instructional videos is strong, the visual appearance of actions is highly variable, which makes the task, especially in its unsupervised setting, very challenging. For example, videos of preparing a salad can be taken in very different environments, using kitchenware and ingredients of varying appearance.

The central idea is to learn a stochastic model that generates multiple, different candidate sequences, which can be ranked based on instructional video constraints. The top-ranked sequence is used as self-labels to train the action segmentation model. By iterating this process in an EM-like procedure, the model converges to a good segmentation of actions (Figure 1). In contrast to previous weakly-supervised [12, 4] and unsupervised [1, 8] action learning works, our method requires only the input videos; no further annotations are used.

Figure 1: Overview: Our model generates multiple sequences for each video, which are ranked based on several constraints (colors represent different actions). The top-ranked sequence is used as self-labels to train the action segmentation model. This process is repeated until convergence. No annotations are used.
Figure 2: Overview of the stochastic recurrent model which generates an output action per step and a latent state (which will in turn generate next actions). Each time the model is run, a different rule is selected, thanks to the Gumbel-Softmax trick, leading to a different action and state. This results in multiple sequences.

We evaluate the approach on multiple datasets and compare to previous methods on unsupervised action segmentation. We also compare to weakly-supervised and supervised baselines. Our unsupervised method outperforms all state-of-the-art unsupervised models, in some cases considerably, at times even surpassing weakly-supervised methods.

Our contributions are (1) a stochastic model capable of capturing multiple possible sequences, and (2) a set of constraints and a training method that learns to segment actions without any labeled data.

Related Work. Studying instructional videos has gained a lot of interest recently [1, 11], largely fueled by advancements in feature learning and activity recognition for videos. However, most work on activity segmentation has focused on the fully-supervised case, which requires per-frame labels of the occurring activities. Since it is expensive to fully annotate videos, weakly-supervised activity segmentation has been proposed. Initial works use movie scripts to obtain weak estimates of actions [9] or localize actions based on related web images [3]. Several unsupervised methods have also been proposed [1, 8, 15].

2 Method

Our goal is to discover atomic actions from a set of instructional videos, while capturing and modeling their temporal structure. Formally, given a set of videos of a task or set of tasks, the objective is to learn a model that maps a sequence of frames x_1, ..., x_T from any video to a sequence of atomic action symbols y_1, ..., y_T, with each y_t in A, where A is the set of possible action symbols.

In the unsupervised case, similar to previous works [1, 8], we assume no action labels or boundaries are given. Our model, however, works with a fixed K, the number of actions per task (analogous to setting k in k-means clustering). This is not a very strict assumption, as the number of expected atomic actions per instruction is roughly known.

Sequential Stochastic Autoregressive Model. The model consists of three components (S, A, R), where S is a finite set of states, A is a finite set of output symbols, and R is a finite set of transition rules mapping from a state to an output symbol and next state. Importantly, this model is stochastic, i.e., each rule is additionally associated with a probability of being selected. To implement this, we use fully-connected layers and the Gumbel-Softmax trick [5]. The model is applied autoregressively to generate a sequence (Figure 2).
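A minimal NumPy sketch of one stochastic autoregressive step is below. The function names (`gumbel_softmax_sample`, `select_rule`), the rule-scoring matrix `W`, and the exact shapes are illustrative assumptions, not the authors' implementation; the paper uses fully-connected layers in the same role.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Relaxed categorical sample: add Gumbel(0, 1) noise to the logits and
    apply a temperature-scaled softmax. Repeated calls with identical logits
    select different rules, which is what makes the model stochastic."""
    rng = rng if rng is not None else np.random.default_rng()
    u = rng.uniform(1e-10, 1.0 - 1e-10, size=np.shape(logits))
    g = -np.log(-np.log(u))                      # Gumbel(0, 1) noise
    y = (np.asarray(logits) + g) / tau
    y = y - y.max()                              # numerical stability
    e = np.exp(y)
    return e / e.sum()

def select_rule(state, feature, W, tau=1.0, rng=None):
    """One autoregressive step: score every transition rule from the current
    state and the frame feature, then pick one rule stochastically."""
    logits = W @ np.concatenate([state, feature])
    probs = gumbel_softmax_sample(logits, tau, rng)
    return int(np.argmax(probs))                 # hard, straight-through-style choice
```

In a full model, the selected rule would emit both an output action and the next latent state, as in Figure 2.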

For a video as input, we process each RGB frame with a CNN, resulting in a sequence of feature vectors. The model takes each feature as input, concatenates it with the current state, and produces the output action. Once applied to every frame, this results in a sequence of actions.

Figure 3: Multiple candidate sequences are generated and ranked. The best sequence according to the ranking function is chosen as the labels for the iteration.

Learning by Self-Labeling of Videos. In order to train the model without ground truth action sequences, we introduce an approach of learning by ‘self-labeling’ videos. The idea is to optimize the model by generating self-supervisory labels that best satisfy the constraints required for atomic actions. We first generate multiple candidate sequences, then rank them based on the instructional-video constraints, which importantly require no labeled data. Since the Gumbel-Softmax adds randomness, the output can be different each time the model is run on the same input, which is key to the approach. The ranking function we propose to capture the structure of instructional videos has multiple components: (1) every atomic action must occur once in the task; (2) every atomic action should have similar lengths across videos of the same task; (3) each symbol should reasonably match the provided visual feature.

The best sequence according to the ranking is selected as the action labels for the iteration (Fig. 3), and the network is trained using a standard cross-entropy loss. We note that depending on the structure of the dataset, these constraints may be adjusted, or other, more suitable ones can be designed. In Fig. 4, we show the top 5 candidate sequences and how they improve over the learning process.

Action Occurrence: Given a sequence of output actions y, the first constraint ensures that every action appears once. Formally, it is implemented as C_o(y) = sum over a in A of (1 − 1[a in y]), where 1[a in y] is 1 if action a occurs in y and 0 otherwise.
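The occurrence constraint can be sketched in a few lines of Python; the function name `occurrence_cost` is our own, for illustration.

```python
def occurrence_cost(seq, actions):
    """Number of expected atomic actions that never appear in the sequence.

    Zero exactly when every action in `actions` occurs at least once,
    matching the indicator-sum form of the constraint."""
    present = set(seq)
    return sum(1 for a in actions if a not in present)
```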

Modeling Action Length: This constraint ensures each atomic action has a similar duration across different videos. The simplest approach is to compute the difference in length compared to the average action length in the video. We also compare to scoring the length under a distribution (e.g., Poisson or Gaussian): C_l(y) = sum over a in A of −log P(len(a, y)), where len(a, y) computes the length of action a in sequence y and P is a Poisson or Gaussian distribution. The Poisson and Gaussian distributions have parameters which control the expected length of the actions in videos. The parameters can be set statically or learned for each action.
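The distribution-based variant of the length term can be sketched as follows, here with a Poisson prior. The helper names and the single shared rate `lam` are simplifying assumptions (the paper allows per-action parameters).

```python
import math

def segment_lengths(seq):
    """Lengths of maximal constant runs, grouped by action,
    e.g. [0, 0, 1, 1, 1] -> {0: [2], 1: [3]}."""
    lengths, run, count = {}, seq[0], 0
    for a in seq:
        if a == run:
            count += 1
        else:
            lengths.setdefault(run, []).append(count)
            run, count = a, 1
    lengths.setdefault(run, []).append(count)
    return lengths

def length_cost(seq, lam):
    """Sum of Poisson(lam) negative log-likelihoods of every segment length,
    penalizing segments much shorter or longer than the expected duration."""
    cost = 0.0
    for runs in segment_lengths(seq).values():
        for n in runs:
            cost -= n * math.log(lam) - lam - math.lgamma(n + 1)
    return cost
```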

Modeling Action Probability: The third constraint is implemented using a separate classification layer of the network, which gives the probability p(y_t | x_t) of frame x_t being classified as action y_t. Formally, C_p(y) = sum over t of −log p(y_t | x_t), the negative log-probability that each frame belongs to its selected action. This constraint is separate from the sequential, auto-regressive model and captures independent appearance-based probabilities. We find that using both together is useful empirically.
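A sketch of the per-frame probability term, assuming `frame_probs[t]` holds the separate classifier's distribution over actions for frame t (the epsilon guard is our addition for numerical safety):

```python
import numpy as np

def probability_cost(seq, frame_probs):
    """Sum over frames of the negative log-probability that frame t belongs
    to the selected action seq[t], using the per-frame classifier output."""
    return float(sum(-np.log(frame_probs[t][a] + 1e-12)
                     for t, a in enumerate(seq)))
```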

We can then compute the rank of any sequence as R(y) = C_o(y) + λ_1 C_l(y) + λ_2 C_p(y), where each λ weights the impact of its term. In practice, fixed settings of λ_1 and λ_2 work well.

Learning Actions: To choose the self-labeling, we sample multiple candidate sequences, compute the cost of each, and select the one that minimizes the above cost function. This gives the best segmentation of actions (at this iteration of labeling) based on the defined constraints.
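The selection step reduces to a ranked minimum over the sampled candidates. Below is a minimal sketch; the toy cost function stands in for the full R(y) and only penalizes missing actions, for illustration.

```python
def select_self_labels(candidates, cost_fn):
    """Rank the stochastically generated candidate labelings and keep the
    one with minimum cost; it becomes this iteration's training target."""
    return min(candidates, key=cost_fn)

# Toy stand-in for the full ranking function: count missing expected actions.
expected = {0, 1, 2}
missing = lambda s: sum(1 for a in expected if a not in s)
best = select_self_labels([[0, 0, 1], [0, 1, 2], [2, 2, 2]], missing)
```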

Cross-Video Matching

The above constraints work well for a single video; however, when we have multiple videos with the same actions, we can further improve the ranking function by adding a cross-video matching constraint. The motivation is that while breaking an egg can look visually different in two videos, the action is the same.

Given a video segment the model labeled as an action in one video, a segment the model labeled as the same action in a second video, and a segment the model labeled as a different action in any video, the cross-video similarity is computed using a triplet loss or a contrastive loss. As these functions are differentiable, they can be added to the loss function, the cost function, or both.
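The triplet form of the cross-video term can be sketched as below, assuming each segment is represented by a single feature vector (e.g., mean-pooled CNN features); the margin value is an illustrative default.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss on segment features: segments labeled with the same
    action in different videos (anchor, positive) should be closer than a
    segment labeled with a different action (negative), by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_pos - d_neg + margin))
```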

Figure 4: Candidate sequences at different stages of training. The sequences shown are the top 5 ranked sequences (rows) at the given epoch. The top one is selected as supervision for the given step. The colors represent the discovered actions (no labels are used).

Self-labeling Training Method. We now describe the full training method, which follows an EM-like procedure. In the first step, we find the optimal set of action self-labels given the current model parameters and the ranking function. In the second step, we optimize the model parameters (and optionally some ranking function parameters) for the selected self-labeling. After taking both steps, we have completed one iteration.
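The EM-like procedure above can be sketched as a short loop. The callables `generate`, `rank`, and `fit` are placeholders for the stochastic sequence generator, the ranking function R(y), and the cross-entropy training step, respectively; their signatures are assumptions for illustration.

```python
def train_by_self_labeling(videos, generate, rank, fit, iterations=10):
    """EM-like training: (step 1) pick the best-ranked candidate labeling for
    each video under the current model; (step 2) fit the model parameters
    (and optionally ranking parameters) to those self-labels."""
    labels = {}
    for _ in range(iterations):
        labels = {v: min(generate(v), key=rank) for v in videos}  # step 1
        fit(labels)                                               # step 2
    return labels
```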

Method NIV (F1) 50Sal (Acc) BR (MoF) BR (Jac)
Supervised Baselines
VGG from  [1] 0.376 60.8 62.8 75.4
I3D 0.472 72.8 67.8 79.4
AssembleNet [14] 0.558 77.6 72.5 82.1
CTC [4][14] 0.312 11.9 72.5 82.1
HTK [7] - 24.7 - -
HMM + RNN [12] - 45.5 33.3 47.3
NN-Viterbi [13] - 49.4 - -
ECTC [4][14] 0.334 - 27.7 -
Uniform Sampling 0.187 - - -
Alayrac et al. [1] 0.238 - - -
Kukleva et al. [8] 0.283 30.2 41.8 -
JointSeqFL [2] 0.373 - - -
SCV [10] - - 30.2 -
Sener et al. [15] - - 34.6 47.1
Ours 0.457 39.7 43.5 54.4
Table 1: Results on the NIV (left column), 50-Salads (50Sal, middle), and Breakfast (BR, right) datasets. We report metrics adopted from prior work for each dataset, where available.

Segmenting a video at inference: CNN features are computed for each frame, and the learned model is applied to those features. During rule selection, we greedily select the most probable rule. Future work could improve this by considering multiple possible sequences (e.g., following the Viterbi algorithm).
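Greedy decoding at inference can be sketched as follows. The `step_fn(state, feature)` interface returning (action_scores, next_state) is an assumption for illustration; in the real model it is the learned rule-selection network run without Gumbel noise.

```python
import numpy as np

def greedy_decode(features, init_state, step_fn):
    """Inference-time segmentation: at each frame, deterministically take the
    most probable rule (no Gumbel-Softmax randomness) and emit its action."""
    state, actions = init_state, []
    for feat in features:
        scores, state = step_fn(state, feat)
        actions.append(int(np.argmax(scores)))
    return actions
```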

3 Experiments

(Footnote: For the weakly-supervised setting, we use activity order as supervision, equivalent to previous works.)

We evaluate our unsupervised atomic action discovery approach on multiple video segmentation datasets: (1) the 50-Salads dataset [16], which contains 50 videos of people making salads (i.e., a single task); the videos contain the same set of actions (e.g., cut lettuce, cut tomato, etc.), but the ordering of actions differs in each video; (2) the Narrated Instructional Videos (NIV) dataset [1], which contains 5 different tasks (CPR, changing a tire, making coffee, jumping a car, re-potting a plant); (3) Breakfast [6], which contains videos of people making breakfast dishes from various camera angles and environments.

Evaluation Metrics: We follow all previously established protocols for evaluation on each dataset. We first use the Hungarian algorithm to map the predicted action symbols to action classes in the ground truth. Since different metrics are used for different datasets, we report the previously adopted metrics per dataset.
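The symbol-to-class mapping step can be sketched as below. For clarity this brute-forces all permutations, which is equivalent to the Hungarian algorithm's optimal assignment for small K (the Hungarian algorithm solves the same problem in O(K^3)); the function name is our own.

```python
from itertools import permutations

def best_label_mapping(pred, gt, k):
    """Map predicted action symbols 0..k-1 to ground-truth classes by
    maximizing frame-level agreement between the relabeled prediction
    and the ground truth."""
    def agreement(perm):
        return sum(1 for p, g in zip(pred, gt) if perm[p] == g)
    return max(permutations(range(k)), key=agreement)
```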

3.1 Comparison to the state-of-the-art

We compare to previous state-of-the-art methods on the three datasets (Table 1). Our approach provides better segmentation results than previous unsupervised approaches, and even better than some weakly-supervised methods.

Qualitative Analysis Fig. 4 shows the generated candidate sequences at different stages of learning. It can be seen that initially the generated sequences are entirely random and over-segmented. As training progresses, the generated sequences start to match the constraints. After 400 epochs, the generated sequences show similar order and length constraints, and better match the ground truth (as shown in the evaluation). Fig. 6 shows example results of our method.

3.2 Ablation experiments

Cost                       50-Salads  Breakfast
Randomly pick candidate    12.5       10.8
No Gumbel-Softmax          10.5       9.7
Occurrence (C_o)           22.4       19.8
Length (C_l)               19.6       17.8
Probability (C_p)          21.5       18.8
C_o + C_l                  27.5       25.4
C_o + C_p                  30.3       28.4
C_l + C_p                  29.7       27.8
C_o + C_l + C_p (full)     33.4       29.8
Table 2: Ablation with cost function terms.
Method               change tire  CPR   repot plant  make coffee  jump car  Avg.
Alayrac et al. [1]   0.41         0.32  0.18         0.20         0.08      0.238
Kukleva et al. [8]   -            -     -            -            -         0.283
Ours (VGG)           0.53         0.46  0.29         0.35         0.25      0.376
Ours (AssembleNet)   0.63         0.54  0.381        0.42         0.315     0.457
Table 3: Comparison on the NIV dataset of the proposed approach using VGG and AssembleNet features.
Figure 5: F1 value when varying the number of actions used in the model, compared to prior work. The number in parentheses indicates the ground-truth number of actions for each activity. Full results are in the supplementary material.

Effects of the cost function constraints. To determine how each cost term impacts the resulting performance, we compare various combinations of the terms. The results are shown in Table 2. We find that each term is important to the self-labeling of the videos. (These ablation methods do not use our full cross-video matching or action duration learning; thus the performances are slightly lower than our best results.) Generating better self-labels improves model performance, and each component is beneficial to the selection process. Intuitively this makes sense, as the terms were picked based on prior knowledge about instructional videos. We also compare to random selection of the candidate labeling and a version without the Gumbel-Softmax. Both alternatives perform poorly, confirming the benefit of the proposed approach.

Figure 6: Two example videos from the ‘change tire’ activity. The ground truth is shown in grey, the model’s top rank segmentation is shown in colors. NIV dataset.

Varying the number of actions. As K is a hyper-parameter controlling the number of actions the video is segmented into, we conduct experiments on NIV varying K to evaluate its effect. The results are shown in Figure 5. Overall, we find that the model is not overly sensitive to this hyper-parameter, but it does have some impact on performance, since each action must appear at least once in the video.

Features. As our work uses AssembleNet [14] features, in Table 3 we compare the proposed approach to previous ones using both VGG and AssembleNet features. As shown, even when using VGG features, our approach outperforms previous methods.


  • [1] Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016.
  • [2] Ehsan Elhamifar and Zwe Naing. Unsupervised procedure learning via joint dynamic summarization. 2019.
  • [3] Chuang Gan, Chen Sun, Lixin Duan, and Boqing Gong. Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In ECCV.
  • [4] De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. Connectionist temporal modeling for weakly supervised action labeling. In ECCV, 2016.
  • [5] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.
  • [6] Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, 2014.
  • [7] Hilde Kuehne, Alexander Richard, and Juergen Gall. Weakly supervised learning of actions from transcripts. CVIU, 2017.
  • [8] Anna Kukleva, Hilde Kuehne, Fadime Sener, and Jurgen Gall. Unsupervised learning of action classes with continuous temporal embedding. In CVPR, 2019.
  • [9] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In CVPR.
  • [10] Jun Li and Sinisa Todorovic. Set-constrained viterbi for set-supervised action segmentation. In CVPR, 2020.
  • [11] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV.
  • [12] Alexander Richard, Hilde Kuehne, and Juergen Gall. Weakly supervised action learning with rnn based fine-to-coarse modeling. In CVPR, 2017.
  • [13] Alexander Richard, Hilde Kuehne, Ahsan Iqbal, and Juergen Gall. NeuralNetwork-Viterbi: A framework for weakly supervised video learning. In CVPR.
  • [14] Michael Ryoo, AJ Piergiovanni, Mingxing Tan, and Anelia Angelova. Assemblenet: Searching for multi-stream neural connectivity in video architectures. In ICLR, 2020.
  • [15] Fadime Sener and Angela Yao. Unsupervised learning and segmentation of complex activities from video. In CVPR, pages 8368–8376, 2018.
  • [16] Sebastian Stein and Stephen J. McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. In ACM Pervasive and Ubiquitous Computing, 2013.