Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering

05/27/2021 · by Sateesh Kumar, et al.

We present a novel approach for unsupervised activity segmentation, which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering. This is in contrast with prior works, where representation learning and clustering are often performed sequentially. We leverage temporal information in videos by employing temporal optimal transport and temporal coherence loss. In particular, we incorporate a temporal regularization term into the standard optimal transport module, which preserves the temporal order of the activity, yielding the temporal optimal transport module for computing pseudo-label cluster assignments. Next, the temporal coherence loss encourages neighboring video frames to be mapped to nearby points and distant video frames to be mapped to farther away points in the embedding space. The combination of these two components results in effective representations for unsupervised activity segmentation. Furthermore, previous methods require storing learned features for the entire dataset before clustering them in an offline manner, whereas our approach processes one mini-batch at a time in an online manner. Extensive evaluations on three public datasets, i.e., 50 Salads, YouTube Instructions, and Breakfast, and our dataset, i.e., Desktop Assembly, show that our approach performs on par with or better than previous methods for unsupervised activity segmentation, despite having significantly lower memory requirements.

1 Introduction

* indicates joint first author. {sanjay,sateesh,awais,andrey,zeeshan,huy}@retrocausal.ai.

With the advent of deep learning, significant progress has been made in understanding human activities in videos. However, most of the research efforts so far have been invested in action recognition Tran et al. (2015); Carreira and Zisserman (2017); Wang et al. (2018); Tran et al. (2018), where the task is to classify simple actions in short videos. Recently, a few approaches have been proposed for dealing with complex activities in long videos, e.g., temporal action localization Shou et al. (2016, 2017); Chao et al. (2018); Zeng et al. (2019), which aims to detect video segments containing the actions of interest in an untrimmed video.

In this paper, we are interested in the problem of temporal activity segmentation, where our goal is to assign each frame of a long video capturing a complex activity to one of the action/sub-activity classes. One popular group of methods Kuehne et al. (2014, 2016); Lea et al. (2016); Chen et al. (2020); Li et al. (2020) on this topic requires per-frame action labels for fully-supervised training. However, frame-level annotations for all training videos are generally difficult and prohibitively costly to acquire. Weakly-supervised approaches Huang et al. (2016); Richard and Gall (2016); Kuehne et al. (2017); Richard et al. (2017, 2018); Ding and Xu (2018); Chang et al. (2019); Li et al. (2019), which need weak labels in the form of an ordered action list for each training video, have also been proposed. Unfortunately, these weak labels are not always available a priori and can be time-consuming to obtain, especially for large datasets. To avoid the above annotation requirements, unsupervised methods Malmaud et al. (2015); Sener et al. (2015); Alayrac et al. (2016); Sener and Yao (2018); Kukleva et al. (2019); VidalMata et al. (2021); Li and Todorovic (2021) have been introduced recently. Given a collection of unlabeled videos, they automatically segment the videos and discover the actions by grouping frames across all videos into clusters, with each cluster corresponding to one of the actions.

Figure 1: (a) Previous approaches Sener and Yao (2018); Kukleva et al. (2019); VidalMata et al. (2021); Li and Todorovic (2021) in unsupervised activity segmentation often perform representation learning and clustering sequentially, while storing the embedded features for the entire dataset before clustering them. (b) We unify representation learning and clustering into a single joint framework, which processes one mini-batch at a time. Our method explicitly optimizes for unsupervised activity segmentation and is much more memory efficient.

Previous approaches Sener and Yao (2018); Kukleva et al. (2019); VidalMata et al. (2021); Li and Todorovic (2021) in unsupervised activity segmentation usually separate the representation learning step from the clustering step in a sequential learning and clustering framework (see Fig. 1(a)), which prevents the feedback from the clustering step from flowing back to the representation learning step. Also, they need to store the computed features for the entire dataset before clustering them in an offline manner, leading to inefficient memory usage. In this work, we present a joint representation learning and online clustering approach for unsupervised activity segmentation (see Fig. 1(b)), which uses video frame clustering as the pretext task and hence directly optimizes for unsupervised activity segmentation. We employ a combination of temporal optimal transport and temporal coherence loss to leverage temporal information in videos. Specifically, the temporal optimal transport module preserves the temporal order of the activity when computing pseudo-label cluster assignments, while the temporal coherence loss encourages temporally close frames to be mapped to spatially nearby points in the embedding space and vice versa. The combination of the above components yields effective representations for unsupervised activity segmentation. In addition, our approach processes one mini-batch at a time and thus has substantially lower memory requirements (e.g., the 50 Salads dataset comprises 577,595 frames, while our mini-batch size is only 512 frames).

In summary, our contributions include:

  • We propose a novel method for unsupervised activity segmentation, which simultaneously performs representation learning and online clustering. We leverage video frame clustering as the pretext task, thus directly optimizing for unsupervised activity segmentation.

  • To exploit temporal cues in videos, we introduce the temporal optimal transport module, which enforces temporal order-preserving constraints on the computed pseudo-label cluster assignments. Further, we apply the temporal coherence loss, which encourages temporally close frames to be mapped to spatially nearby points in the embedding space and vice versa.

  • Our method performs on par with or better than the state-of-the-art in unsupervised activity segmentation on three public datasets, i.e., 50 Salads, YouTube Instructions, and Breakfast, and our dataset, i.e., Desktop Assembly, while being significantly more memory efficient.

  • We collect and label our Desktop Assembly dataset, which we will make publicly available.

2 Related Work

In the following, we summarize related works in unsupervised activity segmentation and self-supervised representation learning.

Unsupervised Activity Segmentation. Early methods Malmaud et al. (2015); Sener et al. (2015); Alayrac et al. (2016) in unsupervised activity segmentation explore cues from the accompanying narrations for segmenting the videos. They assume the narrations are available and well-aligned with the videos, which is not always the case and hence limits their applications. Approaches Sener and Yao (2018); Kukleva et al. (2019); VidalMata et al. (2021); Li and Todorovic (2021) which rely purely on visual inputs have been developed recently. Sener et al. Sener and Yao (2018) propose an iterative approach which alternates between learning a discriminative appearance model and optimizing a generative temporal model of the activity, while Kukleva et al. Kukleva et al. (2019) introduce a multi-step approach which includes learning a temporal embedding and performing K-means clustering on the learned features. VidalMata et al. VidalMata et al. (2021) and Li and Todorovic Li and Todorovic (2021) further improve the approach of Kukleva et al. (2019) by learning a visual embedding and an action-level embedding respectively. The above approaches Sener and Yao (2018); Kukleva et al. (2019); VidalMata et al. (2021); Li and Todorovic (2021) usually separate representation learning from clustering, and require storing the learned features for the whole dataset before clustering them. In contrast, our approach combines representation learning and clustering into a single joint framework, while processing one mini-batch at a time, leading to better results and memory efficiency. More recently, the concurrent work by Swetha et al. Swetha et al. (2021) proposes a joint representation learning and clustering approach. However, our approach differs from theirs in several aspects. Firstly, we employ optimal transport for clustering, while they use discriminative learning. Secondly, for representation learning, we employ a clustering-based loss and a temporal coherence loss, while they use a reconstruction loss. Lastly, despite our simpler encoder, our approach achieves similar or superior performance to theirs on three public datasets.

Image-Based Self-Supervised Representation Learning. Since the early work of Hinton and Zemel Hinton and Zemel (1994), considerable efforts Vincent et al. (2008); Larsson et al. (2016, 2017); Noroozi et al. (2017); Liu et al. (2018); Kim et al. (2018); Gidaris et al. (2018); Carlucci et al. (2019); Feng et al. (2019) have been invested in designing pretext tasks with artificial image labels for training deep networks for self-supervised representation learning. These include image denoising Vincent et al. (2008), image colorization Larsson et al. (2016, 2017), object counting Noroozi et al. (2017); Liu et al. (2018), solving jigsaw puzzles Kim et al. (2018); Carlucci et al. (2019), and predicting image rotations Gidaris et al. (2018); Feng et al. (2019). Recently, a few approaches Bautista et al. (2016); Xie et al. (2016); Yang et al. (2016); Caron et al. (2018, 2019); Asano et al. (2019); Huang et al. (2019); Zhuang et al. (2019); Caron et al. (2020); Gidaris et al. (2020); Yan et al. (2020) leveraging clustering as the pretext task have been introduced. For example, in Caron et al. (2018, 2019), K-means cluster assignments are used as pseudo-labels for learning self-supervised image representations, while the pseudo-label assignments are obtained by solving the optimal transport problem in Asano et al. (2019); Caron et al. (2020). In this paper, we focus on learning self-supervised video representations, which requires exploring both spatial and temporal cues in videos. In particular, we follow the clustering-based approaches of Asano et al. (2019); Caron et al. (2020); however, unlike them, we employ temporal optimal transport and temporal coherence loss to leverage temporal information in videos.

Video-Based Self-Supervised Representation Learning. Over the past few decades, a variety of pretext tasks have been proposed for learning self-supervised video representations Hadsell et al. (2006); Mobahi et al. (2009); Bengio and Bergstra (2009); Zou et al. (2011, 2012); Srivastava et al. (2015); Goroshin et al. (2015); Vondrick et al. (2016); Misra et al. (2016); Lee et al. (2017); Fernando et al. (2017); Ahsan et al. (2018); Diba et al. (2019); Han et al. (2019); Kim et al. (2019); Gammulle et al. (2019); Xu et al. (2019); Choi et al. (2020). A popular group of methods learn representations by predicting future frames Srivastava et al. (2015); Vondrick et al. (2016); Ahsan et al. (2018); Diba et al. (2019) or their encoding features Han et al. (2019); Kim et al. (2019); Gammulle et al. (2019). Another group explore temporal information such as temporal order Misra et al. (2016); Lee et al. (2017); Fernando et al. (2017); Xu et al. (2019); Choi et al. (2020) or temporal coherence Hadsell et al. (2006); Mobahi et al. (2009); Bengio and Bergstra (2009); Zou et al. (2011, 2012); Goroshin et al. (2015). The above approaches process a single video at a time. Recently, a few methods Sermanet et al. (2018); Dwibedi et al. (2019); Haresh et al. (2021) which optimize over a pair of videos at once have been introduced. TCN Sermanet et al. (2018) learns representations via the time-contrastive loss across different viewpoints and neighboring frames, while TCC Dwibedi et al. (2019) and LAV Haresh et al. (2021) perform frame matching and temporal alignment between videos respectively. In this work, we learn self-supervised video representations by clustering video frames, which explicitly optimizes for the downstream task of unsupervised activity segmentation.

3 Our Approach

We now describe our main contribution, which is an unsupervised approach for activity segmentation. In particular, we propose a joint self-supervised representation learning and online clustering approach, which uses video frame clustering as the pretext task and hence directly optimizes for unsupervised activity segmentation. We exploit temporal information in videos by using temporal optimal transport and temporal coherence loss. Fig. 2 shows an overview of our approach. Below we first define some notations and then provide the details of our representation learning and online clustering modules, including temporal coherence loss and temporal optimal transport, in Secs. 3.1 and 3.2 respectively.

Notations. We denote the embedding function as $f_\theta$, i.e., a neural network with learnable parameters $\theta$. Our approach takes as input a mini-batch $X = [x_1, x_2, \dots, x_B]$, where $B$ is the number of frames in $X$. For a frame $x_i$ in $X$, the embedding features of $x_i$ are expressed as $z_i = f_\theta(x_i) \in \mathbb{R}^D$, with $D$ being the dimension of the embedding features. The embedding features of $X$ are then written as $Z = [z_1, z_2, \dots, z_B] \in \mathbb{R}^{B \times D}$. Moreover, we denote $C = [c_1, c_2, \dots, c_K] \in \mathbb{R}^{D \times K}$ as the learnable prototypes of the $K$ clusters, with $c_j$ representing the prototype of the $j$-th cluster. Lastly, $P \in \mathbb{R}^{B \times K}$ and $Q \in \mathbb{R}^{B \times K}$ are the predicted cluster assignments (i.e., predicted “codes”) and pseudo-label cluster assignments (i.e., pseudo-label “codes”) respectively.
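To make the notation concrete, below is a minimal NumPy sketch of these objects, assuming the 2-layer MLP encoder on top of pre-computed frame features described in Sec. 4; all dimensions, the ReLU nonlinearity, and the L2 normalization of embeddings are illustrative assumptions rather than the exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D_in, D, K = 512, 64, 40, 12   # mini-batch size, input/embedding dims, #clusters (placeholders)

# A 2-layer MLP encoder f_theta applied to pre-computed frame features.
W1, b1 = 0.01 * rng.standard_normal((D_in, 128)), np.zeros(128)
W2, b2 = 0.01 * rng.standard_normal((128, D)), np.zeros(D)

def f_theta(X):
    """Map a mini-batch of frame features X (B, D_in) to embeddings Z (B, D)."""
    Z = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)  # assumed L2 normalization

C = rng.standard_normal((D, K))   # learnable prototypes, one column per cluster

Z = f_theta(rng.standard_normal((B, D_in)))   # embedding features of the mini-batch
```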

Figure 2: Given the frames $X$, we feed them to the encoder $f_\theta$ to obtain the features $Z$, which are combined with the prototypes $C$ to produce the predicted codes $P$. Meanwhile, $Z$ and $C$ are also fed to the temporal optimal transport module to compute the pseudo-label codes $Q$. We jointly learn $f_\theta$ and $C$ by applying the cross-entropy loss on $P$ and $Q$ and the temporal coherence loss on $Z$.

3.1 Representation Learning

To learn self-supervised representations for unsupervised activity segmentation, our proposed idea is to use video frame clustering as the pretext task. Thus, the learned features are explicitly optimized for unsupervised activity segmentation. In this work, we consider a similar clustering-based self-supervised representation learning approach as Asano et al. (2019); Caron et al. (2020). However, unlike their approaches which are designed for image data, we propose two main modifications including temporal optimal transport and temporal coherence loss to make use of temporal information additionally available in video data. Below we describe our losses for learning representations for unsupervised activity segmentation.

Cross-Entropy Loss. Given the frames $X$, we first pass them to the encoder $f_\theta$ to obtain the features $Z$. We then compute the predicted codes $P$ with each entry $p_{ij}$ written as:

$$p_{ij} = \frac{\exp\left(\frac{1}{\tau} z_i^\top c_j\right)}{\sum_{k=1}^{K} \exp\left(\frac{1}{\tau} z_i^\top c_k\right)}, \qquad (1)$$

where $p_{ij}$ is the probability that the $i$-th frame is assigned to the $j$-th cluster and $\tau$ is the temperature parameter Wu et al. (2018). The pseudo-label codes $Q$ are computed by solving the temporal optimal transport problem, which we will describe in the next section. For clustering-based representation learning, we minimize the cross-entropy loss with respect to the encoder parameters $\theta$ and the prototypes $C$ as:

$$\mathcal{L}_{CE} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{K} q_{ij} \log p_{ij}. \qquad (2)$$
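A minimal NumPy sketch of Eqs. (1) and (2) is given below, using the shapes from the notation above; the temperature value and function names are our own illustrative choices.

```python
import numpy as np

def predicted_codes(Z, C, tau=0.1):
    """Eq. (1): per-frame softmax over similarities to the prototypes.

    Z: (B, D) frame features, C: (D, K) prototypes. Returns P of shape (B, K)."""
    logits = (Z @ C) / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def cross_entropy_loss(P, Q, eps=1e-12):
    """Eq. (2): cross-entropy between pseudo-label codes Q and predictions P.

    Assumes each row of Q has been renormalized to sum to 1 per frame."""
    return -np.mean(np.sum(Q * np.log(P + eps), axis=1))
```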

Temporal Coherence Loss. To further exploit temporal information in videos, we add another self-supervised loss, i.e., the temporal coherence loss. It learns an embedding space following the temporal coherence constraints Hadsell et al. (2006); Mobahi et al. (2009); Goroshin et al. (2015), where temporally close frames should be mapped to nearby points and temporally distant frames should be mapped to far away points. To enable fast convergence and effective representations, we employ the N-pair metric learning loss proposed by Sohn (2016). For each video, we first sample a subset of $N$ ordered frames denoted by $\{x_1, x_2, \dots, x_N\}$ (with $N \leq B$). For each $x_i$, we then sample a “positive” example $x_i^+$ inside a temporal window of $\lambda$ from $x_i$. Moreover, $x_j^+$ sampled for $x_j$ (with $j \neq i$) is considered as a “negative” example for $x_i$. We minimize the temporal coherence loss with respect to the encoder parameters $\theta$ as:

$$\mathcal{L}_{TC} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(z_i^\top z_i^+\right)}{\sum_{j=1}^{N} \exp\left(z_i^\top z_j^+\right)}. \qquad (3)$$

Final Loss. Our final loss for learning self-supervised representations for unsupervised activity segmentation is a combination of the cross-entropy loss and the temporal coherence loss as:

$$\mathcal{L} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{TC}. \qquad (4)$$

Here, $\alpha$ is the weight for the temporal coherence loss. Our final loss is optimized with respect to $\theta$ and $C$. The cross-entropy loss and the temporal coherence loss are differentiable and can be optimized using backpropagation. Note that we do not backpropagate through $Q$.
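Below is a small NumPy sketch of the N-pair formulation in Eq. (3), assuming the anchors and their temporally close positives have already been sampled and encoded; the sampling window and any feature normalization are left out.

```python
import numpy as np

def temporal_coherence_loss(Z_anchor, Z_pos):
    """Eq. (3): N-pair loss. Z_anchor, Z_pos: (N, D) embeddings of anchors and
    their temporally close positives; positives of other anchors act as negatives."""
    sims = Z_anchor @ Z_pos.T                     # (N, N) pairwise similarities
    sims -= sims.max(axis=1, keepdims=True)       # numerical stability
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))         # matched pairs lie on the diagonal
```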

3.2 Online Clustering

Below we describe our online clustering module for computing the pseudo-label codes $Q$ online. Following Asano et al. (2019); Caron et al. (2020), we cast the problem of computing $Q$ as an optimal transport problem and solve for $Q$ online, using one mini-batch at a time. This is different from prior works Sener and Yao (2018); Kukleva et al. (2019); VidalMata et al. (2021); Li and Todorovic (2021) for unsupervised activity segmentation, which require storing the features for the entire dataset before clustering them in an offline fashion and hence have significantly higher memory requirements.

Optimal Transport. Given the features $Z$ extracted from the frames $X$, our goal is to compute the pseudo-label codes $Q$ with each entry $q_{ij}$ representing the probability that the features $z_i$ should be mapped to the prototype $c_j$. Specifically, $Q$ is computed by solving the optimal transport problem as:

$$\max_{Q \in \mathcal{Q}} \; \operatorname{Tr}\left(Q^\top Z C\right) + \varepsilon H(Q), \qquad (5)$$
$$\mathcal{Q} = \left\{ Q \in \mathbb{R}_+^{B \times K} \;\middle|\; Q \mathbf{1}_K = \frac{1}{B} \mathbf{1}_B, \; Q^\top \mathbf{1}_B = \frac{1}{K} \mathbf{1}_K \right\}. \qquad (6)$$

Here, $\mathbf{1}_B$ and $\mathbf{1}_K$ denote vectors of ones in dimensions $B$ and $K$ respectively. In Eq. 5, the first term measures the similarity between the features $Z$ and the prototypes $C$, while the second term (i.e., $H(Q) = -\sum_{ij} q_{ij} \log q_{ij}$) measures the entropy regularization of $Q$, and $\varepsilon$ is the weight for the entropy term. A large value of $\varepsilon$ usually leads to a trivial solution where every frame has the same probability of being assigned to every cluster. Thus, we use a small value of $\varepsilon$ in our experiments to avoid the above trivial solution. Furthermore, Eq. 6 represents the equal partition constraints, which enforce that each cluster is assigned the same number of frames in a mini-batch, thus preventing a trivial solution where all frames are assigned to a single cluster. Although the equal partition prior does not hold for activities with actions of varying lengths, we find that in practice it works relatively well for most activities (e.g., see Fig. 5). The solution of the above optimal transport problem can be computed by using the iterative Sinkhorn-Knopp algorithm Cuturi (2013) as:

$$Q^* = \operatorname{Diag}(u) \exp\left(\frac{Z C}{\varepsilon}\right) \operatorname{Diag}(v), \qquad (7)$$

where $u \in \mathbb{R}^B$ and $v \in \mathbb{R}^K$ are renormalization vectors Cuturi (2013).
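A compact NumPy sketch of solving Eqs. (5)-(7) with Sinkhorn-Knopp iterations is shown below; the number of iterations and the value of the regularization weight are illustrative.

```python
import numpy as np

def sinkhorn_codes(Z, C, eps=0.05, n_iters=50):
    """Entropy-regularized OT (Eqs. (5)-(7)) with uniform marginals over
    frames and clusters. Z: (B, D) features, C: (D, K) prototypes.
    Returns pseudo-label codes Q of shape (B, K)."""
    B, K = Z.shape[0], C.shape[1]
    S = Z @ C                                      # frame-prototype similarities
    M = np.exp((S - S.max()) / eps)                # kernel; constant shift is absorbed by u, v
    r = np.full(B, 1.0 / B)                        # uniform marginal over frames
    c = np.full(K, 1.0 / K)                        # equal-partition marginal over clusters
    u, v = np.ones(B), np.ones(K)
    for _ in range(n_iters):                       # alternate row/column renormalizations
        u = r / (M @ v)
        v = c / (M.T @ u)
    return (u[:, None] * M) * v[None, :]           # Q* = Diag(u) M Diag(v)
```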

Temporal Optimal Transport. The above approach is originally developed for image data in Asano et al. (2019); Caron et al. (2020) and hence is not capable of exploiting temporal cues in video data for unsupervised activity segmentation. Thus, we propose to incorporate a temporal regularization term which preserves the temporal order of the activity into the objective in Eq. 5, yielding the temporal optimal transport.

Motivated by Su and Hua (2017), we introduce a prior distribution for $Q$, namely $T \in \mathbb{R}^{B \times K}$, where the highest values appear on the diagonal and the values gradually decrease along the direction perpendicular to the diagonal. Specifically, $T$ maintains a fixed order of the clusters, and enforces initial frames to be assigned to initial clusters and later frames to be assigned to later clusters. Mathematically, $T$ can be represented by a 2D distribution, whose marginal distribution along any line perpendicular to the diagonal is a Gaussian distribution centered at the intersection on the diagonal, as:

$$T_{ij} = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{d(i,j)^2}{2\sigma^2}\right), \qquad (8)$$

with $d(i,j)$ measuring the distance from the entry $(i,j)$ to the diagonal line. Although the above temporal order-preserving prior does not hold for activities with permutations, we empirically observe that it performs relatively well on most datasets containing permutations (e.g., see Tabs. 3, 4, and 5).
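The following NumPy sketch builds such a prior, assuming frames and clusters are placed on a normalized [0, 1] axis so that the diagonal distance reduces to the gap between their normalized positions; the exact distance parameterization and the value of sigma in the paper may differ.

```python
import numpy as np

def temporal_prior(B, K, sigma=0.3):
    """Order-preserving prior T of shape (B, K): a Gaussian bump around the
    diagonal of the frame-to-cluster assignment matrix (Eq. (8))."""
    frame_pos = (np.arange(B) + 0.5) / B          # normalized temporal positions
    cluster_pos = (np.arange(K) + 0.5) / K        # normalized cluster positions
    d = np.abs(frame_pos[:, None] - cluster_pos[None, :])   # distance to the diagonal
    T = np.exp(-d ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return T / T.sum()                            # normalize to a joint distribution
```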

To encourage the distribution of values of $Q$ to be as similar as possible to the above prior distribution $T$, we replace the objective in Eq. 5 with the temporal optimal transport objective as:

$$\max_{Q \in \mathcal{Q}} \; \operatorname{Tr}\left(Q^\top Z C\right) - \rho \, \mathrm{KL}\left(Q \,\|\, T\right). \qquad (9)$$

Here, $\mathrm{KL}(Q \,\|\, T)$ is the Kullback-Leibler (KL) divergence between $Q$ and $T$, and $\rho$ is the weight for the KL term. Note that $\mathcal{Q}$ is defined as in Eq. 6. Following Cuturi (2013), we can derive the solution for the above temporal optimal transport problem as:

$$Q^* = \operatorname{Diag}(u) \left( T \odot \exp\left(\frac{Z C}{\rho}\right) \right) \operatorname{Diag}(v), \qquad (10)$$

where $u$ and $v$ are renormalization vectors Cuturi (2013), and $\odot$ denotes element-wise multiplication.
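Putting the pieces together, below is a NumPy sketch of the temporal optimal transport of Eqs. (9)-(10): the same Sinkhorn iterations as before, but run on the kernel formed by the element-wise product of the prior T with the exponentiated similarities; the value of rho is illustrative.

```python
import numpy as np

def temporal_ot_codes(Z, C, T, rho=0.07, n_iters=50):
    """Eqs. (9)-(10): KL-regularized OT toward the temporal prior T.

    Z: (B, D) features, C: (D, K) prototypes, T: (B, K) prior.
    Returns pseudo-label codes Q of shape (B, K)."""
    B, K = Z.shape[0], C.shape[1]
    S = Z @ C
    M = T * np.exp((S - S.max()) / rho)           # T * exp(ZC / rho); shift absorbed by u, v
    r, c = np.full(B, 1.0 / B), np.full(K, 1.0 / K)
    u, v = np.ones(B), np.ones(K)
    for _ in range(n_iters):
        u = r / (M @ v)
        v = c / (M.T @ u)
    return (u[:, None] * M) * v[None, :]          # Q* = Diag(u) (T * exp(ZC/rho)) Diag(v)
```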

4 Experiments

Implementation Details. We use a 2-layer MLP for learning the embedding on top of pre-computed features (see below). The MLP is followed by a dot-product operation with the prototypes, which are initialized randomly and learned via backpropagation through the losses presented in Sec. 3.1. The ADAM optimizer Kingma and Ba (2014) is used for training. For each activity, the number of prototypes is set as the number of actions in the activity. For our approach, the order of the actions is fixed as mentioned in Sec. 3.2. During inference, cluster assignment probabilities for all frames are computed. These probabilities are then passed to a Viterbi decoder for smoothing out the probabilities given the order of the actions. Note that, for a fair comparison, the above protocol is the same as in CTE Kukleva et al. (2019), which is the closest work to ours. Due to space limits, please refer to the supplementary material for more implementation details.
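As an illustration of the inference step, the sketch below decodes an order-consistent labeling from frame-wise log-probabilities with a simple dynamic program; it assumes each action occurs exactly once in the given order, which may be stricter than the actual decoder used in the paper.

```python
import numpy as np

def ordered_viterbi(log_probs):
    """Decode per-frame labels that are non-decreasing over time.

    log_probs: (B, K) log cluster-assignment probabilities, with the K clusters
    already arranged in the order the actions occur (assumes B >= K).
    Returns an array of (B,) labels."""
    B, K = log_probs.shape
    dp = np.full((B, K), -np.inf)
    back = np.zeros((B, K), dtype=int)
    dp[0, 0] = log_probs[0, 0]                    # the video starts with the first action
    for t in range(1, B):
        for k in range(K):
            stay = dp[t - 1, k]                   # remain in the same action
            advance = dp[t - 1, k - 1] if k > 0 else -np.inf   # move to the next action
            prev = k if stay >= advance else k - 1
            dp[t, k] = max(stay, advance) + log_probs[t, k]
            back[t, k] = prev
    labels = np.empty(B, dtype=int)
    labels[-1] = K - 1                            # the video ends with the last action
    for t in range(B - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return labels
```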

Datasets. For evaluation, we use three public datasets (all under Creative Commons License), namely 50 Salads Stein and McKenna (2013), YouTube Instructions (YTI) Alayrac et al. (2016), and Breakfast Kuehne et al. (2014), while introducing our new dataset named Desktop Assembly:

  • The 50 Salads dataset consists of 50 videos of 25 actors performing a cooking activity, totaling over 4 hours of video. Following previous works, we report results at two granularity levels, i.e., Eval and Mid. For Eval, multiple action classes are merged into a single class (e.g., “cut cucumber”, “cut tomato”, and “cut cheese” are all considered as “cut”). Thus, it has fewer action classes than Mid. We use pre-computed features provided by Wang and Schmid (2013).

  • The YTI dataset includes 150 videos belonging to 5 activities. The average video length is about 2 minutes. This dataset also has a large number of frames labeled as background. Following previous works, we use pre-computed features provided by Alayrac et al. (2016).

  • The Breakfast dataset consists of 10 activities with about 8 actions per activity. The average video length varies from a few seconds to several minutes depending on the activity. Following previous works, we use pre-computed features provided by Alayrac et al. (2016).

  • Our Desktop Assembly dataset includes 78 videos of actors performing an assembly activity. The activity comprises 22 actions conducted in a fixed order. Each video is about 1.5 minutes long. We use pre-computed features obtained from ResNet-18 He et al. (2016) pre-trained on ImageNet; a sketch of this feature extraction step is given after this list.
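For reference, a minimal sketch of per-frame feature extraction with an ImageNet pre-trained ResNet-18 is shown below, using torchvision; the preprocessing choices mirror standard ImageNet settings and are not necessarily the exact ones used for the dataset.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ResNet-18 pre-trained on ImageNet, with the classification head removed,
# so every frame is mapped to a 512-dimensional feature vector.
resnet = models.resnet18(pretrained=True)
encoder = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(pil_frames):
    """pil_frames: list of PIL.Image video frames -> (N, 512) torch features."""
    batch = torch.stack([preprocess(f) for f in pil_frames])
    return encoder(batch).flatten(1)
```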

Metrics. Since no labels are provided for training, there is no direct mapping between predicted and ground truth segments. To establish this mapping, we follow Sener and Yao (2018); Kukleva et al. (2019) and perform Hungarian matching. Note that the Hungarian matching is conducted at the activity level, i.e., it is computed over all frames of an activity. This is different from the Hungarian matching used in Aakur and Sarkar (2019), which is done over all frames of a video and generally leads to better results due to more fine-grained matching VidalMata et al. (2021). We adopt Mean Over Frames (MOF) and F1-Score as our metrics. MOF is the percentage of correct frame-wise predictions averaged over all activities. For F1-Score, to compute precision and recall, positive detections must have more than 50% overlap with ground truth segments. F1-Score is computed for each video and then averaged over all videos. Please refer to Kukleva et al. (2019) for more details.
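A simplified Python sketch of the activity-level matching and MOF computation is given below; handling of background frames and the per-activity averaging are omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mof_with_hungarian(pred, gt, n_clusters, n_classes):
    """Match predicted clusters to ground-truth classes over all frames of one
    activity (Hungarian matching), then report frame-wise accuracy (MOF).

    pred, gt: 1D integer arrays of per-frame cluster / class indices."""
    overlap = np.zeros((n_clusters, n_classes))
    for p, g in zip(pred, gt):
        overlap[p, g] += 1                         # co-occurrence counts
    rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
    mapping = dict(zip(rows, cols))
    matched = np.array([mapping.get(p, -1) for p in pred])
    return float(np.mean(matched == gt))           # fraction of correctly labeled frames
```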

Competing Methods. We benchmark against various methods Alayrac et al. (2016); Sener and Yao (2018); Kukleva et al. (2019); VidalMata et al. (2021); Li and Todorovic (2021) for unsupervised activity segmentation. In particular, Frank-Wolfe Alayrac et al. (2016) explores accompanying narrations. Mallow Sener and Yao (2018) iterates between representation learning based on discriminative learning and temporal modeling based on a generalized Mallows model. CTE Kukleva et al. (2019) leverages time-stamp prediction for representation learning and then K-means for clustering. Recent works, i.e., VTE VidalMata et al. (2021) and ASAL Li and Todorovic (2021), further improve CTE Kukleva et al. (2019) by exploring visual cues (via future frame prediction) and action-level cues (via action shuffle prediction) respectively. Also, we compare against the concurrent work, i.e., UDE Swetha et al. (2021), which uses discriminative learning for clustering and a reconstruction loss for representation learning.

4.1 Ablation Study Results

Variants   F1-Score  MOF
OT         27.8      37.6
OT+CTE     34.3      40.4
OT+TCL     30.3      27.5
TOT        42.8      47.4
TOT+CTE    36.0      40.8
TOT+TCL    48.2      44.5
Table 1: Ablation study results on 50 Salads (i.e., Eval granularity). The best results are in bold. The second best are underlined.
Variants   F1-Score  MOF
OT         11.6      16.0
OT+CTE     22.0      35.2
OT+TCL     24.8      35.7
TOT        30.0      40.6
TOT+CTE    26.7      38.2
TOT+TCL    32.9      45.3
Table 2: Ablation study results on YouTube Instructions. The best results are in bold. The second best are underlined.
Figure 3: Pseudo-label codes computed by different variants, i.e., (a) OT, (b) OT+CTE, (c) TOT, and (d) TOT+TCL, for a 50 Salads video. TOT and TOT+TCL are more effective at capturing the temporal order of the activity than OT and OT+CTE.

We perform ablation studies on 50 Salads (i.e., Eval granularity) and YTI to show the effectiveness of our design choices in Sec. 3. Tabs. 1 and 2 show the ablation study results. We begin with the standard optimal transport (OT), without any temporal prior. From Tabs. 1 and 2, OT has the worst overall performance, e.g., OT obtains 27.8 for F1-Score on 50 Salads, and 11.6 for F1-Score and 16.0 for MOF on YTI. Next, we experiment with adding temporal priors to OT, including the time-stamp prediction loss of CTE Kukleva et al. (2019) (yielding OT+CTE), the temporal coherence loss in Sec. 3.1 (yielding OT+TCL), and the temporal order-preserving prior in Sec. 3.2 (yielding TOT). We notice that while OT+CTE, OT+TCL, and TOT outperform OT, TOT achieves the best performance among them, e.g., TOT obtains 42.8 for F1-Score on 50 Salads, and 30.0 for F1-Score and 40.6 for MOF on YTI. The above observations are also confirmed by plotting the pseudo-label codes computed by different variants in Fig. 3. It can be seen that OT fails to capture any temporal structure of the activity, whereas TOT manages to capture the temporal order of the activity relatively well (i.e., initial frames should be mapped to initial prototypes and vice versa). Finally, we try adding more temporal priors to TOT, including the time-stamp prediction loss of CTE Kukleva et al. (2019) (yielding TOT+CTE) and the temporal coherence loss in Sec. 3.1 (yielding TOT+TCL). We observe that TCL is often complementary to TOT, and TOT+TCL achieves the best overall performance, e.g., TOT+TCL obtains 48.2 for F1-Score on 50 Salads, and 32.9 for F1-Score and 45.3 for MOF on YTI. We notice that TOT+TCL has a lower MOF than TOT on 50 Salads, which might be because TCL optimizes for disparate representations for different actions but multiple action classes are merged into one in 50 Salads (i.e., Eval granularity).

Due to space constraints, we include the results of our ablation studies on the effects of $\alpha$ (i.e., the weight for the temporal coherence loss in Eq. 4) and $\rho$ (i.e., the weight for the temporal regularization in Eq. 9) in the supplementary material.

4.2 Results on 50 Salads

Tab. 3 presents the MOF results of different unsupervised activity segmentation methods on 50 Salads. From the results, TOT outperforms CTE Kukleva et al. (2019) by 11.9 and 1.6 on the Eval and Mid granularity respectively. Similarly, TOT outperforms VTE VidalMata et al. (2021) by 16.9 and 7.6 on the Eval and Mid granularity respectively. Note that CTE, which uses a sequential representation learning and clustering framework, is our most relevant competitor. VTE further improves CTE by exploring visual information via future frame prediction, which is not utilized in TOT. The significant performance gains of TOT over CTE and VTE show the advantages of joint representation learning and clustering. Moreover, TOT performs the best on the Eval granularity, outperforming the recent work of ASAL Li and Todorovic (2021) and the concurrent work of UDE Swetha et al. (2021) by 8.2 and 5.2 respectively. Finally, by combining TOT and TCL, we achieve 34.3 on the Mid granularity, which is very close to the best performance of 34.4 by ASAL. Also, TOT+TCL outperforms ASAL and UDE by 5.3 and 2.3 on the Eval granularity respectively. As mentioned previously, TOT+TCL has a lower MOF than TOT on the Eval granularity, which might be because multiple action classes are merged into one.

4.3 Results on YouTube Instructions

Here, we compare our approach against state-of-the-art methods Alayrac et al. (2016); Sener and Yao (2018); Kukleva et al. (2019); VidalMata et al. (2021); Li and Todorovic (2021); Swetha et al. (2021) for unsupervised activity segmentation on YTI. Following all of the above works, we report the performance without considering background frames. Tab. 4 presents the results. As we can see from Tab. 4, TOT+TCL achieves the best performance on both metrics, outperforming all competing methods including the recent work of ASAL Li and Todorovic (2021) and the concurrent work of UDE Swetha et al. (2021). In particular, TOT+TCL achieves 32.9 for F1-Score, while ASAL and UDE obtain 32.1 and 29.6 respectively. Similarly, TOT+TCL achieves 45.3 for MOF, while ASAL and UDE obtain 44.9 and 43.8 respectively. In addition, although TOT is inferior to TOT+TCL on both metrics, TOT outperforms a few competing methods. Specifically, TOT has a higher F1-Score than UDE Swetha et al. (2021), VTE VidalMata et al. (2021), CTE Kukleva et al. (2019), Mallow Sener and Yao (2018), and Frank-Wolfe Alayrac et al. (2016), and a higher MOF than CTE Kukleva et al. (2019) and Mallow Sener and Yao (2018).

Approach                       Eval  Mid
CTE Kukleva et al. (2019)      35.5  30.2
VTE VidalMata et al. (2021)    30.5  24.2
ASAL Li and Todorovic (2021)   39.2  34.4
UDE Swetha et al. (2021)       42.2  -
Ours (TOT)                     47.4  31.8
Ours (TOT+TCL)                 44.5  34.3
Table 3: Results on 50 Salads. The best results are in bold. The second best are underlined.
Approach                            F1-Score  MOF
Frank-Wolfe Alayrac et al. (2016)   24.4      -
Mallow Sener and Yao (2018)         27.0      27.8
CTE Kukleva et al. (2019)           28.3      39.0
VTE VidalMata et al. (2021)         29.9      -
ASAL Li and Todorovic (2021)        32.1      44.9
UDE Swetha et al. (2021)            29.6      43.8
Ours (TOT)                          30.0      40.6
Ours (TOT+TCL)                      32.9      45.3
Table 4: Results on YouTube Instructions. The best results are in bold. The second best are underlined.

4.4 Results on Breakfast

Approach                               F1-Score  MOF
Fully Sup.
  HTK Kuehne et al. (2014)             -         28.8
  TCFPN Ding and Xu (2018)             -         52.0
  GMM+CNN Kuehne et al. (2017)         -         50.7
  RNN+HMM Kuehne et al. (2018)         -         61.3
  MS-TCN Farha and Gall (2019)         -         66.3
  SSTDA Chen et al. (2020)             -         70.2
Weakly Sup.
  HTK Kuehne et al. (2014)             -         25.9
  ECTC Huang et al. (2016)             -         27.7
  Fine2Coarse Richard and Gall (2016)  -         33.3
  GRU Richard et al. (2017)            -         36.7
  NN-Viterbi Richard et al. (2018)     -         43.0
  D3TW Chang et al. (2019)             -         45.7
  CDFL Li et al. (2019)                -         50.2
Unsupervised
  Mallow Sener and Yao (2018)          -         34.6
  CTE Kukleva et al. (2019)            26.4      41.8
  VTE VidalMata et al. (2021)          -         48.1
  ASAL Li and Todorovic (2021)         37.9      52.5
  UDE Swetha et al. (2021)             31.9      47.4
  Ours (TOT)                           31.0      47.5
  Ours (TOT+TCL)                       30.3      38.9
Table 5: Results on Breakfast. The best results are in bold. The second best are underlined.
Figure 5: Segmentation results obtained by various methods for two Breakfast videos, i.e., (a) P29_cam01_P29_cereals and (b) P30_cam02_P30_sandwich. Our results are more closely aligned with the ground truth than those of CTE Kukleva et al. (2019).

We now discuss the performance of different methods on Breakfast. Tab. 5 shows the results. It can be seen that the recent work of ASAL Li and Todorovic (2021) obtains the best performance on both metrics. ASAL Li and Todorovic (2021) employs CTE Kukleva et al. (2019) for initialization and explores action-level cues for improvement, which can also be incorporated for boosting the performance of our approach (as mentioned in Sec. 5). Next, TOT outperforms the sequential representation learning and clustering approach of CTE Kukleva et al. (2019) by 4.6 and 5.7 on F1-Score and MOF respectively, while performing on par with VTE VidalMata et al. (2021) and the concurrent work of UDE Swetha et al. (2021), e.g., for MOF, TOT achieves 47.5 while VTE and UDE obtain 48.1 and 47.4 respectively. Also, the significant performance gains of TOT over the most relevant competitor CTE confirm the advantages of joint representation learning and clustering. Some qualitative results of CTE and TOT are shown in Fig. 5, where it can be seen that our results are more closely aligned with the ground truth than those of CTE. Further, combining TOT and TCL yields a similar F1-Score but a lower MOF than TOT, which might be due to large viewpoint variations across videos in Breakfast. Finally, our unsupervised approach, i.e., TOT, outperforms a few weakly-supervised methods including D3TW Chang et al. (2019), NN-Viterbi Richard et al. (2018), and GRU Richard et al. (2017), i.e., for MOF, TOT achieves 47.5 while D3TW, NN-Viterbi, and GRU obtain 45.7, 43.0, and 36.7 respectively.

4.5 Results on Desktop Assembly

Approach                    F1-Score  MOF
CTE Kukleva et al. (2019)   44.9      47.6
Ours (TOT)                  51.7      56.3
Ours (TOT+TCL)              53.4      58.1
Table 6: Results on Desktop Assembly. The best results are in bold. The second best are underlined.

Prior works, e.g., CTE Kukleva et al. (2019) and VTE VidalMata et al. (2021), often exploit temporal information via time-stamp prediction. However, the same action might occur at various time stamps across videos in practice, e.g., different actors might perform the same action at different speeds. Our approach instead leverages temporal cues via temporal optimal transport, which preserves the temporal order of the activity. Tab. 6 shows the results of CTE and our methods (i.e., TOT and TOT+TCL) on Desktop Assembly, where the activity comprises 22 actions conducted in a fixed order. From Tab. 6, TOT+TCL performs the best on both metrics, i.e., 53.4 for F1-Score and 58.1 for MOF. Further, TOT and TOT+TCL significantly outperform CTE on both metrics, i.e., TOT and TOT+TCL obtain F1-Score gains of 6.8 and 8.5 over CTE respectively, and MOF gains of 8.7 and 10.5 over CTE respectively.

5 Conclusion and Future Work

We propose a novel approach for unsupervised activity segmentation, which jointly performs representation learning and online clustering. Our approach leverages video frame clustering as the pretext task, thus directly optimizing for the main task of unsupervised activity segmentation. To exploit temporal information in videos, we make use of a combination of temporal optimal transport and temporal coherence loss. The former preserves the temporal order of the activity when computing pseudo-label cluster assignments, while the latter enforces temporally close frames to be mapped to nearby points in the embedding space and vice versa. Furthermore, our approach is online, processing one mini-batch at a time. We show comparable or superior performance against the state-of-the-art in unsupervised activity segmentation on three public datasets, i.e., 50 Salads, YouTube Instructions, and Breakfast, and our Desktop Assembly dataset, while having substantially lower memory requirements.

Although we have mostly focused on exploiting temporal/frame-level cues in videos, our approach is generally applicable to additional self-supervised losses. One direction for our future work would be using additional self-supervised losses, e.g., visual cues VidalMata et al. (2021) or action-level cues Li and Todorovic (2021). As mentioned previously in Sec. 3.2, our temporal order-preserving prior does not hold for activities with permutations. Another direction would be exploring permutation-aware models, e.g., Mallow Sener and Yao (2018).

6 Broader Impact

We discuss an approach for unsupervised activity segmentation. Specifically, given a collection of unlabeled videos capturing a human activity, it automatically segments the videos and discovers the actions/sub-activities by grouping frames across all videos into clusters, with each cluster representing one of the actions. The approach enables learning video recognition models without requiring action labels. Our approach would positively impact the problems of worker training and assistance, where models automatically built from video datasets of expert demonstrations in diverse domains, e.g., factory work and medical surgery, could be used to provide training and guidance to new workers. Similarly, there exist problems such as surgery standardization, where Operating Room video datasets could be processed with approaches such as ours to improve the standard of care for patients globally. On the other hand, video understanding algorithms could generally be used in surveillance applications, where they improve security and productivity at the cost of privacy.

References

  • [1] S. N. Aakur and S. Sarkar (2019) A perceptual prediction framework for self supervised event segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1197–1206. Cited by: §4.
  • [2] U. Ahsan, C. Sun, and I. Essa (2018) Discrimnet: semi-supervised action recognition from videos using generative adversarial networks. arXiv preprint arXiv:1801.07230. Cited by: §2.
  • [3] J. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien (2016) Unsupervised learning from narrated instruction videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4575–4583. Cited by: §1, §2, 2nd item, 3rd item, §4.3, Table 4, §4, §4.
  • [4] Y. Asano, C. Rupprecht, and A. Vedaldi (2019) Self-labelling via simultaneous clustering and representation learning. In International Conference on Learning Representations, Cited by: §2, §3.1, §3.2, §3.2.
  • [5] M. Á. Bautista, A. Sanakoyeu, E. Tikhoncheva, and B. Ommer (2016) CliqueCNN: deep unsupervised exemplar learning. In NIPS, Cited by: §2.
  • [6] Y. Bengio and J. S. Bergstra (2009) Slow, decorrelated features for pretraining complex cell-like networks. In Advances in neural information processing systems, pp. 99–107. Cited by: §2.
  • [7] F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo, and T. Tommasi (2019) Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2229–2238. Cited by: §2.
  • [8] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §2.
  • [9] M. Caron, P. Bojanowski, J. Mairal, and A. Joulin (2019) Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2959–2968. Cited by: §2.
  • [10] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. In Neural Information Processing Systems, Cited by: §2, §3.1, §3.2, §3.2.
  • [11] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §1.
  • [12] C. Chang, D. Huang, Y. Sui, L. Fei-Fei, and J. C. Niebles (2019) D3tw: discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3546–3555. Cited by: §1, Figure 5, §4.4.
  • [13] Y. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar (2018) Rethinking the faster r-cnn architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139. Cited by: §1.
  • [14] M. Chen, B. Li, Y. Bao, G. AlRegib, and Z. Kira (2020) Action segmentation with joint self-supervised temporal domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9454–9463. Cited by: §1, Figure 5.
  • [15] J. Choi, G. Sharma, S. Schulter, and J. Huang (2020) Shuffle and attend: video domain adaptation. In European Conference on Computer Vision, pp. 678–695. Cited by: §2.
  • [16] M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. Advances in neural information processing systems 26, pp. 2292–2300. Cited by: §3.2, §3.2.
  • [17] A. Diba, V. Sharma, L. V. Gool, and R. Stiefelhagen (2019) Dynamonet: dynamic action and motion network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6192–6201. Cited by: §2.
  • [18] L. Ding and C. Xu (2018) Weakly-supervised action segmentation with iterative soft boundary assignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6508–6516. Cited by: §1, Figure 5.
  • [19] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman (2019) Temporal cycle-consistency learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1801–1810. Cited by: §2.
  • [20] Y. A. Farha and J. Gall (2019) Ms-tcn: multi-stage temporal convolutional network for action segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584. Cited by: Figure 5.
  • [21] Z. Feng, C. Xu, and D. Tao (2019) Self-supervised representation learning by rotation feature decoupling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10364–10374. Cited by: §2.
  • [22] B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3636–3645. Cited by: §2.
  • [23] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes (2019) Predicting the future: a jointly learnt model for action anticipation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5562–5571. Cited by: §2.
  • [24] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord (2020) Learning representations by predicting bags of visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6928–6938. Cited by: §2.
  • [25] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [26] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun (2015) Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE international conference on computer vision, pp. 4086–4093. Cited by: §2, §3.1.
  • [27] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §2, §3.1.
  • [28] T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §2.
  • [29] S. Haresh, S. Kumar, H. Coskun, S. N. Syed, A. Konin, M. Z. Zia, and Q. Tran (2021) Learning by aligning videos in time. arXiv preprint arXiv:2103.17260. Cited by: §2.
  • [30] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: 4th item.
  • [31] G. E. Hinton and R. S. Zemel (1994) Autoencoders, minimum description length and helmholtz free energy. In Advances in neural information processing systems, pp. 3–10. Cited by: §2.
  • [32] D. Huang, L. Fei-Fei, and J. C. Niebles (2016) Connectionist temporal modeling for weakly supervised action labeling. In European Conference on Computer Vision, pp. 137–153. Cited by: §1, Figure 5.
  • [33] J. Huang, Q. Dong, S. Gong, and X. Zhu (2019) Unsupervised deep learning by neighbourhood discovery. In International Conference on Machine Learning, pp. 2849–2858. Cited by: §2.
  • [34] D. Kim, D. Cho, and I. S. Kweon (2019) Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8545–8552. Cited by: §2.
  • [35] D. Kim, D. Cho, D. Yoo, and I. S. Kweon (2018) Learning image representations by completing damaged jigsaw puzzles. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 793–802. Cited by: §2.
  • [36] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • [37] H. Kuehne, A. Arslan, and T. Serre (2014) The language of actions: recovering the syntax and semantics of goal-directed human activities. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 780–787. Cited by: §1, Figure 5, §4.
  • [38] H. Kuehne, J. Gall, and T. Serre (2016) An end-to-end generative framework for video segmentation and recognition. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–8. Cited by: §1.
  • [39] H. Kuehne, A. Richard, and J. Gall (2017) Weakly supervised learning of actions from transcripts. Computer Vision and Image Understanding 163, pp. 78–89. Cited by: §1, Figure 5.
  • [40] H. Kuehne, A. Richard, and J. Gall (2018) A hybrid rnn-hmm approach for weakly supervised temporal action segmentation. IEEE transactions on pattern analysis and machine intelligence 42 (4), pp. 765–779. Cited by: Figure 5.
  • [41] A. Kukleva, H. Kuehne, F. Sener, and J. Gall (2019) Unsupervised learning of action classes with continuous temporal embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12066–12074. Cited by: Figure 1, §1, §1, §2, §3.2, Figure 5, Figure 5, §4.1, §4.2, §4.3, §4.4, §4.5, Table 4, Table 4, Table 5, §4, §4, §4.
  • [42] G. Larsson, M. Maire, and G. Shakhnarovich (2016) Learning representations for automatic colorization. In European conference on computer vision, pp. 577–593. Cited by: §2.
  • [43] G. Larsson, M. Maire, and G. Shakhnarovich (2017) Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6874–6883. Cited by: §2.
  • [44] C. Lea, A. Reiter, R. Vidal, and G. D. Hager (2016) Segmental spatiotemporal cnns for fine-grained action segmentation. In European Conference on Computer Vision, pp. 36–52. Cited by: §1.
  • [45] H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676. Cited by: §2.
  • [46] J. Li, P. Lei, and S. Todorovic (2019) Weakly supervised energy-based learning for action segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6243–6251. Cited by: §1, Figure 5.
  • [47] J. Li and S. Todorovic (2021) Action shuffle alternating learning for unsupervised action segmentation. arXiv preprint arXiv:2104.02116. Cited by: Figure 1, §1, §1, §2, §3.2, Figure 5, §4.2, §4.3, §4.4, Table 4, Table 4, §4, §5.
  • [48] S. Li, Y. AbuFarha, Y. Liu, M. Cheng, and J. Gall (2020) Ms-tcn++: multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • [49] X. Liu, J. Van De Weijer, and A. D. Bagdanov (2018) Leveraging unlabeled data for crowd counting by learning to rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7661–7669. Cited by: §2.
  • [50] J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich, and K. Murphy (2015) What’s cookin’? interpreting cooking videos using text, speech and vision. In HLT-NAACL, Cited by: §1, §2.
  • [51] I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544. Cited by: §2.
  • [52] H. Mobahi, R. Collobert, and J. Weston (2009) Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 737–744. Cited by: §2, §3.1.
  • [53] M. Noroozi, H. Pirsiavash, and P. Favaro (2017) Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5898–5906. Cited by: §2.
  • [54] A. Richard and J. Gall (2016) Temporal action detection using a statistical language model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3131–3140. Cited by: §1, Figure 5.
  • [55] A. Richard, H. Kuehne, and J. Gall (2017) Weakly supervised action learning with rnn based fine-to-coarse modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 754–763. Cited by: §1, Figure 5, §4.4.
  • [56] A. Richard, H. Kuehne, A. Iqbal, and J. Gall (2018) Neuralnetwork-viterbi: a framework for weakly supervised video learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7386–7395. Cited by: §1, Figure 5, §4.4.
  • [57] F. Sener and A. Yao (2018) Unsupervised learning and segmentation of complex activities from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8368–8376. Cited by: Figure 1, §1, §1, §2, §3.2, Figure 5, §4.3, Table 4, §4, §4, §5.
  • [58] O. Sener, A. R. Zamir, S. Savarese, and A. Saxena (2015) Unsupervised semantic parsing of video collections. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4480–4488. Cited by: §1, §2.
  • [59] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain (2018) Time-contrastive networks: self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141. Cited by: §2.
  • [60] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. Chang (2017) Cdc: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5734–5743. Cited by: §1.
  • [61] Z. Shou, D. Wang, and S. Chang (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1049–1058. Cited by: §1.
  • [62] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 1857–1865. Cited by: §3.1.
  • [63] N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843–852. Cited by: §2.
  • [64] S. Stein and S. J. McKenna (2013) Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, pp. 729–738. Cited by: §4.
  • [65] B. Su and G. Hua (2017) Order-preserving wasserstein distance for sequence matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1049–1057. Cited by: §3.2.
  • [66] S. Swetha, H. Kuehne, Y. S. Rawat, and M. Shah (2021) Unsupervised discriminative embedding for sub-action learning in complex activities. arXiv preprint arXiv:2105.00067. Cited by: §2, Figure 5, §4.2, §4.3, §4.4, Table 4, Table 4, §4.
  • [67] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §1.
  • [68] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §1.
  • [69] R. G. VidalMata, W. J. Scheirer, A. Kukleva, D. Cox, and H. Kuehne (2021) Joint visual-temporal embedding for unsupervised learning of actions in untrimmed sequences. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1238–1247. Cited by: Figure 1, §1, §1, §2, §3.2, Figure 5, §4.2, §4.3, §4.4, §4.5, Table 4, Table 4, §4, §4, §5.
  • [70] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103. Cited by: §2.
  • [71] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In Advances in neural information processing systems, pp. 613–621. Cited by: §2.
  • [72] H. Wang and C. Schmid (2013) Action recognition with improved trajectories. In Proceedings of the IEEE international conference on computer vision, pp. 3551–3558. Cited by: 1st item.
  • [73] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803. Cited by: §1.
  • [74] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §3.1.
  • [75] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487. Cited by: §2.
  • [76] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang (2019) Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10334–10343. Cited by: §2.
  • [77] X. Yan, I. Misra, A. Gupta, D. Ghadiyaram, and D. Mahajan (2020) Clusterfit: improving generalization of visual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6509–6518. Cited by: §2.
  • [78] J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5147–5156. Cited by: §2.
  • [79] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan (2019) Graph convolutional networks for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7094–7103. Cited by: §1.
  • [80] C. Zhuang, A. L. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6002–6012. Cited by: §2.
  • [81] W. Y. Zou, A. Y. Ng, and K. Yu (2011) Unsupervised learning of visual invariance with temporal coherence. In NIPS 2011 workshop on deep learning and unsupervised feature learning, Vol. 3. Cited by: §2.
  • [82] W. Zou, S. Zhu, K. Yu, and A. Y. Ng (2012) Deep learning of invariant features via simulated fixations in video. In Advances in neural information processing systems, pp. 3203–3211. Cited by: §2.