Official implementation of the paper: Unsupervised learning of action classes with continuous temporal embedding
The task of temporally detecting and segmenting actions in untrimmed videos has recently seen increased attention. One problem in this context arises from the need to define and label action boundaries to create annotations for training, which is very time- and cost-intensive. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. To this end, we use a continuous temporal embedding of framewise features to benefit from the sequential nature of activities. Based on the latent space created by the embedding, we identify clusters of temporal segments across all videos that correspond to semantically meaningful action classes. The approach is evaluated on three challenging datasets, namely the Breakfast dataset, YouTube Instructions, and the 50Salads dataset. While previous works assumed that the videos contain the same high-level activity, we furthermore show that the proposed approach can also be applied to a more general setting where the content of the videos is unknown.
The task of action recognition has seen tremendous success over the last years. So far, high-performing approaches require full supervision for training, but acquiring frame-level annotations of actions in untrimmed videos is very expensive and impractical for large datasets. Recent works therefore explore alternative ways of training action recognition approaches without full frame-level annotations at training time. Most of these approaches, referred to as weakly supervised learning, rely on ordered action sequences that are given for each video in the training set.
Acquiring ordered action lists, however, can also be very time consuming, and it assumes that the actions present in the videos are already known before the annotation process starts. For some applications, like indexing large video datasets or human behavior analysis in neuroscience or medicine, it is often unclear which actions should be annotated. It is therefore important to discover actions in large video datasets before deciding which actions are relevant. Recent works [27, 1] therefore proposed the task of unsupervised learning of actions in long, untrimmed video sequences. Here, only the videos themselves are used, and the goal is to identify clusters of temporal segments across all videos that correspond to semantically meaningful action classes.
In this work, we propose a new method for unsupervised learning of actions from long video sequences, which is based on the following contributions. The first contribution is the learning of a continuous temporal embedding of frame-based features. The embedding exploits the fact that some actions need to be performed in a certain order: we use a network to learn an embedding of frame-based features with respect to their relative time in the video. As the second contribution, we propose a decoding of the videos into coherent action segments based on an ordered clustering of the embedded frame-wise video features. To this end, we first compute the order of the clusters with respect to their mean timestamps. A Viterbi decoding, as used in [26, 13, 24, 19], then maintains an estimate of the most likely activity state given the predefined order.
We evaluate our approach on the Breakfast and YouTube Instructions datasets, following the evaluation protocols used in [27, 1]. We also conduct experiments on the 50Salads dataset, where the videos are longer and contain more action classes. Our approach outperforms the state-of-the-art in unsupervised learning of action classes from untrimmed videos by a large margin. The evaluation protocol used in previous works, however, divides the datasets into distinct groups of videos using the ground-truth activity label of each video, i.e., unsupervised learning and evaluation are performed only on videos that contain the same high-level activity. This simplifies the problem since, in this case, most of the actions occur in all videos.
As a third contribution, we therefore propose an extension of our approach that goes beyond the scenario of processing only videos from known activity classes, i.e., we discover semantic action classes from all videos of each dataset at once, in a completely unsupervised way without any knowledge of the related activity. To this end, we learn a continuous temporal embedding for all videos and use the embedding to build a representation for each untrimmed video. After clustering the videos, we identify consistent video segments for all videos within a cluster. In our experiments, we show that the proposed approach not only outperforms the state-of-the-art using the simplified protocol, but is also capable of learning actions in a completely unsupervised way. Code is available online: https://github.com/annusha/unsup_temp_embed
Action recognition [18, 32, 30, 3, 5] as well as the understanding of complex activities [15, 35, 29] has been studied for many years with a focus on fully supervised learning. Recently, there has been an increased interest in methods that can be trained with less supervision. One of the first works in this field has been proposed by Laptev et al., where the authors learn actions from movie scripts. Another dataset that follows the idea of using subtitles has been proposed by Alayrac et al., also using YouTube videos to automatically learn actions from instructional videos. A multi-modal version of this idea has also been proposed: here, the authors collected cooking videos from YouTube and used a combination of subtitles, audio, and vision to identify recipe steps in videos. Another way of learning from subtitles is proposed by Sener et al. by representing each frame via the occurrence of action atoms given the visual comments at this point. These works, however, assume that the narrative text is well aligned with the visual data. Other forms of weak supervision are video transcripts [12, 17, 24, 7, 26], which provide the order of actions but are not aligned with the videos, or video tags [33, 25].
There are also efforts towards unsupervised learning of action classes. One of the first works tackling the problem of human motion segmentation without training data was proposed by Guerra-Filho and Aloimonos. They propose a basic segmentation with subsequent clustering based on sensory-motor data. Based on those representations, they propose the application of a parallel synchronous grammar system to learn atomic action representations similar to words in language. Another work in this context is proposed by Fox et al., where a Bayesian nonparametric approach helps to jointly model multiple related time series without further supervision. They apply their work to motion capture data.
In the context of video data, the temporal structure of videos has been exploited to fine-tune networks on training data without labels [34, 2]. The temporal ordering of video frames has also been used to learn feature representations for action recognition [20, 23, 9, 4]. Lee et al. learn a video representation in an unsupervised manner by solving a sequence sorting problem. Ramanathan et al. build their temporal embedding by leveraging contextual information of each frame on different resolution levels. Fernando et al. presented a method to capture the temporal evolution of actions based on frame appearance by learning a frame ranking function per video. In this way, they obtain a compact latent space for each video separately. A similar approach to learn a structured representation of postures and their temporal development was proposed by Milbich et al. While these approaches address different tasks, Sener et al. proposed an unsupervised approach for learning action classes. They introduced an iterative approach which alternates between discriminative learning of the appearance of sub-activities from visual features and generative modeling of the temporal structure of sub-activities using a Generalized Mallows Model.
As input we are given a set of videos, and each video is represented by framewise features. The task is then to estimate the subaction label for each video frame. Following the protocol of [1, 27], we define the number K of possible subactions separately for each activity as the maximum number of subactions occurring in the ground truth. The values of K are provided in the supplementary material.
Fig. 1 provides an overview of our approach for unsupervised learning of actions from long video sequences. First, we learn an embedding of all features with respect to their relative time stamps, as described in Sec. 3.2. The resulting embedded features are then clustered, and the mean temporal occurrence of each cluster is computed. This step, as well as the temporal ordering of the clusters, is described in Sec. 3.3. Each video is then decoded with respect to this ordering given the overall proximity of each frame to each cluster, as described in Sec. 3.4.
We also present an extension to a more general protocol, where the videos have a higher diversity. Instead of assuming as in [1, 27] that the videos contain the same high-level activity, we discuss the completely unsupervised case in Sec. 3.5. We finally introduce a background model to address background segments in Sec. 3.6.
The idea of learning a continuous temporal embedding relies on the assumption that similar subactions tend to appear within a similar temporal range of a complex activity. For instance, a subaction like “take cup” will usually occur at the beginning of the activity “making coffee”. After that, people typically pour coffee into the mug and finally stir it. Thus, many subactions that are executed to conduct a specific activity are softly bound to their temporal position within the video.
To capture the combination of visual appearance and temporal consistency, we model a continuous latent space that simultaneously captures relative time dependencies and the visual representation of the frames. For the embedding, we train a network that maps each framewise feature of an activity to its relative time in the video. As shown in Fig. 1, the network is an MLP with two hidden layers and logistic activation functions. As loss, we use the mean squared error between the predicted time stamp and the true time stamp of the feature. The embedding is then given by the second hidden layer.
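The embedding network can be sketched as follows. This is a minimal numpy illustration, not the authors' released code: the hidden-layer sizes, learning rate, and epoch count are placeholder values, and plain gradient descent stands in for whatever optimizer was actually used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TemporalEmbedding:
    """Two-hidden-layer MLP that regresses the relative timestamp of a frame.

    The embedding of a feature is the activation of the second hidden layer.
    Layer sizes and the learning rate are illustrative placeholders.
    """

    def __init__(self, in_dim, h1=64, h2=32, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (in_dim, h1)); self.b1 = np.zeros(h1)
        self.W2 = rng.normal(0.0, 0.1, (h1, h2));     self.b2 = np.zeros(h2)
        self.W3 = rng.normal(0.0, 0.1, (h2, 1));      self.b3 = np.zeros(1)
        self.lr = lr

    def _forward(self, X):
        self.A1 = sigmoid(X @ self.W1 + self.b1)
        self.A2 = sigmoid(self.A1 @ self.W2 + self.b2)  # the embedding
        return (self.A2 @ self.W3 + self.b3).ravel()    # predicted relative time

    def fit(self, X, t, epochs=300):
        """Minimize the MSE between predicted and true relative timestamps."""
        for _ in range(epochs):
            pred = self._forward(X)
            g = (pred - t)[:, None] / len(t)             # d(MSE)/d(pred), up to 2x
            dA2 = (g @ self.W3.T) * self.A2 * (1 - self.A2)
            dA1 = (dA2 @ self.W2.T) * self.A1 * (1 - self.A1)
            self.W3 -= self.lr * (self.A2.T @ g);   self.b3 -= self.lr * g.sum(0)
            self.W2 -= self.lr * (self.A1.T @ dA2); self.b2 -= self.lr * dA2.sum(0)
            self.W1 -= self.lr * (X.T @ dA1);       self.b1 -= self.lr * dA1.sum(0)

    def embed(self, X):
        """Second hidden-layer activations: the continuous temporal embedding."""
        return sigmoid(sigmoid(X @ self.W1 + self.b1) @ self.W2 + self.b2)
```

After training, only `embed` is used downstream; the timestamp prediction head exists solely to shape the latent space.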
After the embedding, the features of all videos are clustered into K clusters by k-means. Since in Sec. 3.4 we need the probability p(x_t | k), i.e., the probability that the embedded feature x_t belongs to cluster k, we estimate a d-dimensional Gaussian distribution for each cluster:

p(x_t | k) = N(x_t; mu_k, Sigma_k).   (1)
Note that this clustering does not define any specific ordering. To order the clusters with respect to their temporal occurrence, we compute the mean over the relative time stamps of all frames belonging to each cluster:

m_k = (1 / |X_k|) * sum_{x_t in X_k} t / T,   (2)

where X_k is the set of embedded features assigned to cluster k and t / T is the relative time stamp of frame t in a video of length T. The clusters are then ordered with respect to m_k, so that C = (c_1, ..., c_K) is the sequence of ordered cluster labels subject to m_{c_1} <= ... <= m_{c_K}. The resulting ordering is then used for the decoding of each video.
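The per-cluster Gaussians and the temporal ordering can be sketched as below. As a simplifying assumption for the illustration, a diagonal covariance replaces the full d-dimensional Gaussian, and the k-means labels are taken as given input.

```python
import numpy as np

def cluster_model(E, rel_time, labels, K, eps=1e-6):
    """Fit a diagonal Gaussian per cluster and order clusters by mean timestamp.

    E: (N, d) embedded frames; rel_time: (N,) relative timestamps in [0, 1];
    labels: (N,) k-means cluster indices. The diagonal covariance is a
    simplification of the full d-dimensional Gaussian described in the text.
    """
    mus, variances, mean_time = [], [], []
    for k in range(K):
        member = labels == k
        mus.append(E[member].mean(axis=0))
        variances.append(E[member].var(axis=0) + eps)  # eps avoids zero variance
        mean_time.append(rel_time[member].mean())
    order = np.argsort(mean_time)  # ordered cluster labels c_1, ..., c_K
    return np.array(mus), np.array(variances), np.array(order)

def log_likelihoods(E, mus, variances):
    """(N, K) matrix of log p(x_t | k) under the diagonal Gaussians."""
    diff = E[:, None, :] - mus[None, :, :]
    return -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None], axis=2
    )
```

The `order` array is exactly the ordered label sequence used by the decoding step, and `log_likelihoods` supplies the frame-to-cluster scores it consumes.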
We finally segment each video separately, i.e., we assign each frame x_t to one of the ordered clusters c_1, ..., c_K. We first calculate the probability p(x_t | k) that frame x_t belongs to cluster k as defined by (1). Based on the cluster probabilities for the given video, we maximize the probability of the label sequence following the order of the clusters to obtain consistent assignments for each frame of the video:

argmax_{l_1, ..., l_T} prod_{t=1}^{T} p(x_t | l_t) * p(l_t | l_{t-1}),   (3)

where p(x_t | l_t) is the probability that x_t belongs to the cluster l_t, and p(l_t | l_{t-1}) are the transition probabilities of moving from the label at frame t-1 to the next label at frame t. The transition probabilities are non-zero only if l_t equals l_{t-1} or the next cluster in the ordered cluster list, i.e., we either keep the cluster assignment of the previous frame or advance to the next cluster. Note that (3) can be solved efficiently using a Viterbi algorithm.
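A minimal Viterbi decoder for this monotone stay-or-advance structure is sketched below. As an assumption of the sketch, the path is constrained to start in the first ordered cluster and end in the last one (so that every video is given the full subaction sequence and decoding marginalizes absent subactions by assigning them very few frames).

```python
import numpy as np

def viterbi_decode(frame_loglik, order):
    """Monotone Viterbi decoding over the ordered clusters.

    frame_loglik: (T, K) log p(x_t | k); order: cluster indices sorted by mean
    timestamp. At each frame the path either stays in the current cluster or
    advances to the next one in the ordering (assumes T >= K).
    Returns the per-frame cluster labels.
    """
    lp = frame_loglik[:, order]  # reorder columns into temporal order
    T, K = lp.shape
    D = np.full((T, K), -np.inf)      # best log-score ending in state k at frame t
    back = np.zeros((T, K), dtype=int)
    D[0, 0] = lp[0, 0]                # path starts in the first ordered cluster
    for t in range(1, T):
        for k in range(K):
            stay = D[t - 1, k]
            advance = D[t - 1, k - 1] if k > 0 else -np.inf
            if stay >= advance:
                D[t, k], back[t, k] = stay + lp[t, k], k
            else:
                D[t, k], back[t, k] = advance + lp[t, k], k - 1
    path = np.empty(T, dtype=int)
    path[-1] = K - 1                  # and ends in the last ordered cluster
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return order[path]
```

Because transitions only stay or advance, the decoded labels are guaranteed to follow the temporal cluster ordering.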
So far, we discussed the case of applying unsupervised learning to a set of videos that all belong to the same activity. When moving to a larger set of videos without any knowledge of the activity class, the assumption that all videos share the same subactions does not hold anymore. As illustrated in Fig. 2, we therefore first cluster the videos into more consistent video subsets.
Similar to the previous setting, we learn a d-dimensional embedding of the features, but the embedding is not restricted to a subset of the training data; it is computed for the whole dataset at once. Afterward, the embedded features are clustered in this space to build a video representation based on bag-of-words, using quantization with a soft assignment. In this way, we obtain a single bag-of-words feature vector per video sequence. Using this representation, we cluster the videos into K' video sets. For each video set, we then separately infer K clusters for subactions and assign them to each video frame as in Fig. 1. However, we do not learn an embedding for each video set but use the embedding learned on the entire dataset. The impact of K and K' will be evaluated in the experimental section.
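The soft-assignment bag-of-words descriptor for one video can be sketched as follows. The Gaussian kernel on the squared distance and its bandwidth are illustrative choices; the paper does not fix a particular soft-assignment function here.

```python
import numpy as np

def bow_video_vector(E_video, centers, bandwidth=1.0):
    """Soft-assignment bag-of-words descriptor for one video.

    E_video: (T, d) embedded frames of the video; centers: (V, d) visual words
    obtained by clustering the embeddings of the whole dataset. Each frame
    distributes a unit of mass over the words via a Gaussian kernel on the
    squared distance (an assumed kernel for this sketch).
    """
    d2 = ((E_video[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (T, V)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    w /= w.sum(axis=1, keepdims=True)  # soft assignment of each frame
    hist = w.sum(axis=0)
    return hist / hist.sum()           # L1-normalized video descriptor
```

The resulting fixed-length vectors are what gets clustered into the K' video sets; replacing the soft weights with a hard argmax yields the hard-assignment variant compared in the experiments.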
As subactions are not always executed continuously and without interruption, we also address the problem of modeling a background class. In order to decide whether a frame should be assigned to one of the clusters or to the background, we introduce a parameter τ which defines the fraction of features that should be assigned to the background. To this end, we keep only the (1 − τ) portion of the points within each cluster that are closest to the cluster center and add the remaining features to the background class. For the labeling described in Sec. 3.4, we remove all frames that have been assigned to the background before estimating (3), i.e., the background frames are labeled first and the remaining frames are then assigned to the ordered clusters.
The Breakfast dataset is a large-scale dataset that comprises ten different complex activities of performing common kitchen tasks, with approximately eight subactions per activity class. The duration of the videos varies significantly; e.g., preparing coffee takes on average 30 seconds, while cooking pancakes takes roughly 5 minutes. There are also considerable variations in the subaction ordering. For evaluation, we use the reduced Fisher vector features proposed in prior work and follow the respective protocol, if not mentioned otherwise.
The YouTube Instructions dataset contains 150 videos from YouTube with an average length of about two minutes per video. There are five primary tasks: making coffee, changing a car tire, performing CPR, jump-starting a car, and potting a plant. The main difference with respect to the Breakfast dataset is the presence of a background class. The fraction of background frames varies from 46% to 83% across tasks. We use the originally provided precomputed features.
The 50Salads dataset contains 4.5 hours of video of different people performing a single complex activity: preparing a mixed salad. Compared to the other datasets, the videos are much longer, with an average length of 10k frames. We perform the evaluation on two different action granularity levels proposed by the authors: mid-level with 17 subaction classes and eval-level with 9 subaction classes.
Since the output of the model consists of temporal subaction bounds without any particular correspondence to ground-truth labels, we need a one-to-one mapping between the predicted clusters and the ground-truth labels to evaluate and compare the method. Following previous work, we use the Hungarian algorithm to obtain this one-to-one matching and report accuracy as the mean over frames (MoF) for the Breakfast and 50Salads datasets. Note that MoF is not always suitable for imbalanced datasets. We therefore also report the Jaccard index, i.e., the intersection over union (IoU), as an additional measure. For the YouTube Instructions dataset, we also report the F1-score since it is used in previous works. Precision and recall are computed by evaluating whether the time interval of a segment falls inside the corresponding ground-truth interval. To check whether a segment matches a time interval, we randomly draw frames of the segment. The detection is considered correct if at least half of the drawn frames match the respective class, and incorrect otherwise. Precision and recall are computed over all videos, and the F1-score is the harmonic mean of precision and recall.
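The cluster-to-label matching and the MoF metric can be sketched as follows. The brute-force search over permutations is only an illustration that assumes a small number of clusters; it is equivalent to the Hungarian matching used in the evaluation, which scales polynomially.

```python
import numpy as np
from itertools import permutations

def match_and_mof(pred, gt, K):
    """One-to-one mapping of predicted clusters to ground-truth labels.

    pred, gt: per-frame predicted cluster indices and ground-truth labels,
    both in {0, ..., K-1}. Returns the mapping that maximizes the mean-over-
    frames (MoF) accuracy, together with that accuracy.
    """
    conf = np.zeros((K, K), dtype=int)  # confusion counts: predicted x ground truth
    for p, g in zip(pred, gt):
        conf[p, g] += 1
    best_perm, best_hits = None, -1
    for perm in permutations(range(K)):
        hits = sum(conf[k, perm[k]] for k in range(K))
        if hits > best_hits:
            best_hits, best_perm = hits, perm
    mapping = dict(enumerate(best_perm))
    return mapping, best_hits / len(pred)
```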
First, we analyze the impact of the proposed temporal embedding by comparing the proposed method to other embedding strategies as well as to different feature types without embedding on the Breakfast dataset. As features, we consider AlexNet fc6 features pre-trained on ImageNet, I3D features based on the RGB and flow streams, and pre-computed dense trajectories. We further compare with previous works that focus on learning a temporal embedding [23, 9]. We trained these models following the settings of each paper and use the resulting latent spaces in place of ours.
As can be seen in Table 1, the results with the continuous temporal embedding clearly outperform all the above-mentioned approaches with and without temporal embedding. We also used OPN to learn an embedding, which was then used in our approach. However, we observed that for long videos nearly all frames were assigned to a single cluster. Even when excluding the long videos with degenerated results, the MoF was lower compared to our approach.
|Table 1: Comparison of temporal embedding strategies: ImageNet, I3D, dense trajectory, video vector, and video darwin features, each combined with the proposed method|
We compare our approach, which uses Viterbi decoding, with the Mallow model decoding proposed in [27]. The authors propose a rank-loss embedding over all video frames of the same activity with respect to a pseudo ground-truth subaction annotation. The embedded frames of the whole activity set are then clustered, and the likelihood of each frame for each cluster is computed. For the decoding, the authors build a histogram of features with respect to their clusters with a hard assignment and set the length of each action according to the overall number of features per bin. After that, they apply a Mallow model to sample different orderings for each video. The resulting model is a combination of Mallow model sampling and action length estimation based on the frame distribution.
For the first experiment, we evaluate the impact of the different decoding strategies on the proposed embedding. In Table 2, we compare the results of decoding with the Mallow model only, Viterbi only, and a combined Mallow-Viterbi decoding. For the combination, we first sample a Mallow ordering as described in [27], leading to an alternative ordering. We then apply Viterbi decoding to the new as well as to the original ordering and choose the sequence with the higher probability. It shows that the original combination of Mallow model and multinomial distribution sampling performs worst on the temporal embedding. Also, the combination of Viterbi and Mallow model cannot outperform Viterbi decoding alone.

For a closer look, we visualize the observation probabilities as well as the resulting decoding path over time for two videos in Fig. 3. The decoding, which is always given the full sequence of subactions, is able to marginalize out subactions that do not occur in the video by assigning only very few frames to them and the majority of frames to the clusters that do occur. Overall, this strategy of marginalization usually performs better than re-ordering the subaction sequence as done by the Mallow model.

To further compare with [27], we additionally evaluate the impact of the two decoding strategies, Mallow model and Viterbi, with respect to the two embeddings, rank-loss and continuous temporal embedding, in Table 3. The rank-loss embedding works well in combination with the multinomial Mallow model but fails when combined with Viterbi decoding because of the missing temporal prior, whereas the Mallow model is not able to decode sequences in the continuous temporal embedding space. This shows the necessity of a suitable combination of embedding and decoding strategy.
|Table 2: Mallow vs. Viterbi decoding|
|Table 3: Comparison of the rank-loss and continuous temporal embeddings decoded with the Mallow model and with Viterbi (MoF)|
Finally, we assess the impact of the proposed background model. For this evaluation, we choose the YouTube Instructions dataset. Note that two different evaluation protocols have been proposed for this dataset so far. Previous works usually evaluate results on the YTI dataset without any background frames, which means that during evaluation only frames with a class label are considered and all background frames are ignored. Note that in this case it is not penalized if estimated subactions become very long and cover the background. Including background for a dataset with a high background portion, however, leads to the problem that a high MoF accuracy can be achieved by labeling most frames as background. For this evaluation, we therefore introduce the Jaccard index (IoU) as an additional measure, which is also common in comparable weak learning scenarios. In the following, we vary the desired background ratio τ, as described in Sec. 3.6, and show the results in Fig. 4. As can be seen, a smaller background ratio leads to better results when computing MoF without background, whereas a higher background ratio leads to better results when the background is considered in the evaluation. Comparing the IoU with and without background shows that the IoU without background suffers from the same problems as the MoF in this case, but that the IoU with background gives a good measure of the trade-off between background and class labels.
|Table 4: Comparison with the state of the art on Breakfast, including HTK+DTF w. PCA|
We further compare the proposed approach to current state-of-the-art approaches, considering unsupervised learning setups as well as weakly and fully supervised approaches. However, even though the evaluation metrics are directly comparable to those of weakly and fully supervised approaches, one needs to consider that the results of unsupervised learning are reported with respect to an optimal assignment of clusters to ground-truth classes and therefore represent the best possible scenario for this task.
We compare our approach to recent works on the Breakfast dataset in Table 4. As already discussed in Sec. 4.4, our approach outperforms the current state-of-the-art for unsupervised learning on this dataset by a large margin. The resulting segmentation is also comparable to the results of the best weakly supervised system so far and outperforms all other recent works in this field. In the case of YouTube Instructions, we compare to the approaches of [1] and [27] for the case of unsupervised learning only. Note that we follow their protocol and report the accuracy of our system without considering background frames. Here, our approach again outperforms both recent methods with respect to MoF as well as F1-score. A qualitative example of the segmentation on both datasets is given in Fig. 5.
Although we cannot compare with other unsupervised methods on the 50Salads dataset, we compare our approach with the state-of-the-art for weakly and fully supervised learning in Table 6. Each video in this dataset has a different order of subactions and includes many repetitions. This makes unsupervised learning very difficult compared to weakly or fully supervised learning. Nevertheless, the achieved MoF accuracy is still competitive for an unsupervised method.
|Table 6: Comparison on 50Salads with fully supervised methods (eval and mid granularity) and a weakly supervised method (mid granularity)|
Finally, we assess the performance of our approach in the completely unsupervised setting described in Sec. 3.5. Thus, no activity classes are given and all videos are processed together. For the evaluation, we again perform matching by the Hungarian method and match all subactions, independent of their video cluster, to all possible action labels. In the following, we conduct all experiments on the Breakfast dataset and report MoF accuracy unless stated otherwise. For Breakfast, we assume K' activity clusters with K subactions per cluster, resulting in 50 subaction clusters that we match to the 48 ground-truth subaction classes, whereas the frames of the leftover clusters are set to background. For the evaluation of the activity clusters, we perform the Hungarian matching on activity level as described earlier.

Activity-level clustering. We first evaluate the correctness of the resulting activity clusters with respect to the proposed bag-of-words clustering. We therefore evaluate the accuracy of the completely unsupervised pipeline with and without bag-of-words clustering, as well as with hard and soft assignment. As can be seen in Table 7, omitting the quantization step significantly reduces the overall accuracy of the video-based clustering.
|Table 7: Accuracy of activity clustering (mean over videos) with bag-of-words hard vs. soft assignment|
Influence of additional embedding. We also evaluate the impact of learning only one embedding for the entire dataset as in Fig. 2 versus learning additional embeddings for each video set. The results in Table 8 show that a single embedding learned on the entire dataset achieves 18.3% MoF accuracy. If we learn additional embeddings for each of the video clusters, the accuracy even slightly drops. For completeness, we also compare our approach to a very simple baseline that applies k-means clustering to the video features without any embedding; this baseline achieves only a low MoF accuracy. This shows that a single embedding learned on the entire dataset performs best.
|Table 8: Full pipeline with vs. without additional cluster embeddings|
Influence of cluster size. For all previous evaluations, we approximated the cluster sizes based on the ground-truth number of classes. We therefore evaluate how the ratio of activity to subaction clusters influences the overall performance. To this end, we fix the overall number of final subaction clusters to 50, to allow a mapping to the 48 ground-truth subaction classes, and vary the ratio of activity (K') to subaction (K) clusters. Table 9 shows the influence of the various cluster sizes. Omitting the activity clustering (K' = 1) leads to significantly worse results, while, depending on the measure, good results are achieved for the intermediate ratios.
|Table 9: Influence of the cluster size, varying K' / K over 1/50, 2/25, 3/16, 5/10, and 10/5, evaluated by mean over videos, MoF, and IoU|
Unsupervised learning on YouTube Instructions. Finally, we evaluate the accuracy of the completely unsupervised learning setting on the YouTube Instructions dataset in Table 10. We use fixed numbers of activity and subaction clusters and follow the protocol described in Sec. 4.5, i.e., we report the accuracy with respect to the background ratio τ as MoF and IoU, with and without background frames. As already observed in Fig. 4, the IoU with background frames is the only reliable measure in this case, since the other measures can be optimized by declaring all or none of the frames as background. Overall, we observe a good trade-off between background and class labels for a moderate background ratio.
|Table 10: Influence of the background ratio τ on YouTube Instructions: MoF and IoU, each without and with background|
We proposed a new method for the unsupervised learning of actions in sequential video data. Based on the observation that actions are not performed in an arbitrary order and are thus bound to their temporal location in a sequence, we propose a continuous temporal embedding that encourages clusters at similar temporal stages. We combine the temporal embedding with a frame-to-cluster assignment based on Viterbi decoding, which outperforms all other approaches in the field. Additionally, we introduced the task of unsupervised learning without any given activity classes, which has not been addressed by any other method so far. We show that the proposed approach also works on this less restricted but more realistic task.
Acknowledgment. The work has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – GA 1927/4-1 (FOR 2535 Anticipating Human Behavior), KU 3396/2-1 (Hierarchical Models for Action Recognition and Analysis in Video Data), and the ERC Starting Grant ARCA (677650). This work has been supported by the AWS Cloud Credits for Research program.