In the past few decades, with the increasing ubiquity and accessibility of video recordings, a significant body of research has been devoted to analysing human behaviour in video, including human actions. This active area of research has vast applications, such as health-care, surveillance, video retrieval, entertainment, robotics and human-computer interaction. At its heart, action recognition has evolved from traditional hand-crafted features [laptev2005space, wang2013action, wang2013motionlets] to deep learning based approaches [karpathy2014large, simonyan2014two, carreira2017quo, girdhar2017actionvlad], achieving remarkable results. However, recent works have focused on proposing network architectures to deal with the spatio-temporal input, and have neglected fake or incomplete action instances.
In this paper, we focus on incomplete actions, whether intentional or accidental, which can be crucial in contexts such as surveillance and health-care applications. These are actions which are attempted but whose goals remain incomplete. Such incomplete sequences could be incorrectly recognised, or localised, by current state-of-the-art methods. As an example, consider an incomplete pick, where the subject only pretends to pick an object up. Standard action recognition classifiers would identify this as a pick action, because its motion is similar to that of successfully completed picks. Likewise, a patient picking up a medicine tablet, but not ingesting it, would also be incorrectly recognised as a "take medicine" action, posing risks to automatic monitoring of their health.
Action completion was first introduced in [Heidari2016] to assess whether the action’s goal is achieved. The approach outputs sequence-level predictions of completion to distinguish complete sequences from incomplete ones. Subsequent works [Heidari2018, becattini2017am] proposed finer-grained analysis of an action’s progression towards completion. For example, [Heidari2018] looks for visual clues which confirm the goal’s completion and detects the completion moment from frame-level pre- and post-completion labels.
In this work, we also investigate completion moment detection. However, we differ from [Heidari2018] in the supervision by which our method learns to detect completion. Frame-level annotations are not only expensive to collect but, importantly, highly subjective and often noisy [moltisanti2017trespassing, sigurdsson2016much, moltisanti19action]. We offer the first attempt at completion moment detection with weak supervision, i.e. using only sequence-level complete and incomplete labels. Fig. 1 illustrates frame-level and sequence-level labels for a complete pick action. Given weak labels, we show that completion moment detection can be achieved by learning temporal attention.
Fig. 1: Frame-level pre- and post-completion labels (left) vs. a sequence-level complete label (right).
We propose to use convolutional and recurrent cells with learnt temporal attention, and to accumulate evidence for completion from all frames along the sequence, where evidence is weighted according to the frame's importance to the completion moment prediction. A similar approach was attempted in [Heidari2018], though with full supervision and without temporal attention, so all frames contributed equally to the completion moment detection. We show that our proposed approach outperforms [Heidari2018] when fully supervised but, importantly, is also able to detect completion using weak video-level supervision.
We evaluate our approach on selected actions from HMDB [HMDB], UCF101 [UCF101] and RGBD-AC [Heidari2016]. We show that learning temporal attention decreases the completion detection error, i.e. the relative distance between the predicted and ground truth completion moment, by 15% of the sequence length with weak supervision and by 3% when fully supervised.
2 Related Work
In this section, we differentiate our work from approaches that attempted moment detection in actions, including for action completion. We also review works that utilised temporal attention learning, including for action localisation.
Moment Detection: Temporal action detection from untrimmed videos [xu2017r, shou2016temporal, chao2018rethinking] involves localising start and end points of actions. These works assume all actions are successfully completed, and do not consider incomplete attempts. A few methods [becattini2017am, hoai2014max, dwibedi2019temporal, yeung2016glimpses, ma2016learning], on the other hand, have adopted approaches that model action progression or detect particular key moments within actions. Ma et al. [ma2016learning] detect an action by learning its progression through time. They devise a loss which maximises the margin between the correct action class and other classes as the action progresses further. Hoai and De la Torre [hoai2014max]
also detect actions in untrimmed sequences, where the action progression is modelled by a score function, learned using a Support Vector Machine classifier, that peaks when the action ends. Similarly, Becattini et al. [becattini2017am] attempt to recognise actions by modelling their evolution through time, where the progress is assumed to be linear, reaching its highest at the end of the sequence. Dwibedi et al. [dwibedi2019temporal] propose a self-supervised approach to learn temporal alignment between sequences based on the similarity between their frames, and then observe an action's progression between key frames given the learnt alignment. Yeung et al. [yeung2016glimpses] detect actions by looking at individual frames through the sequence, where the location of the next input frame is predicted relative to the current frame. Although these works present a fine-grained analysis of action progression, they too only consider complete attempts and do not detect or localise the completion moment.
Action completion [Heidari2016] differs from these works, as it focuses on the action’s goal. In [Heidari2018], completion moment detection was addressed using a classification-regression network which outputs frame-level predictions. These predictions are accumulated, using voting, to detect the completion moment. However, the method is fully supervised, requiring the completion moment annotations for training. In contrast, we solve the same problem using only sequence-level complete and incomplete labels (i.e. weak labels), through utilising temporal attention learning.
Attention Learning: Attention has proven beneficial for research problems such as image captioning [xu2015show, yao2015describing, chen2017sca], object detection and tracking [caicedo2015semantic, denil2012learning, ba2014multiple] and person re-identification [haque2016recurrent, li2018diversity, xu2018attention]. Recently, action recognition and localisation have also used attention networks to learn which spatial and/or temporal regions contain the most discriminative information. While some works [sharma2015action, girdhar2017attentional, li2018videolstm, wang2016hierarchical] have only focused on frame-level attention (spatial and motion), many others [du2018recurrent, song2017end, li2019unified, pei2017temporal, yeung2018every, wang2017untrimmednets, paul2018w, nguyen2018weakly] have also incorporated temporal attention in their models. They learn attention scores over the temporal dimension which are then used to weight the frames according to their importance to the final prediction. Of these, Pei et al. [pei2017temporal] introduce a recurrent unit for sequence classification in which a high attention score at each time step pushes the network to focus on the current observations rather than past ones. Song et al. [song2017end] use an LSTM for learning temporal attention from skeleton data in action recognition. Du et al. [du2018recurrent] also propose an approach for action recognition using an LSTM with temporal softmax normalisation. Weighted observations, through learnt attention, from all frames in the sequence are combined to recognise the current frame's ongoing action.
For action localisation with weak supervision, several approaches have also attempted learning temporal attention, such as [wang2017untrimmednets, paul2018w, nguyen2018weakly, li2019unified, yeung2018every]. For example, Yeung et al. [yeung2018every] learn temporal attention for dense labelling in action localisation. Since they use trimmed sequences for training, but apply the learnt attention to localise actions in untrimmed sequences, their detection is considered weakly supervised. Other works, however, use untrimmed sequences in training. Li et al. [li2019unified] apply attention for action recognition and action detection in untrimmed sequences, using features from multiple modalities as the input to the temporal attention LSTM before softmax normalisation. Nguyen et al. [nguyen2018weakly]
learn attention for action classification. They normalise the attention scores with a sigmoid function, and then use these to estimate the discriminative class-specific temporal regions for localising actions. Wang et al. [wang2017untrimmednets] predict the action's temporal extents by combining hard and soft selection methods, where the soft selection relies on the attention weights for the clip proposals sampled from the untrimmed sequences. In [paul2018w], the attention scores are first predicted as a temporal softmax on the class-wise activations and used during training. A threshold is then applied to the class-wise activations to localise actions.
In our method, we also use an LSTM for learning attention and a temporal softmax for its normalisation. However, our method differs not only in the problem of completion moment detection, but in how we accumulate evidence from all frames in the sequence based on learnt attention. We localise the completion moment within trimmed sequences for both training and evaluation.
3 Temporal Attention for Completion Moment Detection
We now present our approach to weakly-supervised completion moment detection, where only video-level annotations are present. Assume $\{x_1, \cdots, x_T\}$ are the frames in a sequence of length $T$ where an action has been attempted, and $y \in \{0, 1\}$ is the binary video-level label, indicating whether the attempt has been successfully completed or not. Our method takes as input both complete and incomplete sequences of the same action.
To predict the completion moment with weak supervision, we propose a network architecture that contains a convolutional frame-level feature extracting network, followed by two recurrent cells for completion prediction and temporal attention prediction, trained jointly with a cross-entropy loss function. Fig. 2 depicts our architecture, showing the per-frame feature extraction and recurrent nodes (left) along with the training loss (top left). The frame-level predictions are then accumulated (right) to infer the completion moment.
For feature extraction, we train a convolutional network by propagating the video-level label to all frames in a video, and optimise it using the cross-entropy loss,

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \Big( y^i \log p(x_t^i) + (1 - y^i) \log \big(1 - p(x_t^i)\big) \Big)$$

where $N$ is the number of sequences. The loss is optimised over all frames in all sequences, comparing the video-level labels $y^i$ against the classification outputs $p(x_t^i)$, while frame-level features are trained accordingly. These learnt features form a good base for completion moment detection, to be refined by the recurrent cells. This is based on the realistic assumption that, up to the completion moment, complete and incomplete sequences are indistinguishable. After completion, however, there are appearance distinctions between the frames which signify completion.
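As an illustration of this per-frame loss, the following is a minimal plain-Python sketch (function and variable names are our own, not the training code used in the paper):

```python
import math

def frame_level_bce(frame_probs, video_label):
    """Binary cross-entropy where the single video-level label is
    propagated to every frame of the sequence (weak supervision).
    frame_probs: per-frame predicted probability of 'complete'.
    video_label: 1 if the sequence is complete, 0 otherwise."""
    loss = 0.0
    for p in frame_probs:
        loss -= video_label * math.log(p) + (1 - video_label) * math.log(1 - p)
    return loss / len(frame_probs)
```

In practice this loss would be optimised with mini-batch gradient descent over the convolutional network's parameters; the sketch only shows how one label supervises every frame.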
We jointly train two recurrent models, namely LSTMs: one for temporal attention, i.e. to learn the relevance $a_t$ of each frame to completion moment detection, and one to predict temporally-evidenced completion scores $c_t$. The temporal attention network is a standard LSTM, taking the features $x_t$ as input - note that we simplify the notation to $x_t$ as the LSTM is trained and evaluated on one sequence at a time. We compute the attention scores by applying a softmax function to the output nodes of this LSTM, $e_t$, across the temporal dimension, such that

$$a_t = \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)}$$
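The temporal softmax normalisation can be sketched in plain Python as follows (an illustration, not the paper's implementation):

```python
import math

def temporal_softmax(e):
    """Normalise raw attention-LSTM outputs e_1..e_T into attention
    scores a_1..a_T that sum to 1 across the temporal dimension."""
    m = max(e)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in e]
    total = sum(exps)
    return [v / total for v in exps]
```

Frames with larger raw outputs receive proportionally more weight, and the normalisation guarantees the scores form a distribution over the sequence's frames.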
The second LSTM also takes the same input $x_t$, and its output $o_t$ is then combined with the attention $a_t$ to produce completion scores per frame,

$$c_t = a_t\, o_t$$
The scores $c_t$ are the confidence of observing completion at frame $t$. In other words, a frame with a high $c_t$ has observed distinctive signatures for completion, making it more confident that the sequence has been completed, with $(1 - c_t)$ reflecting the confidence for incompletion. We use these frame-level predictions to compute the completion moment, such that

$$\hat{t} = \operatorname*{arg\,max}_{t} \Big( \sum_{j=1}^{t-1} (1 - c_j) + \sum_{j=t}^{T} c_j \Big)$$

The predicted completion moment $\hat{t}$ is the one for which the score for completion from frame $\hat{t}$ onwards, as well as the score for incompletion before frame $\hat{t}$, is maximised.
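Assuming per-frame completion scores in $[0, 1]$, this inference rule can be sketched as an exhaustive search over candidate moments (an illustrative re-implementation; the names and the handling of the no-completion case are our own):

```python
def detect_completion_moment(c):
    """c[t]: per-frame confidence that completion has been observed.
    Returns the frame index maximising evidence for incompletion
    before it and completion at/after it."""
    T = len(c)
    best_t, best_score = 0, float("-inf")
    # t == T (hypothetically) means no completion is observed anywhere
    for t in range(T + 1):
        score = sum(1.0 - cj for cj in c[:t]) + sum(c[t:])
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

For a sequence whose scores jump from low to high, the rule places the moment at the jump; when all scores stay low, the best candidate falls beyond the last frame, i.e. an incomplete attempt.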
During training, only video-level labels are available, and the ground-truth completion moment is unknown. We thus train for sequence-level prediction, such that

$$\hat{y} = \sum_{t=1}^{T} c_t = \sum_{t=1}^{T} a_t\, o_t$$

where $\hat{y}$ indicates whether the sequence has been completed somewhere along its frames. These predictions are optimised against the video-level completion labels $y$, for all sequences. Note that using the attention-weighted scores $c_t$ in the training loss makes the model learn to weight highly the temporal regions which contain discriminative evidence for completion.
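A sketch of this sequence-level accumulation, assuming attention weights that sum to 1 and per-frame completion outputs in $[0, 1]$ (an illustration, not the authors' code):

```python
def sequence_completion_score(outputs, attention):
    """Attention-weighted accumulation of per-frame completion outputs
    into a single sequence-level completion prediction."""
    return sum(a * o for o, a in zip(outputs, attention))
```

Because the attention weights form a distribution over time, the sequence-level score stays in the same range as the per-frame outputs, so it can be compared directly against the binary video-level label.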
While focusing on weakly-supervised completion moment detection, we also evaluate our proposed architecture in a fully supervised setting. We similarly combine completion detection with temporal attention, when supervision for the completion moment is available. We thus train the confidence scores in the same way as the regression-based supervision in [Heidari2018], using the relative distance between the frame $t$ and the ground-truth completion moment $t^*$, allowing the approaches to be directly comparable. The sequence-level loss is then:

$$\mathcal{L} = \sum_{t=1}^{T} a_t \left( r_t - \frac{t^* - t}{T} \right)^2$$

where $r_t$ is the frame-level regression output and $a_t$ the learnt attention.
Using these scores, each frame estimates the completion moment as $t + T\, r_t$, where $r_t$ is the frame's regression output; these estimates are weighted by the learnt attention scores. The sequence-level completion moment is finally predicted as

$$\hat{t} = \sum_{t=1}^{T} a_t \left( t + T\, r_t \right)$$
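Under our reading of this regression setup (frame-level outputs approximating the relative distance to the completion moment; symbol and function names are hypothetical), the attention-weighted prediction can be sketched as:

```python
def weighted_completion_estimate(r, a):
    """r[t]: regression output approximating (t* - t) / T.
    a[t]: attention weights summing to 1 over the sequence.
    Each frame t votes for moment t + T * r[t]; votes are combined
    using the attention weights."""
    T = len(r)
    return sum(a_t * (t + T * r_t) for t, (r_t, a_t) in enumerate(zip(r, a)))
```

When every frame regresses its relative distance perfectly, every vote equals the true moment and the attention weighting leaves the estimate unchanged; attention matters when some frames are unreliable.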
Fig. 3 illustrates the supervised completion detection, where frame-level evidence is accumulated across the sequence during inference.
4 Experimental Results
Dataset and Implementation Details – We evaluate our approach on the 16 actions used in [Heidari2018], the only prior work to attempt completion moment detection, using the publicly available annotations provided by [Heidari2018]. These actions are collected from three public datasets: HMDB [HMDB], UCF101 [UCF101] and RGBD-AC [Heidari2016]. As stated in [Heidari2018], these actions cover sport-based and daily actions for which completion can be defined, and include both complete and incomplete sequences for training. We report results on all 16 actions when supervised. In the weakly supervised setting, however, we require sufficient incomplete sequences per action to be able to train with only video-level weak labels. Of these 16 actions, we therefore evaluate on the 10 actions that have enough of both complete and incomplete sequences; the remaining 6 have fewer than 5% incomplete sequences.
For feature extraction, we used the spatial stream of the VGG-16 architecture, pre-trained on UCF101, and fine-tuned it for 20 epochs to acquire frame-level features; the learning rate was divided by 10 at epochs 3 and 5, and the features were extracted from the output of the layer. Both LSTM cells (attention and completion moment prediction) had a single layer with 128 hidden units. When fully supervised, we first trained the completion prediction LSTM for 10 epochs for stability, then jointly trained both LSTMs for 5 more epochs. When weakly supervised, we randomly initialised both LSTMs and trained them jointly for 10 epochs. In both approaches, the LSTM learning rate was divided by 10 after the first 5 epochs. For temporal prediction, we normalised the sequences to a fixed length, equal to the minimum length of any sequence in that action; note that our method does not depend on the sequence length and is thus robust to any other pre-specified length. Additionally, the attention scores were normalised between zero and one, and those below 0.5 were truncated to 0 during inference.
Evaluation Metrics – As in [Heidari2018], we report the accuracy as the average percentage of frames that are correctly labelled as pre- or post-completion, given the ground-truth completion moment $t^*$ and the predicted completion moment $\hat{t}$, such that

$$\text{Acc} = \frac{1}{N} \sum_{i=1}^{N} \frac{T_i - |\hat{t}_i - t^*_i|}{T_i}$$

We also report the relative distance $RD = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{t}_i - t^*_i|}{T_i}$ between the predicted and ground-truth completion moments, averaged over all sequences.
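For a single sequence, these two metrics reduce to the following (a small sketch of our reading of the definitions; names are our own):

```python
def frame_accuracy(t_pred, t_true, length):
    """Fraction of frames whose pre/post-completion label is correct:
    only the frames between the predicted and true moments are
    mislabelled."""
    return (length - abs(t_pred - t_true)) / length

def relative_distance(t_pred, t_true, length):
    """RD: distance between predicted and ground-truth completion
    moments, relative to the sequence length."""
    return abs(t_pred - t_true) / length
```

Both metrics are then averaged over all test sequences; accuracy rewards, and RD penalises, the same quantity, namely the gap between predicted and true moments.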
Fig. 5: Qualitative results using supervised learning. Top left to bottom right: UCF101-pole vault, RGBD-AC-open, HMDB-throw and UCF101-basketball.
Weakly Supervised Completion Detection – Table 1 shows the results of our proposed method for weakly supervised completion moment detection using uniform attention (WS-U) as well as with learnt temporal attention (WS-Att). In WS-U, we do not learn attention, and use uniform weighting in inference. Learning temporal attention improves results for all actions and both metrics. For actions with a smaller percentage of incomplete sequences, i.e. HMDB-pick, UCF101-basketball and UCF101-soccer penalty, the performance is lower for both metrics, though temporal attention consistently improves the results. In total, i.e. on all sequences from the three datasets, RD drops to 0.31 with WS-Att.
We also present some qualitative results for the weakly supervised approach in Fig. 4, where the first bar depicts the completion scores - with green and blue representing the observed evidence for completion and incompletion, respectively. The attention is shown in red, and the results in orange and purple represent pre- and post-completion labels, respectively. In the first two sequences, the temporal attention significantly improves the results by correctly weighting the frames after completion, where discriminative features are observed. In the third example, from the action blowing candles, while WS-U is misled by the completion scores at the end of the sequence, WS-Att correctly detects no completion. The last sequence shows a failure case for the action pick. We believe this would improve with more incomplete sequences during training.
Supervised Temporal Attention Learning – Table 2 shows the results of the supervised approach, compared to the R-R method in [Heidari2018], which is comparable to ours as it does not use frame-level pre/post-completion classification, but directly predicts the completion moment. We also compare uniform weighting (S-U) to the learnt attention (S-Att). Learning temporal attention outperforms uniform weighting on all 16 actions, and outperforms the baseline on 14 out of the 16 actions. In total, RD drops to 0.12 with S-Att.
We present qualitative results for our supervised method in Fig. 5. The first bar represents the frame-level regression error (darker is lower error). The success examples (left) show two sequences from the actions UCF101-pole vault and RGBD-AC-open. Temporal attention improves the completion moment detection for both the complete (top left) and incomplete (bottom left) sequences, as high attention correctly aligns with regions of small prediction error. The failure examples (right) show two sequences from the actions HMDB-throw and UCF101-basketball, where the attention has not been able to pick out the regions with small error. In the basketball example, the sequence is detected as complete both with and without attention, despite being incomplete.
Frame-level Analysis – We plot the frame-level errors, as well as the attention scores, averaged over all actions, on the three datasets in Fig. 6. The figure shows lower prediction errors (blue), both before and after the completion moment, with two clear minima. The increased confusion around the completion moment comes from the very similar features immediately before it. We also show the learnt temporal attention for both the supervised (red) and weakly supervised (orange) approaches. Generally, higher attention corresponds to lower prediction error - signifying that these frames have a higher impact on the overall completion moment prediction. When weakly supervised, the attention scores are comparable to those under full supervision, though understandably a softer attention is learnt.
5 Conclusion and Future Work
In this paper, we proposed a method to detect the completion moment in a variety of actions, suitable for both weakly-supervised and fully-supervised training. Under weak supervision, only video-level labels of completion or incompletion are required for the same action. When a sufficient number of incomplete sequences is available during training, our approach (1) learns discriminative features for frames pre- and post-completion, by propagating video-level labels to individual frames, (2) learns temporal attention, to weight discriminative frame-level features, and (3) accumulates evidence for completion, weighted by the learnt attention, from all frames to predict the completion moment, or identify the attempt as incomplete. We evaluated our approach on 16 actions (with full supervision) and 10 actions (with weak supervision) from 3 datasets. When weakly supervised, learning attention significantly improved the results on all tested actions. Under full supervision, we outperform prior work [Heidari2018] on 14 of the 16 actions.
For future work, we aim to augment the temporal attention with within-frame spatial attention to learn image regions that are most discriminative for completion. We will also combine our soft attention with hard attention mechanisms, similar to [wang2017untrimmednets]. Further, we will investigate completion moment detection from untrimmed sequences, which contain multiple action instances.
The 1st author wishes to thank the University of Bristol for partial funding of her studies. Public datasets were used in this work.