Action Completion: A Temporal Model for Moment Detection

05/17/2018 · Farnoosh Heidarivincheh et al.

We introduce completion moment detection for actions: the problem of locating the moment of completion, when the action's goal is confidently considered achieved. We propose a joint classification-regression recurrent model that predicts completion from a given frame, and then integrates frame-level contributions to detect the sequence-level completion moment. We introduce a recurrent voting node that predicts the completion moment's position relative to the current frame, by either classification or regression. The method is also capable of detecting incompletion; for example, it can detect a missed ball-catch, as well as the moment at which the ball is safely caught. We test the method on 16 actions from three public datasets, covering sports as well as daily actions. Results show that, when combining contributions from frames prior to the completion moment as well as frames post completion, the completion moment is detected within one second in 89% of the tested sequences.


1 Introduction

An action, based on the Oxford Dictionary, is the fact or process of doing something, typically to achieve an aim. Previous works on action recognition from visual data, such as [Simonyan and Zisserman(2014), Ji et al.(2013)Ji, Xu, Yang, and Yu, Donahue et al.(2015)Donahue, Hendricks, Guadarrama, Rohrbach, Venugopalan, Saenko, and Darrell], have overlooked assessing whether the action’s aim has actually been achieved, rather than merely attempted. The closely related action localisation problem, e.g. in [Yeung et al.(2016)Yeung, Russakovsky, Mori, and Fei-Fei, Gkioxari and Malik(2015), Hoai and De la Torre(2014), Tian et al.(2013)Tian, Sukthankar, and Shah], predicts the temporal start and end of an action’s attempt, without assessing whether the aim has been achieved either. The notion of assessing an action’s completion was introduced in [Heidarivincheh et al.(2016)Heidarivincheh, Mirmehdi, and Damen], with follow-up works [Becattini et al.(2018)Becattini, Uricchio, Ballan, Seidenari, and Del Bimbo, Farha et al.(2018)Farha, Richard, and Gall] that focus on measuring the action’s progress under a linear assumption, or predicting the time till the next action. In this work, we attempt to detect (or locate) the moment in time when the action can indeed be considered completed.

We define the problem of completion moment detection as detecting the frame that separates pre-completion from post-completion per sequence, when present. Note that the completion moment is different from the typical ‘start’/‘end’ frames in action localisation. The former focuses on the action’s goal, while the latter separates the motion relevant to the action from other actions or background frames. For example, in the action ‘drink’, the start of the action for localisation tends to be when a glass is lifted for drinking, and the end is when it is placed down. Conversely, the completion moment we are after is when the person consumes part of the beverage, marking their goal of drinking as achieved. The subtle nature of this completion moment thus requires a framework that is capable of robust moment detection.

Moment detection, including action completion moment detection, has potential applications in robot-human collaboration, health-care and assisted living, where an agent can react to a human completing the goal or, conversely, failing to complete the action. For example, a failure to switch the oven off could trigger a safety alarm.

In detecting the moment of completion, we take a supervised approach, where for training sequences the completion moment is labeled when present (see Sec. 3). Our proposed method uses a Convolutional-Recurrent Neural Network (C-RNN), and outputs per-frame votes for the presence and relative position of the completion moment. We then predict a sequence-level completion moment by accumulating these frame-level contributions. To showcase the generality of our method, we evaluate it on 16 actions from 3 public datasets [Kuehne et al.(2011)Kuehne, Jhuang, Garrote, Poggio, and Serre, Soomro et al.(2012)Soomro, Zamir, and Shah, Heidarivincheh et al.(2016)Heidarivincheh, Mirmehdi, and Damen]. These include sports-based (e.g. basketball, pole vault) as well as daily (e.g. drink, pour) actions. We show that both pre-completion and post-completion frames assist in completion moment detection for the variety of tested actions.

The remainder of this paper is organised as follows: related work in Sec. 2, problem definition in Sec. 3, proposed method in Sec. 4, experiments and results in Sec. 5 and conclusion and future work in Sec. 6.

2 Related Work

Current methods for action recognition focus on deploying convolutional neural networks (CNNs), either dual-stream convolutions [Simonyan and Zisserman(2014), Feichtenhofer et al.(2017)Feichtenhofer, Pinz, and Wildes, Wang et al.(2017)Wang, Long, Wang, and Yu] or 3D convolution filters from video snippets [Ji et al.(2013)Ji, Xu, Yang, and Yu, Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri, Shou et al.(2016)Shou, Wang, and Chang], as well as recurrent neural networks (RNNs) that accumulate evidence from frames over a sequence [Yeung et al.(2018)Yeung, Russakovsky, Jin, Andriluka, Mori, and Li, Donahue et al.(2015)Donahue, Hendricks, Guadarrama, Rohrbach, Venugopalan, Saenko, and Darrell, Yue-Hei Ng et al.(2015)Yue-Hei Ng, Hausknecht, Vijayanarasimhan, Vinyals, Monga, and Toderici]. However, these approaches aim to label the sequence as a whole. One seminal work [Wang et al.(2016)Wang, Farhadi, and Gupta] deviates by encoding the action as precondition and effect, using a Siamese network that predicts the action as a transformation between the two states. In this section, we review related works that study partial observations within a video sequence, for three problems of relevance to our proposed moment detection problem.

Action Proposal Generation: Action proposals and action-ness measures have become the platform for several action localisation approaches [Jain et al.(2014)Jain, Van Gemert, Jégou, Bouthemy, and Snoek, Gkioxari and Malik(2015), Yu and Yuan(2015), Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin, Xiong et al.(2017)Xiong, Zhao, Wang, Lin, and Tang]. Among these, [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin] and [Xiong et al.(2017)Xiong, Zhao, Wang, Lin, and Tang] focus on classifying these proposals into those that contain the ‘completed’ action, and incomplete proposals that should be rejected. While [Xiong et al.(2017)Xiong, Zhao, Wang, Lin, and Tang] applies an SVM to filter and reject spatio-temporal proposals containing incomplete or partial actions, [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin] embeds the rejection within an end-to-end CNN. These approaches classify each proposal, and do not attempt to locate or assess the completion moment.

Action Anticipation: A few recent works [Mahmud et al.(2017)Mahmud, Hasan, and Roy-Chowdhury, Farha et al.(2018)Farha, Richard, and Gall] focus on predicting the class label of the next, unobserved action. Mahmud et al. [Mahmud et al.(2017)Mahmud, Hasan, and Roy-Chowdhury] predict the next action as well as its starting time, using a hybrid Siamese network in which an LSTM is used for temporal modelling. Farha et al. [Farha et al.(2018)Farha, Richard, and Gall] estimate the time remaining until the next action, as well as the length and the label of the next action. The paper compares using either an RNN or a CNN that takes concatenated frame-level features as a single input tensor. These approaches do not discuss completion (or incompletion) of the observed action.

Early Detection: Several works [Hoai and De la Torre(2014), Aliakbarian et al.(2017)Aliakbarian, Saleh, Salzmann, Fernando, Petersson, and Andersson, Ma et al.(2016)Ma, Sigal, and Sclaroff, Becattini et al.(2018)Becattini, Uricchio, Ballan, Seidenari, and Del Bimbo, Yeung et al.(2016)Yeung, Russakovsky, Mori, and Fei-Fei, Li et al.(2016)Li, Lan, Xing, Zeng, Yuan, and Liu] address early detection of partially observed actions, from as few frames as possible. These mainly propose loss functions to encourage early detection [Aliakbarian et al.(2017)Aliakbarian, Saleh, Salzmann, Fernando, Petersson, and Andersson, Ma et al.(2016)Ma, Sigal, and Sclaroff], but a few works attempt fine-grained understanding of the action’s progression. In [Hoai and De la Torre(2014)], an SVM classifier is trained to accumulate scores from partial observations of the action, where the score is highest when the action is fully observed. The approach has been tested on facial and gesture datasets. Similarly, in [Becattini et al.(2018)Becattini, Uricchio, Ballan, Seidenari, and Del Bimbo], an RNN is trained to predict the action label, as well as its linear progress towards its conclusion as a percentage (e.g. 50% of the action has taken place).

Two approaches [Yeung et al.(2016)Yeung, Russakovsky, Mori, and Fei-Fei, Li et al.(2016)Li, Lan, Xing, Zeng, Yuan, and Liu] which detect moments within the sequence have been proposed, albeit for early detection and localisation. In [Yeung et al.(2016)Yeung, Russakovsky, Mori, and Fei-Fei] individual frames predict the location of the next frame to be observed, using an RNN. The work aims for action detection with as few frames as possible, thus the trained model proposes transitions within the sequence, by predicting the relative position of the frame to be observed next. Our work is inspired by ideas in [Li et al.(2016)Li, Lan, Xing, Zeng, Yuan, and Liu], where action detection uses a joint classification-regression RNN. The classification branch predicts the ongoing action label which is then used by a regression branch to predict the start and the end points of the action, relative to the current frame. A Gaussian scoring function is used to encode the prediction uncertainty. The approach was tested on 3D skeletal data for localisation, oblivious to the action’s completion (or incompletion).

None of the works mentioned above consider whether the action actually achieves its aim. In this work, we build on our previous work that introduced the action completion problem [Heidarivincheh et al.(2016)Heidarivincheh, Mirmehdi, and Damen] by classifying whole sequences into complete and incomplete, and take inspiration from [Li et al.(2016)Li, Lan, Xing, Zeng, Yuan, and Liu] to propose a joint classification-regression architecture. As opposed to predicting the next or the ongoing action, we detect the completion moment by accumulating evidence from frame-level decisions. We further define the completion moment detection problem in the next section.

Figure 1: Annotation of completion moment: Two examples per action. Pre-completion frames are bordered in orange and post-completion in purple. From Top: HMDB pick, UCF101 blowing candles, RGBD-AC switch (one complete sequence and one incomplete).

3 Action Completion - A Moment in Time

We first present our proposal for formulating the problem of localising an action’s completion as detecting a moment in time, beyond which the action’s goal is believed to be completed by a human observer. We make three reasonable assumptions:


  • Momentary Completion: We aim to detect a single frame in the sequence - that is the first frame where a human observer would be sufficiently confident that the goal has been achieved. We refer to frames prior to the completion moment as pre-completion frames, and those from the completion moment onwards as post-completion frames.

  • Temporally Segmented Sequences: We assume that the action is attempted during each sequence, in train or test, at least once, but not necessarily completed. We aim to detect the first completion moment per sequence, if at all, or label the attempt as being incomplete.

  • Consistent Labeling: For each action, we assume annotators are given a non-ambiguous definition of the completion moment, so all train and test sequences are labeled consistently. For example, in the action ‘blowing candle’, the consistent label for the completion moment should indicate the moment when the flames of all candles go out. Note that the proposed model is independent of the definition of the completion moment per action. It only assumes the moment is consistently labeled across sequences.

Figure 1 shows sample sequences, labeled with completion moments, for three actions from the various datasets we annotate and use: (i) pick from HMDB, where the completion moment is when the object is lifted off the surface; (ii) blowing candles from UCF101, where the completion moment is when all the candles are blown out; and (iii) switch from RGBD-AC, where the completion moment is when the room’s illumination changes.

Labeled sequences for a given action are the input to our method, presented next. For each sequence $s$, one completion moment is labeled if present, which we refer to as $t^*$, such that $1 \le t^* \le T$, where $T$ is the sequence length; otherwise the sequence is labeled as incomplete.
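To make this representation concrete, a minimal sketch of how a labeled sequence could be stored follows; the field names and file paths are illustrative assumptions, not the format of the released annotations.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LabeledSequence:
    """One temporally segmented attempt of a single action."""
    frames: List[str]                          # paths to the T RGB frames
    completion_moment: Optional[int] = None    # 1-indexed t*, or None if incomplete

    @property
    def length(self) -> int:
        return len(self.frames)

    @property
    def is_complete(self) -> bool:
        return self.completion_moment is not None

# e.g. a hypothetical complete 'pick' sequence, with the object leaving the surface at frame 42
seq = LabeledSequence(frames=[f"pick_001/{i:06d}.jpg" for i in range(1, 91)],
                      completion_moment=42)
```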

4 Temporal Model for Moment Detection

To detect the completion moment within a sequence, one could naively attempt to train a classifier that singularly separates the frame indicating the completion moment from the rest of the video. However, evidence for the completion moment can be collected from all (or any) frames in the sequence. Take for example the action ‘pick’; the pose of the person is likely to change and evolve as they approach the object to be picked, and similarly observing the object in hand as the hand retracts gives further support for completion. We propose a temporal model that learns local (i.e. frame-level) predictions, within a recurrent neural network, towards global (i.e. sequence-level) detection, trainable end-to-end.

Our proposed temporal model is a Convolutional-Recurrent Neural Network. We describe the frame-level voting nodes in Sec. 4.1, and then show how the unfolded temporal model, over a sequence, can accumulate votes towards moment detection in Sec. 4.2.

4.1 Frame-level Voting Recurrent Node

Each frame in the sequence, whether prior to the completion moment, or post completion, could contribute to the completion moment detection. We refer to this contribution as ‘voting’, i.e. a frame can vote for when the action will be (or has been) completed. Two ways are proposed in which such voting can take place:


  1. Classification Voting: At each time step $t$, the sequence is split into two parts: $[1, t]$ and $[t+1, T]$, where $T$ is the sequence length. The classification vote primarily distinguishes the split within which the completion moment resides.

  2. Regression Voting: At each time step $t$, the relative position of the completion moment with respect to $t$ is predicted, normalised by the sequence length to allow for sequences of various lengths (both supervision targets are illustrated in the sketch below).
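As a concrete illustration of the two per-frame targets under the notation above, a short sketch follows; the 1-indexed convention and the assignment of the boundary frame to the post-completion class are our assumptions for illustration.

```python
def frame_targets(t: int, t_star: int, T: int):
    """Per-frame supervision for the two voting heads (illustrative sketch).

    t      : current time step (1-indexed)
    t_star : labeled completion moment
    T      : sequence length
    """
    # classification target: pre-completion (0) vs post-completion (1)
    y_t = 1 if t >= t_star else 0
    # regression target: signed relative position, negative before completion
    rho_t = (t - t_star) / T
    return y_t, rho_t

# e.g. frame 30 of a 90-frame sequence completed at frame 42
print(frame_targets(30, 42, 90))   # (0, -0.133...)
```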

Figure 2 shows the architecture of our proposed frame-level voting recurrent node, which can be used to predict both the classification and regression votes defined above, trained using a joint classification-regression loss function. Each input frame is passed through convolutional, pooling and fully connected layers. Then, an LSTM layer combines past information with the current observation. The LSTM output is trained to perform frame-level classification as well as frame-level regression as follows:

Figure 2: The input image passes through convolutional, pooling and fully-connected layers, and then an LSTM cell to capture temporal dependencies from the past. The node outputs classification and regression votes for the completion moment.

Frame-Level Classification Voting ($V_t^c$): To decide whether the completion moment is before or after the current time step $t$, we primarily need to predict whether the current observation is pre- or post-completion. We thus train for $V_t^c$ by classifying the current observation, using a Sigmoid cross-entropy loss function on top of the LSTM hidden layer $h_t$, such that

$c_t = \sigma(W_c h_t + b_c)$   (1)

$\mathcal{L}_c(t) = -\big(y_t \log c_t + (1 - y_t) \log(1 - c_t)\big)$   (2)

where $W_c$ and $b_c$ are the weights and biases for classification, respectively, $\sigma$ is the Sigmoid activation function and $y_t$ is the supervised pre-/post-completion label. The pre- and post-completion class labels are assigned to all frames $t < t^*$ and $t \ge t^*$, respectively; the sequence subscript is dropped for simplicity.

The classifier then allows frame $t$ to vote for the presence of the completion moment in one of two splits of the sequence, namely $[1, t]$ or $[t+1, T]$. Specifically, if the observation at time $t$ is classified as being pre-completion, then the completion moment is believed to be within $[t+1, T]$, or the action could be incomplete. To account for incompletion, we extend the end of the second split to $T+1$, to allow votes to be cast for an incomplete sequence, so the second split becomes $[t+1, T+1]$. Otherwise, the completion moment is believed to be within $[1, t]$. The classification vote contributes equally to voting within the split. We define $V_t^c$ as a one-dimensional vector of length $T+1$, representing the vote assigned to all frames in the sequence. For each frame $i$, the vote cast by the current frame $t$, $V_t^c(i)$, is

$V_t^c(i) = \begin{cases} 1 & \text{if } i \text{ lies in the split predicted to contain the completion moment} \\ 0 & \text{otherwise} \end{cases}$   (3)

The frame-level classification votes are then accumulated (see Sec. 4.2).
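A minimal sketch of how a single frame could cast its classification vote over the $(T+1)$-length vote vector is given below; treating the equal vote within the predicted split as a simple indicator is our assumption, since the exact weighting is not spelled out above.

```python
import numpy as np

def classification_vote(t: int, pred_post: bool, T: int) -> np.ndarray:
    """Cast V^c_t over positions 1..T+1, where position T+1 stands for 'incomplete'."""
    votes = np.zeros(T + 1)
    if pred_post:
        votes[:t] = 1.0      # completion moment believed within [1, t]
    else:
        votes[t:] = 1.0      # believed within [t+1, T+1] (possibly incomplete)
    return votes
```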

Frame-Level Regression Voting ($V_t^r$): While $V_t^c$ assigns an equal vote to all frames within each of the splits of the sequence defined by $t$, regression voting provides stronger evidence that can localise the completion moment, by predicting its relative position to $t$. This relative position encapsulates the remaining time to, or elapsed time from, the completion moment. We compute the relative time $\rho_t$ as that between $t$ and the completion moment $t^*$, normalised by the sequence length $T$, i.e. $\rho_t = (t - t^*)/T$. This provides a more robust relative temporal position than the alternative $t - t^*$, which would differ with the length of the sequence. Note that this value is negative during pre-completion, that is when $t < t^*$.

To train for frame-level regression, the hidden output $h_t$ in the voting recurrent node learns to predict the relative time, using a Euclidean loss function, to obtain

$r_t = W_r h_t + b_r$   (4)

$\mathcal{L}_r(t) = \lVert r_t - \rho_t \rVert^2$   (5)

where $W_r$ and $b_r$ are the weights and biases for regression, respectively. $r_t$ can then be used to predict the completion moment at the corresponding time as $\hat{t}^*_t = t - r_t T$.

Similar to classification voting, we define $V_t^r$ as a one-dimensional voting vector, and use a Gaussian with uncertainty $\sigma_g$ around the predicted completion moment $\hat{t}^*_t$, such that

$V_t^r(i) = \alpha \exp\!\left(-\frac{(i - \hat{t}^*_t)^2}{2\sigma_g^2}\right)$   (6)

where $\alpha$ represents the inverse of the selected area under the curve of the Gaussian. Experimentally, we only compute the regression vote within a window of size $w$ around $\hat{t}^*_t$, in order to reduce the complexity of calculating the vote for all time steps in the sequence.
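A sketch of the regression vote is given below. The window size, the value of the Gaussian uncertainty and the normalisation to unit total vote are illustrative assumptions standing in for the unspecified constants $\sigma_g$, $\alpha$ and $w$.

```python
import numpy as np

def regression_vote(t: int, r_t: float, T: int,
                    sigma: float = 3.0, window: int = 30) -> np.ndarray:
    """Cast V^r_t: a Gaussian centred on the predicted completion moment."""
    votes = np.zeros(T + 1)
    t_hat = int(round(t - r_t * T))                 # predicted completion moment
    lo = max(0, t_hat - 1 - window // 2)            # only vote within a window
    hi = min(T + 1, t_hat - 1 + window // 2 + 1)
    if lo >= hi:                                    # prediction falls outside the sequence
        return votes
    idx = np.arange(lo, hi)
    gauss = np.exp(-((idx + 1 - t_hat) ** 2) / (2 * sigma ** 2))
    votes[idx] = gauss / gauss.sum()                # scale so each frame casts a unit vote
    return votes
```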

Training Loss: As a forward recurrent neural network, we can then train all parameters using a combined loss over all sequences and their frames, specified as

$\mathcal{L} = \sum_{s} \frac{1}{T_s} \sum_{t=1}^{T_s} \big(\mathcal{L}_c(t) + \mathcal{L}_r(t)\big)$   (7)

where $T_s$ is the length of sequence $s$, allowing all sequences to contribute equally to the loss function regardless of the sequence length. The loss is propagated back through the recurrent voting nodes.
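A compact PyTorch-style sketch of the voting recurrent node and the joint loss is given below. The 4096-dimensional features and 128 hidden units follow Sec. 5; everything else (names, the optional loss weighting, operating on pre-extracted CNN features) is an illustrative assumption, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VotingNode(nn.Module):
    """LSTM over per-frame CNN features with classification and regression heads."""
    def __init__(self, feat_dim: int = 4096, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.cls_head = nn.Linear(hidden, 1)   # pre/post-completion logit
        self.reg_head = nn.Linear(hidden, 1)   # relative position of completion

    def forward(self, feats):                  # feats: (1, T, feat_dim), one sequence
        h, _ = self.lstm(feats)
        return self.cls_head(h).squeeze(-1), self.reg_head(h).squeeze(-1)

def joint_loss(cls_logits, reg_preds, y, rho, reg_weight: float = 1.0):
    """Combined classification-regression loss, averaged over the T frames
    so that every sequence contributes equally regardless of its length."""
    l_cls = nn.functional.binary_cross_entropy_with_logits(cls_logits, y)
    l_reg = nn.functional.mse_loss(reg_preds, rho)
    return l_cls + reg_weight * l_reg
```

Training would then sum this per-sequence loss over the training set (a mini-batch of one sequence, as in Sec. 5) and back-propagate through the recurrent node.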

4.2 Sequence-level prediction of completion moment

The votes by individual frames are accumulated to make sequence-level predictions of the completion moment. Note that we do not propagate ambiguity in the decisions of the individual frames, and assume each frame is equally certain about its votes. Other approaches that could integrate frame voting uncertainty, or learn temporal attention, are left for future investigation. We focus on assessing the robustness of using the classification vs the regression votes as follows: (i) Classification_pre-Classification_post (C-C): all frames use classification-based voting; (ii) Regression_pre-Regression_post (R-R): all frames use regression-based voting; (iii) Regression_pre-Classification_post (R-C): frames classified as pre-completion use regression-based voting, while post-completion frames use classification-based voting; and correspondingly (iv) Classification_pre-Regression_post (C-R). Symbolically,

$V_{C\text{-}C}(i) = \sum_{t} V_t^c(i)$   (8)

$V_{R\text{-}R}(i) = \sum_{t} V_t^r(i)$   (9)

$V_{R\text{-}C}(i) = \sum_{t \in \text{pre}} V_t^r(i) + \sum_{t \in \text{post}} V_t^c(i)$   (10)

$V_{C\text{-}R}(i) = \sum_{t \in \text{pre}} V_t^c(i) + \sum_{t \in \text{post}} V_t^r(i)$   (11)

where pre and post denote the sets of frames classified as pre- and post-completion, respectively.

Fig. 3 illustrates the various approaches to frame-based votes. The predicted sequence-level completion moment is then the frame with the maximum accumulated vote,

$\hat{t}^* = \arg\max_{i} V(i)$   (12)

where $V$ is the accumulated vote vector of the chosen scheme.

Figure 3: Sequence-level completion detection by accumulating frames’ votes. The schemes use classification and/or regression voting. Sample sequence from the action basketball.
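A sketch of the sequence-level accumulation for the C-R scheme and the argmax of Eq. (12), reusing the hypothetical classification_vote and regression_vote helpers sketched in Sec. 4.1:

```python
import numpy as np

def detect_completion_cr(cls_probs, reg_preds, T, thresh: float = 0.5):
    """Accumulate C-R votes and return the predicted completion moment.

    cls_probs : per-frame probability of being post-completion, length T
    reg_preds : per-frame relative-position predictions r_t, length T
    Returns a 1-indexed moment in [1, T+1]; T+1 denotes 'incomplete'.
    """
    votes = np.zeros(T + 1)
    for t in range(1, T + 1):
        if cls_probs[t - 1] >= thresh:
            # post-completion frames cast their regression (Gaussian) vote
            votes += regression_vote(t, reg_preds[t - 1], T)
        else:
            # pre-completion frames cast their classification (split) vote
            votes += classification_vote(t, pred_post=False, T=T)
    return int(np.argmax(votes)) + 1
```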

5 Experiments and Results

Dataset and Completion Annotation: To show the generality of our work, we select 16 actions from 3 public datasets and annotate them for their completion moments. We avoid actions for which completion would be difficult to define, or for which it simply marks the end of the action, e.g. run, play piano, laugh. Instead, we select actions that cover both sports-based and daily actions. For each sequence, we provide an annotation of the first completion moment, by a single annotator (annotations available at: https://github.com/FarnooshHeidari/CompletionDetection).

HMDB [Kuehne et al.(2011)Kuehne, Jhuang, Garrote, Poggio, and Serre]: We annotate all sequences of 5 actions: catch, drink, pick, pour and throw. In total, these are 494 sequences, of which 93.5% are complete, i.e. the action’s goal is successfully achieved. While HMDB does not aim for completion detection, a few sequences include attempts that are unsuccessful.

UCF101 [Soomro et al.(2012)Soomro, Zamir, and Shah]: We annotate all sequences of 5 actions: basketball, blowing candles, frisbee catch, pole vault and soccer penalty. These are 650 sequences, of which 80.5% are complete.

RGBD-AC [Heidarivincheh et al.(2016)Heidarivincheh, Mirmehdi, and Damen]: We use the RGB input of our previously introduced dataset [Heidarivincheh et al.(2016)Heidarivincheh, Mirmehdi, and Damen], and annotate all 414 sequences, which include 6 actions: switch, plug, open, pull, pick and drink, of which 50.5% are complete. In this dataset, subjects are disrupted from completing the action, e.g. a drawer they attempt to open is locked.

We apply ‘leave-one-person-out’ to evaluate the RGBD-AC dataset, while for HMDB and UCF101, the provided train and test splits are used.
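For reference, leave-one-person-out splits can be generated along the following lines; the per-sequence 'subject' field is an illustrative assumption about how the data is indexed.

```python
from collections import defaultdict

def leave_one_person_out(sequences):
    """Yield (train, test) splits, holding out each subject's sequences once.

    Assumes each sequence record is a dict carrying a 'subject' id (illustrative)."""
    by_subject = defaultdict(list)
    for seq in sequences:
        by_subject[seq["subject"]].append(seq)
    for held_out in by_subject:
        test = by_subject[held_out]
        train = [s for subj, seqs in by_subject.items()
                 if subj != held_out for s in seqs]
        yield train, test
```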

Implementation Details: For the convolutional and pooling layers, we use the spatial-stream CNN from [Simonyan and Zisserman(2014)], which follows the VGG-16 architecture [Simonyan and Zisserman(2015)] and is pre-trained on UCF101. This CNN is then fine-tuned per action, using the two classes of pre- and post-completion frames. For fine-tuning, 20 epochs are performed, and the initial learning rate is divided by 10 at epochs 3 and 5. All other hyper-parameters are set as in [Feichtenhofer et al.(2016)Feichtenhofer, Pinz, and Zisserman].

The 4096-dimensional fully-connected feature vector forms the input to a single LSTM layer with 128 hidden units. Initialisation is random, and the LSTM is trained for 10 epochs; the learning rate is changed once after the first 5 epochs and then kept fixed for the remaining epochs. We use a mini-batch size of one sequence, and the remaining hyper-parameters are set to 0.1, 0.5 and 30. While the proposed method represents an end-to-end trainable model, in the presented results we train the CNN first and feed its features into the LSTM. Efficient end-to-end training of the proposed temporal model is challenging on the available hardware, and is left for future work.

                    Accuracy                                        RD
        Pre-V  V_T^r  C-C   R-R   R-C   C-R      Pre-V  V_T^r  C-C   R-R   R-C   C-R

HMDB

catch 77.3 79.1 75.9 80.5 76.7 82.3 0.23 0.21 0.24 0.20 0.23 0.18
drink 77.3 69.3 73.2 78.0 75.9 80.5 0.21 0.31 0.27 0.22 0.24 0.19
pick 80.6 79.5 79.7 79.9 74.7 84.2 0.22 0.20 0.20 0.20 0.25 0.16
pour 76.5 68.3 71.1 80.0 78.7 81.2 0.23 0.32 0.29 0.20 0.21 0.19
throw 68.7 74.3 63.4 74.6 65.8 80.4 0.32 0.26 0.37 0.25 0.34 0.20

UCF101

basketball 86.5 78.0 84.5 79.5 79.1 85.1 0.21 0.22 0.16 0.20 0.21 0.15
blowing candles 86.8 88.3 86.4 84.2 78.2 90.9 0.16 0.12 0.14 0.16 0.22 0.09
frisbee catch 81.7 84.1 80.3 78.3 74.6 85.9 0.24 0.16 0.20 0.22 0.25 0.14
pole vault 85.0 83.3 82.6 88.4 80.1 90.6 0.19 0.17 0.17 0.12 0.20 0.09
soccer penalty 85.5 86.6 85.8 87.1 85.6 88.5 0.15 0.13 0.14 0.13 0.14 0.11

RGBD-AC

switch 99.9 93.9 99.9 98.1 92.7 98.9 0.00 0.06 0.00 0.02 0.07 0.01
plug 98.3 93.2 98.5 96.1 93.0 97.2 0.02 0.07 0.01 0.04 0.07 0.03
open 91.1 86.1 91.1 86.7 80.4 89.9 0.12 0.14 0.09 0.13 0.20 0.10
pull 97.7 89.1 97.8 94.1 91.5 97.0 0.10 0.11 0.02 0.06 0.08 0.03
pick 91.5 89.1 89.9 93.2 83.6 95.0 0.11 0.11 0.10 0.07 0.16 0.05
drink 88.6 79.0 85.3 90.9 85.8 92.1 0.11 0.21 0.15 0.09 0.14 0.08
complete 82.3 78.1 79.6 83.1 77.7 85.6 0.19 0.22 0.20 0.17 0.22 0.14
incomplete 93.4 94.8 94.3 90.4 88.8 96.1 0.13 0.05 0.06 0.10 0.11 0.04
total 85.0 82.2 83.2 84.9 80.4 88.1 0.17 0.18 0.17 0.15 0.20 0.12
Table 1: Results on all 16 actions, comparing frame-level classification (Pre-V), last-frame regression ($V_T^r$) and the four sequence-level voting schemes. Left: accuracy (%); right: relative distance error (RD).

Evaluation Metrics: We assess the proposed model using two evaluation metrics. (i) Accuracy: for every sequence, we compute the ratio of frames that are consistently labeled as pre- or post-completion, given the predicted and labeled moments,

$\text{Accuracy} = \frac{1}{S} \sum_{s=1}^{S} \frac{\big|\{\, t \le T_s : (t \ge \hat{t}^*_s) = (t \ge t^*_s) \,\}\big|}{T_s}$   (13)

where $S$ is the number of sequences, and $t^*_s$ and $\hat{t}^*_s$ are the labeled and predicted completion moments of sequence $s$. (ii) The average relative distance error (RD) in predicting the completion moment,

$\text{RD} = \frac{1}{S} \sum_{s=1}^{S} \frac{|\hat{t}^*_s - t^*_s|}{T_s}$   (14)
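A sketch of both metrics as reconstructed above; treating an 'incomplete' prediction or label as position $T+1$ is our assumption.

```python
import numpy as np

def accuracy_and_rd(pred, gt, lengths):
    """Frame-level consistency accuracy and relative distance (RD) error.

    pred, gt : 1-indexed completion moments per sequence (T+1 if incomplete)
    lengths  : sequence lengths T
    """
    accs, rds = [], []
    for p, g, T in zip(pred, gt, lengths):
        frames = np.arange(1, T + 1)
        consistent = (frames >= p) == (frames >= g)   # same pre/post labelling
        accs.append(consistent.mean())
        rds.append(abs(p - g) / T)
    return float(np.mean(accs)), float(np.mean(rds))
```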
Figure 4: Sample results for four sequences: soccer penalty, pick, pole vault and pour.

Results: In Table 1, C-R voting outperforms R-R and R-C for all actions of the three datasets. It also outperforms C-C for most actions. This outcome shows that pre-completion frames, while confident that the completion moment lies later in the sequence, are unable to robustly predict the remaining time to completion. In contrast, a post-completion frame, which has indeed observed the completion moment, can make a more reliable prediction of its relative position via regression. This also explains the poor results of method R-C, in which only pre-completion frames use regression-based voting.

In Table 1, we also show two baselines. (i) Pre-Voting (Pre-V): the classification output of the LSTM hidden layer is used on its own, without voting. This frame-level result can fluctuate, as shown in Fig. 4; in this case, we use the first predicted post-completion frame as the completion moment. For the HMDB and UCF101 datasets, the proposed method outperforms this frame-level classification, while for RGBD-AC the two perform comparably. This is because the RGBD-AC dataset is captured in one environment with a single viewpoint, and thus the frame-level classifications tend to generalise easily to new sequences. Note that for the action basketball from UCF101, while Pre-V scores highly on the accuracy metric, its error is higher than that of our proposed method. (ii) Last-frame regression ($V_T^r$): we only use the regression vote of the last frame. As a forward RNN is used, one might question whether the accumulated result at the end of the sequence is sufficient; we show that this result is less robust than accumulating votes from all frames. Table 1 also summarises the results of complete and incomplete test sequences separately. Further action-specific results for complete and incomplete sequences are included in the Appendix.

Four qualitative examples are presented in Fig. 4. (1) For soccer penalty, only C-R exactly matches the ground truth, with Pre-V and C-C giving comparable results; using regression voting for pre-completion frames negatively affects the completion moment detection. (2) An incomplete pick is correctly recognised by both the C-R and C-C voting methods. (3) For pole vault, fluctuating frame-level classifications are shown; C-R provides the closest estimate of the completion moment. (4) For pour, the completion moment, when the liquid is poured, is predicted 5 frames early when using C-R voting, compared to 10 frames when R-R is used (a video of results is available at: https://youtu.be/Hrxehk3Sutc).

We also present the cumulative percentage of sequences for which the completion moment is detected within a given threshold in Fig. 5. We define that threshold as the absolute difference, in frames, between the predicted and ground-truth completion moments. Results are shown for the C-R method. We correctly detect the completion moment within 1 second (30 frames) in 89% of all test sequences, and within 0.5 seconds (15 frames) in 74% of sequences. Moreover, the completion moment is detected at the very same frame as the ground truth (i.e. a difference of 0 frames) for 30.4% of sequences. Curves are plotted for each of the 16 actions as well as over all actions.
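The cumulative curve of Fig. 5 could be computed along these lines (a sketch; the frame rate and maximum threshold are illustrative):

```python
import numpy as np

def cumulative_within(pred, gt, max_offset: int = 60):
    """Fraction of sequences whose |predicted - ground-truth| completion moment
    difference is within each threshold of 0..max_offset frames."""
    diffs = np.abs(np.asarray(pred) - np.asarray(gt))
    return [(diffs <= k).mean() for k in range(max_offset + 1)]

# e.g. at 30 fps, curve[30] is the fraction of sequences detected within one second
```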

Figure 5: Cumulative percentage of sequences for which the completion moment is detected within the acceptance threshold (in frames) shown on the x-axis. Results are shown for each of the 16 actions as well as across all actions.

6 Conclusion and Future Work

This paper presents action completion moment detection as the task of localising the moment in time when a human observer believes an action’s goal has been achieved. The approach goes beyond recognition of completion towards a fine-grained perception of completion. We use a supervised approach for detecting completion per action, and propose an end-to-end trainable recurrent model. We show that individual frames can contribute to predicting a sequence-level completion moment via voting, and propose four methods to accumulate frame-level votes. Results show that using classification voting for pre-completion frames and regression voting for post-completion frames achieves the best overall result.

We foresee the proposed temporal model as a powerful learning method for moment detection in actions, for and beyond action completion. We aim to pursue two directions for future work. First, we plan to extend our work to untrimmed videos and propose temporal models able to detect multiple completion moments. Second, we shall explore weakly-supervised approaches to completion moment detection.

Appendix A

Accuracy
No.   Pre-V  V_T^r  C-C   R-R   R-C   C-R

HMDB

catch complete 99 77.3 79.1 75.9 80.5 76.7 82.3
incomplete 0 - - - - - -
total 99 77.3 79.1 75.9 80.5 76.7 82.3

drink complete 96 76.6 68.5 72.0 77.3 75.3 80.0
incomplete 4 92.4 87.4 99.5 94.0 88.3 91.7
total 100 77.3 69.3 73.2 78.0 75.9 80.5

pick complete 76 79.4 75.4 78.2 79.4 77.5 82.6
incomplete 22 84.4 92.0 84.1 81.5 66.4 88.9
total 98 80.6 79.5 79.7 79.9 74.7 84.2

pour complete 98 77.3 68.5 71.9 80.7 79.5 81.9
incomplete 1 4.5 50.5 2.7 17.1 9.0 22.5
total 99 76.5 68.3 71.1 80.0 78.7 81.2

throw complete 95 67.2 73.3 61.7 74.1 64.9 79.5
incomplete 3 100.0 95.6 100.0 84.6 86.0 100.0
total 98 68.7 74.3 63.4 74.6 65.8 80.4



UCF101

basketball complete 102 84.7 73.1 80.3 79.6 78.2 81.1
incomplete 32 92.3 93.9 97.7 79.2 82.0 97.8
total 134 86.5 78.0 84.5 79.5 79.1 85.1

blowing candles complete 59 80.3 80.7 78.5 78.4 67.9 84.1
incomplete 50 94.2 96.8 95.3 90.6 89.7 98.5
total 109 86.8 88.3 86.4 84.2 78.2 90.9

frisbee catch complete 125 81.7 84.1 80.3 78.3 74.6 85.9
incomplete 0 - - - - - -
total 125 81.7 84.1 80.3 78.3 74.6 85.9

pole vault complete 142 85.0 83.3 82.4 88.5 79.8 90.6
incomplete 3 87.4 81.8 88.5 84.0 92.1 90.9
total 145 85.0 83.3 82.6 88.4 80.1 90.6

soccer penalty complete 95 85.3 83.5 84.4 86.8 83.6 86.9
incomplete 42 86.0 93.5 88.8 87.7 90.0 92.1
total 137 85.5 86.6 85.8 87.1 85.6 88.5



RGBD-AC

switch complete 35 99.8 88.7 99.8 96.3 86.0 98.0
incomplete 32 100 99.7 100.0 100.0 100.0 100.0
total 67 99.9 93.9 99.9 98.1 92.7 98.9

plug complete 37 96.8 90.0 97.1 92.8 86.3 94.4
incomplete 36 99.8 96.4 100.0 99.4 100.0 100.0
total 73 98.3 93.2 98.5 96.1 93.0 97.2

open complete 36 84.6 75.3 83.1 86.9 80.3 80.9
incomplete 32 98.3 98.4 100.0 86.4 80.5 100.0
total 68 91.1 86.1 91.1 86.7 80.4 89.9

pull complete 34 96.4 85.2 95.4 95.4 85.9 95.9
incomplete 37 98.9 92.6 100.0 92.8 96.7 98.1
total 71 97.7 89.1 97.8 94.1 91.5 97.0

pick complete 33 92.4 83.3 90.9 93.0 76.3 95.4
incomplete 36 90.7 94.3 89.0 93.4 90.2 94.5
total 69 91.5 89.1 89.9 93.2 83.6 95.0

drink complete 34 89.3 66.3 83.1 92.7 87.9 92.8
incomplete 32 87.9 92.5 87.6 89.0 83.5 91.3
total 66 88.6 79.0 85.3 90.9 85.8 92.1

complete 1196 82.3 78.1 79.6 83.1 77.7 85.6
incomplete 362 93.4 94.8 94.3 90.4 88.8 96.1
total 1558 85.0 82.2 83.2 84.9 80.4 88.1

Table 2: Accuracy (%) on the 16 actions, reported separately for complete and incomplete sequences, comparing frame-level classification (Pre-V), last-frame regression ($V_T^r$) and the four sequence-level voting schemes.
RD
Pre-V  V_T^r  C-C   R-R   R-C   C-R

HMDB

catch complete 0.23 0.21 0.24 0.20 0.23 0.18
incomplete - - - - - -
total 0.23 0.21 0.24 0.20 0.23 0.18

drink complete 0.21 0.32 0.28 0.23 0.25 0.20
incomplete 0.38 0.13 0.00 0.06 0.12 0.08
total 0.21 0.31 0.27 0.22 0.24 0.19

pick complete 0.20 0.25 0.22 0.21 0.23 0.17
incomplete 0.29 0.08 0.16 0.18 0.34 0.11
total 0.22 0.20 0.20 0.20 0.25 0.16

pour complete 0.22 0.31 0.28 0.19 0.20 0.18
incomplete 0.97 0.50 0.97 0.83 0.91 0.77
total 0.23 0.32 0.29 0.20 0.21 0.19

throw complete 0.33 0.27 0.38 0.26 0.35 0.21
incomplete 0.00 0.04 0.00 0.15 0.14 0.00
total 0.32 0.26 0.37 0.25 0.34 0.20



UCF101

basketball complete 0.19 0.27 0.20 0.20 0.22 0.19
incomplete 0.27 0.06 0.02 0.21 0.18 0.02
total 0.21 0.22 0.16 0.20 0.21 0.15

blowing candles complete 0.20 0.19 0.22 0.22 0.32 0.16
incomplete 0.11 0.03 0.05 0.09 0.10 0.02
total 0.16 0.12 0.14 0.16 0.22 0.09

frisbee catch complete 0.24 0.16 0.20 0.22 0.25 0.14
incomplete - - - - - -
total 0.24 0.16 0.20 0.22 0.25 0.14

pole vault complete 0.19 0.17 0.18 0.12 0.20 0.09
incomplete 0.18 0.18 0.11 0.16 0.08 0.09
total 0.19 0.17 0.17 0.12 0.20 0.09

soccer penalty complete 0.15 0.17 0.16 0.13 0.16 0.13
incomplete 0.16 0.06 0.11 0.12 0.10 0.08
total 0.15 0.13 0.14 0.13 0.14 0.11


RGBD-AC

switch complete 0.00 0.11 0.00 0.04 0.14 0.02
incomplete 0.00 0.00 0.00 0.00 0.00 0.00
total 0.00 0.06 0.00 0.02 0.07 0.01

plug complete 0.04 0.10 0.03 0.07 0.14 0.06
incomplete 0.01 0.04 0.00 0.01 0.00 0.00
total 0.02 0.07 0.01 0.04 0.07 0.03

open complete 0.13 0.25 0.17 0.13 0.20 0.19
incomplete 0.12 0.02 0.00 0.14 0.19 0.00
total 0.12 0.14 0.09 0.13 0.20 0.10

pull complete 0.05 0.15 0.05 0.05 0.14 0.04
incomplete 0.14 0.07 0.00 0.07 0.03 0.02
total 0.10 0.11 0.02 0.06 0.08 0.03

pick complete 0.09 0.17 0.09 0.07 0.24 0.05
incomplete 0.13 0.06 0.11 0.07 0.10 0.05
total 0.11 0.11 0.10 0.07 0.16 0.05

drink complete 0.09 0.34 0.17 0.07 0.12 0.07
incomplete 0.12 0.08 0.12 0.11 0.17 0.09
total 0.11 0.21 0.15 0.09 0.14 0.08

complete 0.19 0.22 0.20 0.17 0.22 0.14
incomplete 0.13 0.05 0.06 0.10 0.11 0.04
total 0.17 0.18 0.17 0.15 0.20 0.12

Table 3: Relative distance (RD) error on the 16 actions, reported separately for complete and incomplete sequences, comparing frame-level classification (Pre-V), last-frame regression ($V_T^r$) and the four sequence-level voting schemes.

For completeness, we present the full set of results in two tables.

  • Table 2 presents the accuracy for complete and incomplete sequences of the three datasets separately. For the 362 incomplete sequences, across all datasets, the accuracy when using the C-R method is 96.1%. For the 1196 complete sequences, the accuracy when using the C-R method is 85.6%.

  • Table 3 shows the RD evaluation measure for the complete and incomplete sequences of the three datasets separately. Again, C-R voting has the lowest RD error with 0.14 for all complete sequences and 0.04 for all incomplete sequences.

References

  • [Aliakbarian et al.(2017)Aliakbarian, Saleh, Salzmann, Fernando, Petersson, and Andersson] M. S. Aliakbarian, F. S. Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson. Encouraging LSTMs to anticipate actions very early. In CVPR, 2017.
  • [Becattini et al.(2018)Becattini, Uricchio, Ballan, Seidenari, and Del Bimbo] F. Becattini, T. Uricchio, L. Ballan, L. Seidenari, and A. Del Bimbo. Am I done? predicting action progress in videos. arXiv preprint arXiv:1705.01781, 2018.
  • [Donahue et al.(2015)Donahue, Hendricks, Guadarrama, Rohrbach, Venugopalan, Saenko, and Darrell] J. Donahue, L. A. Hendricks, S Guadarrama, M Rohrbach, S Venugopalan, K Saenko, and T Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [Farha et al.(2018)Farha, Richard, and Gall] Y. A. Farha, A. Richard, and J. Gall. When will you do what? anticipating temporal occurrences of activities. In CVPR, 2018.
  • [Feichtenhofer et al.(2016)Feichtenhofer, Pinz, and Zisserman] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
  • [Feichtenhofer et al.(2017)Feichtenhofer, Pinz, and Wildes] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
  • [Gkioxari and Malik(2015)] G. Gkioxari and J. Malik. Finding action tubes. In CVPR, 2015.
  • [Heidarivincheh et al.(2016)Heidarivincheh, Mirmehdi, and Damen] F. Heidarivincheh, M. Mirmehdi, and D. Damen. Beyond action recognition: Action completion in RGB-D data. In BMVC, 2016.
  • [Hoai and De la Torre(2014)] M. Hoai and F. De la Torre. Max-margin early event detectors. IJCV, 2014.
  • [Jain et al.(2014)Jain, Van Gemert, Jégou, Bouthemy, and Snoek] M. Jain, J. Van Gemert, H. Jégou, P. Bouthemy, and C. Snoek. Action localization with tubelets from motion. In CVPR, 2014.
  • [Ji et al.(2013)Ji, Xu, Yang, and Yu] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. PAMI, 2013.
  • [Kuehne et al.(2011)Kuehne, Jhuang, Garrote, Poggio, and Serre] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
  • [Li et al.(2016)Li, Lan, Xing, Zeng, Yuan, and Liu] Y. Li, C. Lan, J. Xing, W. Zeng, C. Yuan, and J. Liu. Online human action detection using joint classification-regression recurrent neural networks. In ECCV, 2016.
  • [Ma et al.(2016)Ma, Sigal, and Sclaroff] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in LSTMs for activity detection and early detection. In CVPR, 2016.
  • [Mahmud et al.(2017)Mahmud, Hasan, and Roy-Chowdhury] T. Mahmud, M. Hasan, and A. K. Roy-Chowdhury. Joint prediction of activity labels and starting times in untrimmed videos. In ICCV, 2017.
  • [Shou et al.(2016)Shou, Wang, and Chang] Z. Shou, D. Wang, and S. Chang. Temporal action localization in untrimmed videos via multi-stage CNNs. In CVPR, 2016.
  • [Simonyan and Zisserman(2014)] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS. 2014.
  • [Simonyan and Zisserman(2015)] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
  • [Soomro et al.(2012)Soomro, Zamir, and Shah] K. Soomro, A. Roshan Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [Tian et al.(2013)Tian, Sukthankar, and Shah] Y. Tian, R. Sukthankar, and M. Shah. Spatiotemporal deformable part models for action detection. In CVPR, 2013.
  • [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • [Wang et al.(2016)Wang, Farhadi, and Gupta] X. Wang, A. Farhadi, and A. Gupta. Actions ~ transformations. In CVPR, 2016.
  • [Wang et al.(2017)Wang, Long, Wang, and Yu] Y. Wang, M. Long, J. Wang, and P. S. Yu. Spatiotemporal pyramid network for video action recognition. In CVPR, 2017.
  • [Xiong et al.(2017)Xiong, Zhao, Wang, Lin, and Tang] Y. Xiong, Y. Zhao, L. Wang, D. Lin, and X. Tang. A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716, 2017.
  • [Yeung et al.(2016)Yeung, Russakovsky, Mori, and Fei-Fei] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In CVPR, 2016.
  • [Yeung et al.(2018)Yeung, Russakovsky, Jin, Andriluka, Mori, and Li] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and F.-F. Li. Every moment counts: Dense detailed labeling of actions in complex videos. IJCV, 2018.
  • [Yu and Yuan(2015)] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In CVPR, 2015.
  • [Yue-Hei Ng et al.(2015)Yue-Hei Ng, Hausknecht, Vijayanarasimhan, Vinyals, Monga, and Toderici] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In ICCV, 2017.