Human Action Sequence Classification

10/07/2019 ∙ by Yan Bin Ng, et al. ∙ Agency for Science, Technology and Research

This paper classifies human action sequences from videos using a machine translation model. In contrast to classical human action classification, which outputs a set of actions, our method outputs a sequence of actions in the chronological order in which the human performs them. Our method is therefore evaluated using sequential performance measures such as Bilingual Evaluation Understudy (BLEU) scores. Action sequence classification has many applications such as learning from demonstration, action segmentation, detection, localization and video captioning. Furthermore, we use our model, trained to output action sequences, to solve downstream tasks such as video captioning and action localization. We obtain state-of-the-art results for video captioning on the challenging Charades dataset, with a BLEU-4 score of 34.8 and a METEOR score of 33.6, outperforming the previous state of the art of 18.8 and 19.5 respectively. Similarly, on ActivityNet captioning, we obtain excellent results in terms of ROUGE (20.24) and CIDER (37.58) scores. For action localization, without using any explicit start/end action annotations, our method obtains a localization performance of 22.2 mAP, outperforming prior fully supervised methods.




1 Introduction

Human action recognition from videos aims to recognize a set of predefined human actions in a given video [19, 44, 37, 14]. To better understand human actions, many related problems have been investigated in the literature, e.g. action detection [15, 55], spatial-temporal action localization [40], action segmentation [20], and early action prediction [18, 33]. All these problems involve classifying videos into action categories at some level. Recently, more challenging human action understanding problems have been proposed such as video captioning [56, 24, 43], text-based temporal activity localization [47] and complex activity recognition [13].

Figure 1: If asked to explain the video, one would say “the man sits down, reads the book and stands up”. It is more human like to describe a video using a sequence of actions.
Figure 2: Given a complex activity description and the video, action-sequence-classification can also output the sequence of human actions that are required to accomplish the complex activity. This can be used to solve problems such as learning from demonstration.

Most natural videos consist of action sequences, and it is instinctive to describe them using action sequences rather than a set of actions. If we were asked to explain the activity shown in the video of Figure 1, we would likely say “the man sits down, reads the book and stands up”, in that order, which implicitly embodies the temporal order of actions. This indicates that a system able to generate an action sequence for a given video is more natural to humans. Usually, when describing a video, we would not explicitly mention the start and end times of each action, as is done in activity detection. Besides, a model’s ability to predict sequences of actions from a video has many applications, e.g. learning from demonstration [1]. Furthermore, it can be used for video retrieval from a temporal action query such as “find videos where a player hits the umpire and gets a red card” (query: hit the umpire, get the red card). In this paper we analyze the problem where we are given only the video and the ground truth action sequence at training time. At test time, given the video, the model should output the correct sequence of actions, as humans do. We call this task action-sequence-classification and investigate it in this paper.

Training a model that is able to output an accurate action sequence without precise temporal annotations is challenging. The model has to learn complex temporal dynamics and relationships between the actions of the video before producing the action sequence. Furthermore, it has to learn implicitly when and where actions do and do not happen, using only the sequence of action labels as supervision. Because one-to-one correspondences between frames and actions are not given, this becomes a difficult learning task. In a way, the model has to align the input frame data with the semantic action sequence while implicitly learning each human action category. Ideally, the model should learn complex relationships and inter-dependencies between actions to further improve action-sequence-classification performance. The most challenging part is determining the number of actions (that is, the length of the output action sequence) within the video. Too many or too few actions in the predicted sequence would significantly hinder performance. All of this has to be learned using only training data consisting of videos and action sequences.

Action-sequence-classification is somewhat related to action detection, which involves predicting the start and end of each action along with the confidence of each prediction. Technically, supervised action detection might be a relatively easier task than action-sequence-classification during training. However, inference for action detection is challenging, and therefore action detectors make use of explicit temporal annotations during training and of action proposals for training and inference. In contrast, we are given only the action sequence, without explicit temporal annotations, to train our model, somewhat similar to weakly supervised action detection [27, 25, 9, 45]. However, these methods [27, 25, 9, 45] neither output a sequence of actions nor make use of the explicit temporal order of actions during training. Even though action-sequence-classification is a more “human-like” task, it is a challenging one for machines.

As shown in a recent study, human action boundaries are ambiguous even for humans [35], and therefore training and evaluation of supervised action detection becomes a challenging task. In contrast, our task only aims at predicting the sequence of actions: we penalize only the wrong order of actions and ignore action boundaries. Therefore, obtaining annotations for our task is somewhat easier and more practical, and potentially results in more consistent and accurate annotations. A model that is trained to classify action sequences has to learn action boundaries implicitly; however, the notion of “action boundaries” is not used explicitly during training or testing. Interestingly, action-sequence-classification is useful for solving other downstream problems such as action detection, localization and segmentation. We illustrate the usefulness of action-sequence-classification by solving video captioning and action localization problems. Therefore, we argue that action-sequence-classification is conceptually interesting and practically useful. Action-sequence-classification can answer the question “what actions are needed to perform activity X?”, as illustrated in Figure 2. Therefore, it is more suitable for some human action understanding tasks [1].

In this paper we propose to tackle a new problem in human action understanding called human action-sequence-classification. Given a video depicting a complex activity, the objective is to predict the sequence of actions. In contrast to action classification and detection, this is a sequence-to-sequence learning task. We propose to solve this problem using machine translation techniques where the input is a video sequence and the output is a sequence of actions. To summarize, the contributions of this paper are as follows:

  • We propose a new task in human action understanding called human-action-sequence classification.

  • We propose a machine translation-based solution to solve this task and investigate two neural translation architectures to solve this challenging problem.

  • We evaluate the performance of our solution against several baselines on three datasets, namely the Charades, ActivityNet 1.3 and MPII Cooking datasets and show consistently better results using our model.

  • We demonstrate usefulness of action sequence classification on two downstream tasks, video captioning and action localization.

  • We obtain results significantly better than the state-of-the-art on Charades captioning and excellent results on ActivityNet 1.3 caption generation task.

  • We obtain better action localization performance outperforming previous supervised methods on Charades dataset.

2 Related work

Figure 3: Given a complex activity description and the video, action sequence classification outputs the sequence of human actions that are required to accomplish the complex activity.

We propose to tackle the action sequence classification problem using LSTM-based machine translation [39]. Typically in machine translation, however, both the input and the output are sequences of words [39]. Several CNN-LSTM architectures have been proposed to solve various action understanding problems such as action classification, action detection and early action classification. In [6] a hierarchical RNN architecture was proposed for skeleton-based action classification. To improve conventional LSTMs, [41] proposed a differential LSTM that makes use of spatio-temporal dynamics for video action classification. LRCN [5] is one of the first methods to use LSTMs for action classification and video captioning. Their video captioning solution is a sequence-to-sequence one; however, they do not use an encoder-decoder architecture as we do, and video captioning is a different task from action sequence classification.

The work known as “Watch-n-Patch” [49] is somewhat related to ours, as it attempts to understand a sequence of actions in an unsupervised manner in order to predict the missing action. Similarly, [52] propose an activity auto-completion (AAC) model for human activity prediction by formulating activity prediction as a query auto-completion (QAC) problem in information retrieval using learning to rank. However, they do not investigate the problem of generating action sequences for a given video. Instead, they aim to predict the next action that is going to happen in the future. Context-aware action recognition [11] is also related to our work, as it attempts to exploit contextual information for action recognition. Our method makes use of the entire video to predict the action sequence by aligning the input video sequence with the output action sequence using an attention mechanism. Our problem and solution are sequence-to-sequence ones, while the method in [11] studies the problem of action classification, which predicts a set of actions within a video.

A bi-directional RNN is used for action detection in [38]. Like most other methods that use LSTMs/RNNs for action understanding tasks, this method takes the video sequence as input and produces an action prediction for each frame or segment. The RNN model is trained with one-to-one input-output sequence correspondences. If the input video sequence has $T$ elements, most action recognition methods that use RNNs would output $T$ action predictions and then aggregate that information to make the final action classification prediction [38, 5]. For us, however, the input and output sequences differ in both length and dimension. The most similar to ours is the work of [23]. They also make use of an encoder-decoder architecture for event detection in videos. However, they apply mean pooling over the decoder to obtain the event prediction and therefore do not generate a sequence of events/actions for a given video. Similarly, LSTM-based prediction of the future actions of a video sequence is presented in [10]. Despite the title, this paper does not output a sequence of actions, but outputs an action for each future frame. Interestingly, however, their encoding process is somewhat similar to ours. Our objective is to demonstrate the value of a model that is able to output an action sequence at test time for a given video. We demonstrate that it is useful for solving many downstream video understanding tasks such as action detection, segmentation, localization and video captioning. Furthermore, it can be useful for practical applications such as learning from demonstration. Therefore, our work differs conceptually and technically from these works [23, 10].

Our approach to action understanding differs from mainstream action detection [54, 38] and action segmentation [32] in the nature of the supervision used. The output of these methods can be further processed to align with the action sequence, e.g. using clustering. However, these methods use precise temporal annotations during training and are therefore different from our model and task. Perhaps weakly supervised action segmentation is the closest to our problem [4]. However, weakly supervised action segmentation is more challenging than our problem, as it needs to infer temporal boundaries using only the action sequence. Similarly, in weakly supervised action detection, none of the methods can generate a sequence of actions without further processing, and they do not make use of the chronological order of actions during training [27, 25, 9, 45].

3 Action sequence classification

3.1 Problem

Given an RGB video sequence $V = \langle v_1, v_2, \dots, v_T \rangle$ and the corresponding sequence of human actions $Y = \langle y_1, y_2, \dots, y_m \rangle$, we learn a model that generates the action sequence $Y$ from the video sequence $V$. Here $v_t$ is an RGB frame and $y_j$ is a categorical human action. $\mathcal{A}$ is the set of human actions and each action $y_j \in \mathcal{A}$. The total number of human actions is fixed, i.e. $|\mathcal{A}| = K$. The model objective is then to learn a set of parameters $\theta$ such that it can predict the action sequence as follows:

$$Y = f(V; \theta) \qquad (1)$$

where both the input sequence and the output sequence are of arbitrary length. This is a sequence-to-sequence machine translation task [39] where the input sequence consists of three-dimensional tensors (RGB frames) and the output sequence consists of categorical symbols (action classes).

3.2 High level idea

To solve this problem, we first extract a sequence of features from each input RGB video sequence. Recently, there has been significant work in action recognition, including methods such as inflated 3D convolutions (I3D) [3], temporal relation networks [58], and temporal segment networks [46]. We use I3D features as the video clip representation because of their good temporal footprint and good performance, and obtain a sequence of I3D features from the input video. Let us denote the sequence of visual features obtained for each video by $X = \langle x_1, x_2, \dots, x_n \rangle$. Then we use Long Short Term Memory (LSTM) networks to encode the I3D visual feature sequence and obtain a single vector representation (e.g. the hidden state of the LSTM) after processing the entire video. Afterwards, the LSTM decoder network takes the hidden state of the encoder LSTM as its initial state to generate the output sequence $Y$. The high level idea of our method is shown in Figure 3. Next we explain two state-of-the-art machine translation networks and adapt them to solve our video-to-action-sequence problem.

3.3 LSTM-Encoder-Decoder: Sequence to sequence model

In this section we explain our basic sequence-to-sequence model, coined LSTM-Encoder-Decoder, which takes a sequence of image features $X$ as input and returns a sequence of actions $Y$. Our model is adapted from the state-of-the-art machine translation architecture proposed in [39]. This LSTM-Encoder-Decoder architecture is shown in Figure 4.

Figure 4: A visual illustration of our LSTM Encoder-Decoder architecture for video feature sequence to action sequence translation.

Let us denote the encoder LSTM by $\mathrm{LSTM}_{enc}$, defined as follows:

$$(h_t, c_t) = \mathrm{LSTM}_{enc}(x_t, h_{t-1}, c_{t-1}) \qquad (2)$$

where $c_t$ is the cell state and $h_t$ is the hidden state of the encoder LSTM. For a given input feature sequence $X = \langle x_1, \dots, x_n \rangle$, the final cell and hidden states of the encoder LSTM are denoted by $c_n$ and $h_n$ respectively. The decoder LSTM has a similar structure to the encoder LSTM; however, it takes the previously predicted action class (symbol) $\hat{y}_{j-1}$ as input and returns the next action symbol $\hat{y}_j$. The initial token SOS is the first input to the decoder. The decoder LSTM has two distinct linear mappings. The first one takes the one-hot vector representation of the action symbol (including SOS) and returns a vector representation of it by learning an embedding matrix. Let us denote this mapping by

$$a_j = W_e \, y_j \qquad (3)$$

where $W_e$ is the embedding parameter and $y_j$ is the one-hot vector of the $j$-th symbol. This operation (equation 3) maps the input symbol to a $d_e$-dimensional vector. Therefore the actual input sequence to the decoder is $\langle a_0, a_1, \dots, a_m \rangle$ during training, where $a_0$ is the embedding of SOS. Let us denote the decoder LSTM by $\mathrm{LSTM}_{dec}$, defined as follows:

$$(s_j, q_j) = \mathrm{LSTM}_{dec}(a_{j-1}, s_{j-1}, q_{j-1}) \qquad (4)$$

where $q_j$ is the context state of the decoder and $s_j$ is the hidden state. The decoder outputs a sequence of actions using the second linear mapping, which maps the hidden state $s_j$ to the next symbol $\hat{y}_j$ as follows:

$$\hat{y}_j = \arg\max \, W_o \, s_j \qquad (5)$$

where $W_o$ is learned. The final hidden state (and cell state) of the encoder is used to initialize the hidden state (and cell state) of the decoder LSTM, i.e. $s_0 = h_n$ and $q_0 = c_n$.

Both the encoder and the decoder are trained by minimizing the cross-entropy loss between the predicted action sequence $\hat{Y}$ and the ground truth sequence $Y$. We train our model with an end-of-sequence token EOS to determine the length of the target sequence. During inference, we stop generating as soon as the EOS token is produced. This means the decoder has two additional symbols, the initial token SOS and the end token EOS. We also use a teacher forcing strategy during training, where the ground truth symbol vector $a_{j-1}$ is fed to the decoder in equation 4 instead of the embedding of the predicted output $\hat{y}_{j-1}$. We use this strategy 50% of the time (at random) during training.
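The decoding loop and the teacher forcing rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: `greedy_decode`, `next_decoder_input` and `toy_step` are hypothetical names, `toy_step` is a stand-in for one step of a trained decoder, and applying teacher forcing per step (rather than per sequence) is an assumption, since the text only says it is used 50% of the time at random.

```python
import random

SOS, EOS = "<SOS>", "<EOS>"

def greedy_decode(step_fn, init_state, max_len=50):
    """Emit action symbols until EOS appears or max_len steps have run."""
    state, prev, actions = init_state, SOS, []
    for _ in range(max_len):
        scores, state = step_fn(prev, state)   # one decoder step
        prev = max(scores, key=scores.get)     # argmax over symbol scores
        if prev == EOS:                        # EOS fixes the output length
            break
        actions.append(prev)
    return actions

def next_decoder_input(ground_truth_prev, predicted_prev, rng=random, ratio=0.5):
    """Teacher forcing: feed the ground-truth symbol with probability `ratio`."""
    return ground_truth_prev if rng.random() < ratio else predicted_prev

def toy_step(prev_symbol, state):
    """Hypothetical stand-in for a trained decoder: replays a fixed script."""
    script = ["sit down", "read book", "stand up", EOS]
    target = script[min(state, len(script) - 1)]
    return {s: float(s == target) for s in script}, state + 1

print(greedy_decode(toy_step, 0))  # ['sit down', 'read book', 'stand up']
```

The `max_len` cap is a safety guard for the sketch; in the model described here the output length is determined by when EOS is generated.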

3.4 GRU-AA: GRU alignment and attention

In Section 3.3, the decoder relies only on the final encoding of the encoder LSTM (i.e. $h_n$ and $c_n$) to produce the output sequence. It is perhaps more intuitive to find the most relevant set of encoder output states for generating each output symbol $\hat{y}_j$. One way to do this is to use an attention mechanism over the encoder outputs, as done in [2]. Furthermore, instead of using an LSTM, we use a GRU, which simplifies the encoder as follows:

$$h_t = \mathrm{GRU}_{enc}(x_t, h_{t-1}) \qquad (6)$$

We also use a GRU for the decoder. For a given input feature vector sequence $X$, the encoder GRU produces the sequence of hidden states $H = \langle h_1, \dots, h_n \rangle$. To produce the decoder output $\hat{y}_j$, the GRU-AA model learns attention weights over the encoder hidden state sequence $H$. The weight ($e_{j,t}$) assigned to the encoder hidden state $h_t$ for generating symbol $\hat{y}_j$ is obtained by the following:

$$e_{j,t} = w^\top \tanh\left( W_a \, [s_{j-1}; h_t] \right) \qquad (7)$$

where $w$ is a learnable parameter vector, $W_a$ is the learnable attention matrix and $s_{j-1}$ is the previous hidden state of the decoder. The notation $[\cdot\,;\cdot]$ is used for column vector concatenation. Thereafter, to obtain the attention weight $\alpha_{j,t}$ for encoder hidden state $h_t$ for generating action symbol $\hat{y}_j$, we use the softmax function over all weights as follows:

$$\alpha_{j,t} = \frac{\exp(e_{j,t})}{\sum_{k=1}^{n} \exp(e_{j,k})} \qquad (8)$$

As only a handful of hidden states in the sequence $H$ contribute to generating the output symbol $\hat{y}_j$, it makes sense to use attention over $H$ when generating the next symbol with the decoder. To do that, we propose to compute a context vector, which is a weighted sum of the encoder hidden states where the weights are given by equation 8. For generating the $j$-th action symbol, the context vector $z_j$ is obtained by equation 9.

$$z_j = \sum_{t=1}^{n} \alpha_{j,t} \, h_t \qquad (9)$$

After that, we modify the decoder GRU to take this context vector along with the action embedding vector $a_{j-1}$ as follows:

$$s_j = \mathrm{GRU}_{dec}\left( [a_{j-1}; z_j], \, s_{j-1} \right) \qquad (10)$$

where $[\cdot\,;\cdot]$ is the vector concatenation. To obtain the next action symbol, we use the following linear mapping ($W_o$) over three concatenated vectors, i.e. the hidden state of the decoder, the attention-weighted context vector and the previous action vector representation.

$$\hat{y}_j = \arg\max \, W_o \, [s_j; z_j; a_{j-1}] \qquad (11)$$

Overall, our model has four parameters to learn apart from the GRU/LSTM parameters: the embedding parameter $W_e$, the attention matrix $W_a$, the attention parameter $w$, and the output linear mapping parameter $W_o$. All these parameters are learned with the cross-entropy loss. As before, we also use the teacher forcing strategy to train the GRU-AA model to translate video sequences into action sequences.
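To make the attention step concrete, the following plain-Python sketch computes a score for each encoder hidden state, normalises the scores with a softmax, and forms the context vector as the weighted sum of encoder states. It is a sketch under stated assumptions: the additive `tanh` scoring form follows the general style of [2] and is assumed here, the function names are hypothetical, and vectors are plain lists rather than tensors.

```python
import math

def additive_score(w, W_a, s_prev, h_t):
    """Score one encoder state: w^T tanh(W_a [s_prev; h_t]) (assumed form)."""
    concat = s_prev + h_t  # list concatenation plays the role of [s; h]
    hidden = [math.tanh(sum(a * x for a, x in zip(row, concat))) for row in W_a]
    return sum(wi * hi for wi, hi in zip(w, hidden))

def softmax(scores):
    """Normalise raw scores into attention weights that sum to one."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_states):
    """Weighted sum of encoder hidden states."""
    dim = len(encoder_states[0])
    return [sum(a * h[k] for a, h in zip(weights, encoder_states))
            for k in range(dim)]

# Toy usage: two encoder states of size 2, decoder state of size 2.
H = [[1.0, 0.0], [0.0, 1.0]]
s_prev = [0.5, -0.5]
W_a = [[0.1, 0.2, 0.3, 0.4]]   # hypothetical 1 x 4 attention matrix
w = [1.0]                      # hypothetical attention vector
scores = [additive_score(w, W_a, s_prev, h) for h in H]
alpha = softmax(scores)
z = context_vector(alpha, H)
```

In the model, `z` would then be concatenated with the previous action embedding and fed to the decoder GRU.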

3.5 Video captioning

Figure 5: A visual illustration of our captioning architecture that uses two GRU-AA models. The first one takes visual feature sequence as input and outputs a sequence of action predictions using the model presented in Section 3.4. Then the second GRU-AA Seq.-to-Seq. model takes this sequence of action predictions as input and outputs the sequence of words.

Our method can be used to solve other downstream tasks such as action detection and action segmentation. Here we use our method to solve video captioning, which we evaluate in the experiments section. Let us assume that we are given the sequence of video features and the corresponding captions. Captions are sequences of words. Using the GloVe [28] vector representation, we transform captions (word sequences) into sequences of GloVe vectors. Then, using our GRU-AA encoder-decoder model, we translate the video feature sequence into an action-prediction sequence. Let us denote this encoder-decoder by $\Phi_a$. Instead of generating action symbols by taking the argmax as in equation 11, for $\Phi_a$ we directly take the output predictions (score vectors) and generate the sequence of action predictions.

Then, using another GRU-AA encoder-decoder model $\Phi_c$, we translate the sequence of action predictions into a GloVe word sequence. Therefore, we have two sequence-to-sequence models (namely $\Phi_a$ and $\Phi_c$) in our video captioning architecture, as illustrated in Figure 5. During training, we use captions, action sequences and video features. During testing, we generate captions using only video features. First we train the action sequence generation model as in Section 3.4 and then use that model to initialize $\Phi_a$. Then we jointly train $\Phi_a$ and $\Phi_c$ to generate captions.

4 Experiments

4.1 Dataset

To evaluate the video-to-action-sequence classification, we need a dataset that consists of multiple actions within a video. Therefore, we use three datasets, the Charades dataset [36], MPII Cooking dataset [31] and the ActivityNet 1.3 dataset [7].

In the Charades dataset, each video has a list of action labels accompanied by corresponding timestamps. The timestamps are not used explicitly in the experiments; they are used only to generate the sequence of actions for each video. The Charades dataset has 7,985 videos for training and 1,863 videos for testing. The dataset is collected in 15 types of indoor scenes, involves interactions with 46 object classes and has a vocabulary of 30 verbs, leading to 157 action classes [36]. On average there are 6.8 actions per video, which is much higher than in most other datasets having more than 1000 videos.
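As an illustration of how timestamps can be turned into an action sequence, the sketch below orders annotated actions chronologically and then discards the times. The tuple layout `(label, start, end)`, the example labels, and the tie-breaking rule (start time first, then end time) are assumptions made for this example; the paper does not specify how overlapping actions are ordered.

```python
def action_sequence(annotations):
    """Order (label, start, end) annotations by time and keep only the labels.

    Sorting by start time, with end time as a tie-breaker, is an assumed rule.
    """
    ordered = sorted(annotations, key=lambda a: (a[1], a[2]))
    return [label for label, _start, _end in ordered]

# Hypothetical annotations for one video (times in seconds).
annotations = [
    ("stand up", 20.0, 22.5),
    ("sit down", 0.0, 3.1),
    ("read book", 3.0, 19.0),
]
print(action_sequence(annotations))  # ['sit down', 'read book', 'stand up']
```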

The MPII Cooking dataset [31] has 65 fine-grained actions and 44 long videos with a total length of more than 8 hours. Twelve participants interact with different tools, ingredients and containers to make a cooking recipe. We use the standard evaluation splits, in which five subjects are always kept in the training set. For each of the remaining seven subjects, the other six are added to the training set and all models are tested on the single held-out subject; this is repeated seven times in a seven-fold cross-validation manner. In this dataset, there are 46 actions per video on average, many more than in the other datasets.

ActivityNet 1.3 dataset aims at covering a wide range of complex human activities that are of interest to people in their daily living. This dataset consists of 200 action classes and on average 2 actions per video [7].

4.2 Evaluation metrics

As this is a sequence evaluation task, we use the Bilingual Evaluation Understudy (BLEU) score [26]. Specifically, we use BLEU-1 and BLEU-2 scores, as some datasets contain at most two actions per video. We also report sequence-item classification accuracy, which counts how many times the predicted sequence elements match the ground truth in the exact position. For example, if the ground truth is and the predicted is , as there are three elements that exactly match, the accuracy would be 3/4*100%. If the predicted is then accuracy is 2/3*100% and if the output is accuracy is 0% as none of the elements match the ground truth.
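Both measures can be sketched in plain Python. The code below is an illustrative reading, not the evaluation script used in the paper: the accuracy denominator is assumed to be the length of the predicted sequence (one reading that is consistent with the 3/4 and 2/3 examples), the BLEU variant is a single-reference, unsmoothed sentence-level score with the usual brevity penalty, and the action labels `a` to `e` are placeholders.

```python
import math
from collections import Counter

def item_accuracy(reference, predicted):
    """Percentage of predicted items equal to the reference at the same position.

    Dividing by the predicted length is an assumption of this sketch.
    """
    if not predicted:
        return 0.0
    matches = sum(r == p for r, p in zip(reference, predicted))
    return 100.0 * matches / len(predicted)

def ngram_counts(seq, n):
    """Count all contiguous n-grams of a sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def bleu(reference, predicted, max_n=2):
    """Unsmoothed single-reference BLEU-max_n, on a 0 to 100 scale."""
    if not predicted:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        pred = ngram_counts(predicted, n)
        ref = ngram_counts(reference, n)
        total = sum(pred.values())
        clipped = sum(min(c, ref[g]) for g, c in pred.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty n-gram overlap gives zero
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty punishes predictions shorter than the reference.
    if len(predicted) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(reference) / len(predicted))
    return 100.0 * penalty * math.exp(log_precision)

reference = ["a", "b", "c", "d"]
print(item_accuracy(reference, ["a", "b", "c", "e"]))  # 75.0
print(item_accuracy(reference, ["b", "a", "d", "c"]))  # 0.0
```

Position-wise accuracy is strict: a single inserted or dropped action shifts every later item, which is one reason to report BLEU alongside it.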

4.3 Implementation Details

We use the I3D network [3], which is pre-trained on the Kinetics dataset [16] for action classification. The hidden size used for models with a single input is 512. The embedding size used for the action tokens is 512 (i.e. the output size of the embedding mapping). We train our models with SOS, EOS and padding symbols. Action sequence classification models are trained with a batch size of 32 for 10 epochs using a learning rate of

with early stopping.

4.4 Baseline comparisons

We compare our LSTM:Encoder-Decoder (LSTM-ED) (Section 3.3) and GRU-AA (Section 3.4) models with three baselines. First, we evaluate against random performance. Second, we use the mean-pooled I3D features as input to a stacked LSTM model (LSTM-Mean), whose structure is similar to our model, to output the sequence of action symbols. Third, we train an LSTM model (LSTM-SS) that takes the sequence of I3D features as input and outputs the action sequence. Results are reported in Table 1.

Model        Class. Acc. (%)   BLEU-1   BLEU-2
Charades dataset
Random        0.35              2.02     0.02
LSTM-Mean     4.54              9.64     1.03
LSTM-SS       5.13             10.69     1.12
LSTM-ED       3.18             10.63     1.62
GRU-AA        4.64             14.83     1.62
MPII Cooking dataset
LSTM-Mean    11.17             15.47     8.75
LSTM-SS      14.06             19.92    10.65
GRU-AA       14.86             25.74    14.34
ActivityNet 1.3 dataset
LSTM-Mean    34.43             38.09     2.95
LSTM-SS      10.70             12.31     0.50
GRU-AA       45.0              51.53     3.64
Table 1: Comparison of results for action sequence classification task using I3D features on three action recognition datasets.

First, from these results we see that all models perform better than random. On average, the results on the Charades dataset are the lowest, indicating the difficulty of the task on this challenging dataset. Apart from LSTM-SS, all models obtain their best accuracy and BLEU-1 scores on the ActivityNet dataset. Interestingly, our LSTM-ED and GRU-AA models outperform the other baselines, except in classification accuracy on the Charades dataset, where the LSTM-SS baseline obtains 5.13. On all other datasets the best performer is GRU-AA, which indicates that our design choice is the correct one for this challenging task. Somewhat unexpectedly, the LSTM-Mean model also obtains good results, perhaps suggesting the power of I3D features. Our method (GRU-AA) obtains significantly better results than the other baselines on the ActivityNet dataset, with a BLEU-1 score of 51.53. However, the BLEU-2 score there is still lower than one would like, though better than the other baselines. Nevertheless, our method obtains a BLEU-2 score of 14.34 on the MPII Cooking dataset, where there are more actions in each video. Results on the MPII Cooking dataset suggest that ours is the best method by a large margin in terms of BLEU scores. From all these results we conclude that the best performing method is our GRU-AA model.

4.5 Downstream Task 1: Video captioning

Figure 6: Generated captions using our model. Ground truth (GT) captions are shown in green and the predicted ones are in blue.

Now we evaluate our video captioning method presented in Section 3.5 using the Charades and ActivityNet 1.3 dense captioning datasets [17]. We use the I3D feature sequence to first generate the sequence of actions (score sequence) and then use another sequence-to-sequence model (GRU-AA) to translate the sequence of action predictions into captions. Therefore, our Stage-GRU-AA model consists of two stages of GRU-AA models, as explained in Section 3.5.

We use the standard captioning protocol and data as in [57, 50, 8, 21] for Charades. Recently, the authors of [48] introduced a new set of captions for the Charades dataset called “Charades captions”, which we also use. Most other recent methods [57, 50] use the original Charades captions, and we also use the exact data used in [57, 50, 8, 21]. The dataset is split into three parts: 7985 videos for training, of which 30% are used for validation, and 1863 videos for testing. We use the protocol of [17] for ActivityNet captioning.

First we compare all baseline models on both Charades and ActivityNet 1.3 dense captioning in Table 2. The first LSTM baseline (LSTM-Mean) takes the mean I3D features as input and generates captions for the video using a stacked LSTM. The second LSTM baseline (LSTM-SS) takes the sequence of I3D features as input and generates the captions using a stacked LSTM. The third baseline is the LSTM:Encoder-Decoder (LSTM-ED) model presented in Section 3.3, which takes the sequence of I3D features as input and generates the output captions. The fourth baseline is the GRU-AA model presented in Section 3.4, which takes the sequence of I3D features and outputs the captions. We use the same GRU encoder-decoder with attention model as described in Section 3.4.


Model          METEOR   ROUGE-L
Charades captions dataset
LSTM-Mean      22.76    29.49
LSTM-SS        22.45    28.70
LSTM-ED        19.08    29.22
GRU-AA         22.34    32.61
Stage-GRU-AA   33.60    38.15
ActivityNet 1.3 captions dataset
LSTM-Mean       6.12    17.94
LSTM-SS         5.38    16.34
Stage-GRU-AA    8.24    20.23
Table 2: Video captioning results on the Charades and ActivityNet 1.3 captions datasets using our method presented in Section 3.5.

From Table 2, we see that the LSTM-Mean model performs quite reasonably despite its simplicity. Somewhat surprisingly, the LSTM-SS model shows a decrease in performance on both datasets. The LSTM-ED model, which uses an LSTM-based encoder-decoder for video caption generation, obtains lower results in METEOR but moderate results in ROUGE-L. The GRU-AA model, which takes the visual feature sequence and translates it to captions, works better than all the previous baselines in ROUGE-L, indicating that attention is important for this task. Our Stage-GRU-AA model, which first translates the videos to action sequences and then the action sequences to captions, obtains the best results on both datasets. The improvement is quite significant. This shows the importance of learning this intermediate high-level semantic representation of temporal information, in the form of an action sequence, for the downstream task of video captioning.

Next we compare our model performance with state-of-the-art methods and report results in Table 3 for challenging Charades dataset.

Method                B1     B2     B3     B4     M
Charades dataset results
S2VT [42] ICCV15      49.0   30.0   18.0   11.0   16.0
SA [53] ICCV15        40.3   24.7   15.5   10.8   14.3
MAAM [8] 2016         50.0   31.1   18.8   11.5   17.6
MAM [21] IJCAI17      53.0   31.7   21.3   13.3   19.1
TSL [50] CVPR18       -      -      -      13.5   17.8
VCTF [57] IJCAI18     50.7   31.3   19.7   13.3   19.0
Our                   50.4   45.6   41.0   33.5   23.4
Charades captions* dataset results
HRL [48] CVPR18       64.4   44.3   29.4   18.8   19.5
Our                   53.0   48.2   42.9   34.8   33.6
Table 3: Charades video captioning results using our method presented in Section 3.5. B1-B4 stands for BLEU-1 to BLEU-4 scores and M stands for METEOR. A dash marks a score that is not reported.

We can see immediately the impact of our method. We outperform all other methods in most measures (BLEU-2 to BLEU-4 and METEOR). Our results are quite significant for BLEU-4 and METEOR where we outperform recent methods such as TSL [50] and HRL [48] by almost 20.0 BLEU-4 points. Similarly, our METEOR score is 4.3 points better than the best performing method. We significantly outperform HRL [48] in METEOR score.

We believe the reasons for our excellent results are threefold: (i) the action sequence prediction model captures the temporal evolution of human actions and keeps the relevant information to summarize the activity in the video; (ii) our GRU-AA machine translation model is capable of effectively aligning and translating the input sequence to the output; and (iii) the decision to use action sequence prediction scores instead of the action sequence itself is also crucial, as the score distribution contains more information and is more robust.

These results indicate the effectiveness of action sequence prediction for solving downstream tasks such as video captioning. We also visualize some of the captions generated by our model in Figure 6. Our method is able to accurately generate captions containing the correct actions. Interestingly, it is not able to correctly identify the context (dining room vs office, and for the second one the context is missing). This is not surprising, as our method is trained to predict actions accurately but not the context. In future, we aim to investigate how to include more contextual information to improve the video captioning task. We also report dense captioning results on the ActivityNet 1.3 dataset in Table 4 using ground truth (GT) proposals and compare with other methods that used GT proposals. We obtain good results in terms of CIDER and ROUGE-L. However, our METEOR score is relatively low. This might be because there are only about 2 actions per video in ActivityNet, whereas Charades has more actions, which our method can better exploit to generate better captions. Here our intention is to show that action sequence classification is useful for other downstream tasks. We leave improving captioning on datasets such as ActivityNet, where there are only a few actions per video, for future work.

| Method | M | C | R |
|---|---|---|---|
| Dense [17] | 9.46 | 24.56 | - |
| Mask [59] | 11.10 | - | - |
| JEDDi [51] | 8.58 | 19.88 | 19.63 |
| DVC [22] | 10.33 | 25.24 | - |
| Our | 8.25 | 37.58 | 20.24 |
Table 4: ActivityNet dense captioning results using our method presented in Section 3.5. M stands for METEOR, C for CIDEr and R for ROUGE-L.

4.6 Downstream Task 2: Action localization

Our second downstream task is action localization on the Charades dataset. We use the GRU-AA model trained in Section 4.4 to solve this task. Following the action localization protocol in [29, 9, 34], we classify 25 equally spaced frames, obtain the localization scores, and compute mAP. Because we do not use precise temporal annotations for training (as done in [29, 34]), our method is moderately supervised. We report results in Table 5.

| Supervision | Method | Features | mAP |
|---|---|---|---|
| Supervised | Temporal Fields [34] | VGG16 (RGB+OF) | 12.8 |
| | Two Stream++ [37] | VGG16 (RGB+OF) | 10.9 |
| | Super-Events [29] | I3D (RGB) | 18.6 |
| | Super-Events [29] | I3D (RGB+OF) | 19.4 |
| | LSTM | I3D (RGB+OF) | 18.1 |
| | TGM [30] | I3D (RGB+OF) | 21.5 |
| Weakly | WSGN [9] | I3D (RGB) | 18.3 |
| Moderately | Our | I3D (RGB) | 20.1 |
| | Our | I3D (RGB+OF) | 22.2 |
Table 5: Comparison with the state of the art for action localization on Charades. All methods use only RGB and optical flow (OF) features.
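The evaluation protocol of this section (score a fixed number of equally spaced frames, then compute mAP over action classes) can be sketched as follows. This is a simplified illustration rather than the official Charades evaluation code: the function names are hypothetical, including both end frames in the sampling is an assumption, and classes without any positive frame are simply skipped.

```python
def equally_spaced_indices(num_frames, num_samples=25):
    """Return num_samples frame indices spread evenly over the video.

    Assumption: the first and last frames are both included.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def average_precision(scores, labels):
    """Average precision of one class from per-frame scores and 0/1 labels."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if labels[index]:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


def mean_average_precision(score_rows, label_rows):
    """mAP over classes; rows are sampled frames, columns are classes."""
    num_classes = len(score_rows[0])
    per_class = []
    for c in range(num_classes):
        class_labels = [row[c] for row in label_rows]
        if any(class_labels):  # skip classes that never occur
            class_scores = [row[c] for row in score_rows]
            per_class.append(average_precision(class_scores, class_labels))
    return sum(per_class) / len(per_class) if per_class else 0.0


frame_ids = equally_spaced_indices(num_frames=100)
```

In use, the model would be run on the frames selected by `equally_spaced_indices` for every test video, the per-frame class scores stacked into `score_rows`, and the result compared against the frame-level ground truth in `label_rows`.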

Compared with state-of-the-art methods such as Super-Events [29] and TGM [30], we obtain quite reasonable results without using precise temporal annotations during training. Our method also performs better than the recent weakly supervised WSGN [9]. Our GRU-AA outperforms the recent TGM [30] method by 0.7 mAP (a combination of TGM and Super-Events obtains 22.3 mAP [30]). We attribute this to the fact that our method explicitly models correlations between actions and better aligns the action sequence with the feature sequence using an attention mechanism. These results suggest the power of the action sequence classification task for solving downstream tasks such as action localization and video captioning.
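To make the alignment idea concrete, the sketch below implements one step of additive attention in the style of [2]: a decoder state is scored against every feature vector of the input sequence, the scores are normalized with a softmax, and the weighted sum forms the context used to predict the next action. The weight matrices, dimensions, and function names are hypothetical, and whether GRU-AA uses exactly this scoring function is an assumption.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(decoder_state, features, w_state, w_feature, v):
    """One attention step: returns (alignment weights, context vector).

    energy_t = v . tanh(w_state @ decoder_state + w_feature @ features[t])
    """
    projected_state = matvec(w_state, decoder_state)
    energies = []
    for feature in features:
        projected_feature = matvec(w_feature, feature)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_feature)]
        energies.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(energies)
    dim = len(features[0])
    # Context is the attention-weighted average of the input features.
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context


# Toy example: three time steps of 2-D features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = additive_attention([0.5, -0.5], features, identity, identity, [1.0, 1.0])
```

The alignment weights form a distribution over input time steps for each output action, which is what allows the same model to be read out for localization without explicit start and end annotations.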

5 Discussion and Conclusion

In this paper, we presented a new task called video action sequence classification, which outputs a sequence of actions instead of a set of actions as done in the action recognition literature. We formulated two solutions to this problem using neural machine translation. We obtain interesting and encouraging results on three difficult action recognition datasets: MPII Cooking, Charades and ActivityNet.

Secondly, we extended action sequence classification to solve other downstream tasks, specifically video captioning and action localization. We obtained significant improvements over prior state-of-the-art results in video captioning and action localization on the Charades dataset. Similarly, we obtained significant results on ActivityNet captioning for CIDEr and ROUGE-L. In the future, in an extended paper, we plan to demonstrate results on other downstream tasks to convince the community that "action sequence classification" is worth investigating. Finally, we will release the code and models for the community to use.
Acknowledgment: This research is supported by the National Research Foundation Singapore under its AI Singapore Programme (Award Number: AISG-RP-2019-010).


  • [1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning (2009) A survey of robot learning from demonstration. Robotics and autonomous systems 57 (5), pp. 469–483. Cited by: §1, §1.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.4.
  • [3] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6299–6308. Cited by: §3.2, §4.3.
  • [4] C. Chang, D. Huang, Y. Sui, L. Fei-Fei, and J. C. Niebles (2019-06) D3TW: discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [5] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2625–2634. Cited by: §2, §2.
  • [6] Y. Du, W. Wang, and L. Wang (2015-06) Hierarchical recurrent neural network for skeleton based action recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [7] B. G. Fabian Caba Heilbron and J. C. Niebles (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–970. Cited by: §4.1, §4.1.
  • [8] R. Fakoor, A. Mohamed, M. Mitchell, S. B. Kang, and P. Kohli (2016) Memory-augmented attention modelling for videos. arXiv preprint arXiv:1611.02261. Cited by: §4.5, Table 3.
  • [9] B. Fernando, C. T. Y. Chet, and H. Bilen (2019) Weakly supervised gaussian networks for action detection. arXiv preprint arXiv:1904.07774. Cited by: §1, §2, §4.6, §4.6, Table 5.
  • [10] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes (2019) Forecasting future action sequences with neural memory networks. British Machine Vision Conference (BMVC). Cited by: §2.
  • [11] M. Hasan and A. K. Roy-Chowdhury (2015-12) Context aware active learning of activity recognition models. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [12] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.
  • [13] N. Hussein, E. Gavves, and A. W. Smeulders (2019) Timeception for complex action recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 254–263. Cited by: §1.
  • [14] S. Ji, W. Xu, M. Yang, and K. Yu (2012) 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231. Cited by: §1.
  • [15] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar (2014) THUMOS challenge: action recognition with a large number of classes. Cited by: §1.
  • [16] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §4.3.
  • [17] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles (2017) Dense-captioning events in videos. In International Conference on Computer Vision (ICCV), pp. 706–715. Cited by: §4.5, §4.5, Table 4.
  • [18] T. Lan, T. Chen, and S. Savarese (2014) A hierarchical representation for future action prediction. In European Conference on Computer Vision (ECCV), pp. 689–704. Cited by: §1.
  • [19] I. Laptev (2005) On space-time interest points. International Journal of Computer Vision 64 (2-3), pp. 107–123. Cited by: §1.
  • [20] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager (2017) Temporal convolutional networks for action segmentation and detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 156–165. Cited by: §1.
  • [21] X. Li, B. Zhao, X. Lu, et al. (2017) MAM-rnn: multi-level attention model based rnn for video captioning.. In IJCAI, pp. 2208–2214. Cited by: §4.5, Table 3.
  • [22] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei (2018) Jointly localizing and describing events for dense video captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7492–7500. Cited by: Table 4.
  • [23] A. Liu, Z. Shao, Y. Wong, J. Li, Y. Su, and M. Kankanhalli (2019) LSTM-based multi-label video event detection. Multimedia Tools and Applications 78 (1), pp. 677–695. Cited by: §2.
  • [24] J. Mun, L. Yang, Z. Ren, N. Xu, and B. Han (2019) Streamlined dense video captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6588–6597. Cited by: §1.
  • [25] P. Nguyen, T. Liu, G. Prasad, and B. Han (2018-06) Weakly supervised action localization by sparse temporal pooling network. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [26] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318. Cited by: §4.2.
  • [27] S. Paul, S. Roy, and A. K. Roy-Chowdhury (2018) W-talc: weakly-supervised temporal activity localization and classification. In European Conference of Computer Vision (ECCV), pp. 563–579. Cited by: §1, §2.
  • [28] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §3.5.
  • [29] A. Piergiovanni and M. S. Ryoo (2018) Learning latent super-events to detect multiple activities in videos. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5304–5313. Cited by: §4.6, §4.6, Table 5.
  • [30] A. Piergiovanni and M. Ryoo (2019) Temporal gaussian mixture layer for videos. In International Conference on Machine Learning (ICML), pp. 5152–5161. Cited by: §4.6, Table 5, footnote 1.
  • [31] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele (2012) A database for fine grained activity detection of cooking activities. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1194–1201. Cited by: §4.1, §4.1.
  • [32] Q. Shi, L. Wang, L. Cheng, and A. Smola (2008) Discriminative human action segmentation and recognition using semi-markov model. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8. Cited by: §2.
  • [33] Y. Shi, B. Fernando, and R. Hartley (2018) Action anticipation with rbf kernelized feature mapping rnn. In European Conference on Computer Vision (ECCV), pp. 301–317. Cited by: §1.
  • [34] G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta (2017) Asynchronous temporal fields for action recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.6, Table 5.
  • [35] G. A. Sigurdsson, O. Russakovsky, and A. Gupta (2017-10) What actions are needed for understanding human actions in videos?. In International Conference on Computer Vision (ICCV), Cited by: §1.
  • [36] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In European Conference on Computer Vision (ECCV), Cited by: §4.1, §4.1.
  • [37] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576. Cited by: §1, Table 5.
  • [38] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao (2016-06) A multi-stream bi-directional recurrent neural network for fine-grained action detection. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §2.
  • [39] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112. Cited by: §2, §3.1, §3.3.
  • [40] Y. Tian, R. Sukthankar, and M. Shah (2013) Spatiotemporal deformable part models for action detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2642–2649. Cited by: §1.
  • [41] V. Veeriah, N. Zhuang, and G. Qi (2015-12) Differential recurrent neural networks for action recognition. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [42] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko (2015-12) Sequence to sequence - video to text. In International Conference on Computer Vision (ICCV), Cited by: Table 3.
  • [43] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko (2015) Sequence to sequence-video to text. In International Conference on Computer Vision (ICCV), pp. 4534–4542. Cited by: §1.
  • [44] H. Wang and C. Schmid (2013) Action recognition with improved trajectories. In International Conference on Computer Vision (ICCV), pp. 3551–3558. Cited by: §1.
  • [45] L. Wang, Y. Xiong, D. Lin, and L. Van Gool (2017) Untrimmednets for weakly supervised action recognition and detection. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [46] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), pp. 20–36. Cited by: §3.2.
  • [47] W. Wang, Y. Huang, and L. Wang (2019) Language-driven temporal activity localization: a semantic matching reinforcement learning model. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 334–343. Cited by: §1.
  • [48] X. Wang, W. Chen, J. Wu, Y. Wang, and W. Yang Wang (2018-06) Video captioning via hierarchical reinforcement learning. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.5, §4.5, Table 3.
  • [49] C. Wu, J. Zhang, S. Savarese, and A. Saxena (2015-06) Watch-n-patch: unsupervised understanding of actions and relations. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [50] X. Wu, G. Li, Q. Cao, Q. Ji, and L. Lin (2018-06) Interpretable video captioning via trajectory structured localization. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.5, §4.5, Table 3.
  • [51] H. Xu, B. Li, V. Ramanishka, L. Sigal, and K. Saenko (2019) Joint event detection and description in continuous video streams. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 396–405. Cited by: Table 4.
  • [52] Z. Xu, L. Qing, and J. Miao (2015-12) Activity auto-completion: predicting human activities from partial videos. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [53] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville (2015-12) Describing videos by exploiting temporal structure. In International Conference on Computer Vision (ICCV), Cited by: Table 3.
  • [54] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei (2016-06) End-to-end learning of action detection from frame glimpses in videos. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [55] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei (2016) End-to-end learning of action detection from frame glimpses in videos. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2678–2687. Cited by: §1.
  • [56] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu (2016) Video paragraph captioning using hierarchical recurrent neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4584–4593. Cited by: §1.
  • [57] B. Zhao, X. Li, X. Lu, et al. (2018) Video captioning with tube features.. In IJCAI, pp. 1177–1183. Cited by: §4.5, Table 3.
  • [58] B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In European Conference on Computer Vision (ECCV), pp. 803–818. Cited by: §3.2.
  • [59] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong (2018) End-to-end dense video captioning with masked transformer. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8739–8748. Cited by: Table 4.