1 Related Work
In this section, we first review different approaches to action detection and then discuss recent works related to our approach, with a focus on action detection and image generation. Before deep learning, several issues, such as view-independency [Roh2010], multi-modality [Maeng2012], and consideration of human identity [Park2013], were challenges for the action recognition task. For further information, we refer readers to [poppe2010survey].
Action detection: Action detection is a relatively new field compared to action classification. In recent action classification work, Ma et al. [MA2017334] studied utilizing web images for better model generalization, and Ijjina and Chalavadi [IJJINA2016199] exploited a genetic algorithm to train CNN models efficiently. In the early stages of action detection, Hoai et al. [hoai2011joint] performed joint segmentation and recognition on concatenated trimmed videos. Gaidon et al. [gaidon2013temporal] localized actions temporally within the Coffee and Cigarettes dataset, which contains untrimmed videos with two classes. Since then, Shou et al. [shou2016temporal] have studied action detection on untrimmed video datasets, such as THUMOS'14 [THUMOS14] and ActivityNet [caba2015activitynet]. With excellent results in the THUMOS'14 competition, Oneata et al. [oneata2014lear] and Wang et al. [wang2014action]
employed improved dense trajectories (IDT) encoded by Fisher vectors (FVs) and CNN features, along with SVM classifiers based on sliding windows with variations in window sizes, fusion methods, and post-processing. Inspired by these works, Shou et al. [shou2017cdc] proposed multi-stage CNNs that consider spatio-temporal information and a convolutional-de-convolutional (CDC) network that performs spatial downsampling and temporal upsampling to capture abstractions for action semantics and temporal dynamics. Yuan et al. [yuan2016temporal] captured temporal context by proposing a pyramid of score distribution features (PSDF) descriptor. Yeung et al. [yeung2016end]
proposed a frame-wise end-to-end framework with reinforcement learning. Modeling actions in a grammatical form through N-grams and latent Dirichlet allocation (LDA) has also been explored for action detection [richard2016temporal].
Meanwhile, some works considered action detection not only temporally but also spatially. Lan et al. [lan2015action] and Weinzaepfel et al. [weinzaepfel2015learning] proposed a spatio-temporal action localization (or parsing) method by representing mid-level action elements using a hierarchical structure and localizing actions with a spatio-temporal motion histogram (STMH) descriptor individually at the track level. Additionally, Lea et al. [lea2016segmental] detected fine-grained actions in ego-centric video datasets using graphical models based on tracking hands and objects or using deep neural networks.
While most of the methods mentioned above focus on offline settings, several methods performed online action detection by predicting an action's ending point using temporal priors or unsupervised learning. Recently, new datasets for online action detection have been proposed. De Geest et al. [de2016online] introduced the RGB-based TVSeries dataset with baseline models using long short-term memory (LSTM) and a CNN. Li et al. [li2016online] introduced a skeleton-based online action dataset (OAD) and proposed a joint classification-regression method using LSTM.
Temporal sliding window-based detection: It is crucial to consider the temporal contextual information of time-series data, such as video and sound. In the action detection literature, many methods have extensively employed the temporal sliding window, which moves a window of a specific size over time. To make the most of temporal contextual information, [shou2016temporal, yuan2016temporal] used multi-scale sliding windows: features are extracted using windows of various scales (e.g., 16, 32, or 64 frames), and the detection results of each scale are post-processed to generate a final prediction. However, this approach is unsuitable for online action detection because only a limited amount of information is available. Therefore, despite the success achieved through multi-scale windows, we employ a single-scale window approach in this paper.
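The single-scale sliding window can be sketched as follows (the function name and the `overlap` parameter are illustrative choices; the 16-frame length and the non-overlapping training / 50%-overlap test configurations follow the experimental setup described later):

```python
def sliding_windows(num_frames, length=16, overlap=0.0):
    """Return (start, end) frame indices of single-scale temporal windows.

    overlap=0.0 gives non-overlapping training windows;
    overlap=0.5 gives the 50%-overlap windows used at test time.
    """
    stride = max(1, int(length * (1.0 - overlap)))
    return [(s, s + length) for s in range(0, num_frames - length + 1, stride)]

train_windows = sliding_windows(64)               # [(0, 16), (16, 32), (32, 48), (48, 64)]
test_windows = sliding_windows(64, overlap=0.5)   # stride of 8 frames, 7 windows
```

Multi-scale methods would call this with several `length` values and merge the results; the single-scale approach keeps one fixed length, which is what makes it usable online.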
Image generation: Several approaches have been proposed for generating target images from given images. Kingma et al. [kingma2013auto] proposed a variational inference-based method by extending the auto-encoder structure. Dosovitskiy et al. introduced a method applying a CNN structure for object-oriented image generation. Additionally, Radford et al. [radford2015unsupervised] exploited generative adversarial networks (GANs) [goodfellow2014generative], which learn data distributions through adversarial training between a generator and a discriminator network. Based on these approaches, several video frame generation methods have been proposed. Finn et al. [finn2016unsupervised] proposed a method that generates future frames through a combination of RNNs and auto-encoding (a variational auto-encoder). Mathieu et al. [mathieu2015deep] and Vondrick et al. [vondrick2016generating] proposed video frame generation methods exploiting deep convolutional GANs [radford2015unsupervised]. Most recently, Villegas et al. [villegas2017decomposing] proposed a motion-content network (MCnet) that considers temporal and spatial information separately, which resembles the way the human brain processes temporal and spatial information [Bulthoff2002]. In this paper, to resolve the limited-information issue in online settings, we adopt MCnet to generate future frames.
In this section, we first give an overview of the proposed framework. We then detail each of its components, including the architectures of the deep neural networks used. We also elaborate on how to train the networks on a large-scale dataset.
The objective of our framework is to detect actions in untrimmed video streams. The framework is composed of four deep networks: a proposal representation (PR) network that discriminates between actions and background scenes (Sec. 2.1), an action representation (AR) network that predicts the type and temporal order of an action (Sec. 2.2), a future frame generation (FG) network that generates future video frames (Sec. 2.3), and a detection network that detects actions by receiving outputs from the other networks (Sec. 2.4). Fig. 2 illustrates the pipeline of the proposed framework.
Our motivations for choosing these networks are as follows. Unlike the action classification task alone, action detection from untrimmed videos requires action representations dedicated not only to the action itself but also to background scenes. Intuitively, the visual traits of background scenes and actions are different. Thus, in the proposed framework, we exploit two deep networks to solve two different tasks: one distinguishes background scenes from actions, and the other classifies actions of interest, e.g., twenty classes for THUMOS'14. Both networks have the same structure, 3D convolutional layers followed by fully connected layers (Fig. 3), which has shown outstanding performance for action classification. In online situations, as described in Sec. 1, the action localization task suffers from a shortage of information. To resolve this issue, we propose using the future frame generation network. The detection network is composed of LSTM layers to model temporal correlations and capture local temporal changes, such as motion features.
2.1 Detecting Action Candidate Spots
Untrimmed videos are an irregular combination of actions and background scenes. In this situation, it is necessary to detect candidate segments where an action is likely to occur, i.e., to distinguish between actions and background scenes. Towards this goal, we train a proposal representation (PR) network. The PR network takes a video segment as input, acquired by a temporal sliding window of fixed length. The network is trained to classify two classes – background scene and action scene – via a 3DCNN whose final fully connected layer has two neurons.
Network architecture: We employ a 3DCNN to consider spatial and temporal information simultaneously. Different from widely used conventional 2D CNNs, our 3DCNN learns motion context, an important clue in video analysis, by adding a time axis to the 2D image coordinate system. We adopt the 3DCNN architecture of [tran2015learning]: 8 convolutional layers, 5 pooling layers, 2 fully connected layers, and an output layer with a softmax function. Details of the architecture are shown in Fig. 3 (top row).
2.2 Learning Visual Traits from a Temporal Order
The beginning and ending phases of all action instances in the same action class share an identical trait. For example, given videos containing scenes of a person throwing a baseball, the beginning phase of all videos would contain the person making a wind-up posture before throwing the ball, and the ending phase would contain the person leaning forward and lifting his leg back after throwing the ball. However, the durations of these phases differ per instance (see Fig. 4). To capture this trait, we design an action representation (AR) network that considers actions as a set of temporally ordered subclasses. Specifically, we divide each action class into beginning and ending phase classes, and train a 3DCNN to classify these subclasses.
Learning temporally ordered subclasses allows the model to represent the time phase of an action using only visual information. Compared to methods that exploit the average length of each action as a temporal prior to detect actions, e.g., [li2016online], our method detects actions using only the given input sequence.
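A minimal sketch of this label reorganization (the mapping convention, placing the beginning subclass before the ending subclass, is an illustrative assumption; the class counts follow the text):

```python
def to_subclass(action_class, phase):
    """Map action class c to its temporally ordered subclass:
    2c for the beginning phase, 2c + 1 for the ending phase."""
    assert phase in ('beginning', 'ending')
    return 2 * action_class + (0 if phase == 'beginning' else 1)

# 20 THUMOS'14 classes yield 40 subclasses (0..39);
# 200 ActivityNet classes would analogously yield 400.
subclasses = {to_subclass(c, p) for c in range(20) for p in ('beginning', 'ending')}
```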
Network architecture: The AR network has an architecture identical to the PR network, except that the last fully connected layer consists of 2C neurons, where C is the number of target classes, i.e., 40 neurons for THUMOS'14 and 400 for ActivityNet. Details of the architecture of the AR network are shown in Fig. 3 (bottom row).
2.3 Generating Future Frames
The major limitation in online settings is that only past and current information can be considered for decision-making. To overcome this limitation, we introduce a future frame generation (FG) network. In this paper, we design the FG network with the same architecture as the MCnet proposed in [villegas2017decomposing], which considers spatial and temporal information by modeling content and encoding motion, respectively.
Network architecture: The FG network is an encoder-decoder network composed of two different networks: a content encoder with a CNN architecture and a motion encoder with a convolutional LSTM architecture. Generated samples are illustrated in Fig. 5. We refer to [villegas2017decomposing] for further details.
2.4 Detecting Actions by Modeling Temporal Correlations
To detect actions, modeling temporal correlations and capturing local temporal changes, such as motion features, are essential. However, the PR and AR networks, which employ 3DCNNs, lack this ability. Therefore, we design our detection network with a recurrent neural network (RNN) that can model temporal correlations. The network takes the outputs from each fully connected layer (fc) of the PR and AR networks as input (see Fig. 6). The detection network uses the outputs of the other networks to reflect the response (opinion) of each network (expert) for a given input sample over time; it then derives the final results by modeling temporal correlations across these responses using the RNN.
Network architecture: There are various types of RNNs, such as long short-term memory (LSTM) and gated recurrent units (GRUs). In this paper, the detection network consists of a dropout layer with probability 0.5, two LSTM layers with 128 states each, another dropout layer with probability 0.5, and a fully connected layer whose neurons correspond to the action classes plus one background class of the target dataset, i.e., 21 neurons for THUMOS'14 and 201 for ActivityNet. Fig. 7 shows details of the architecture of the detection network.
2.5 Training
PR and AR networks: To train the PR and AR networks, we first initialize the weights of all convolutional layers (Conv1a to Conv5b) and the first fully connected layer (fc6) with the pre-trained 3DCNN [tran2015learning]. We then fine-tune these networks on the target benchmark dataset, either THUMOS'14 or ActivityNet. To train the PR network, we merge the labels into two: foreground (action) and background (non-action). To train the AR network, we divide each action class into two subclasses, beginning and ending, yielding 40 classes for THUMOS'14 and 400 classes for ActivityNet.
In experiments, we use stochastic gradient descent (SGD) optimization with a learning rate of 0.0001, momentum of 0.9, weight decay factor of 0.0005, and dropout probability of 0.5. We use the cross-entropy loss to update the network weights:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij},$$

where $N$ is the total number of training samples, $y_{ij}$ is the $j$-th element of the ground truth probability of the $i$-th sample, $\hat{y}_{ij}$ is the $j$-th element of the predicted probability (network output) of the $i$-th sample, and $C$ is the total number of classes: 2 for the PR network and the number of action subclasses of the target dataset for the AR network.
FG network: We fine-tune the FG network on the action classes of a target dataset. The FG network uses the loss function composed of different sub-losses: an image loss and generator loss (we refer to [villegas2017decomposing] for further details). In experiments, we fine-tune the network with a learning rate of 0.001.
Detection network: To account for the imbalanced number of instances among classes, we weight each class in the loss as follows:

$$w_c = \frac{N_{\max}}{N_c},$$

where $N_c$ is the number of training instances of the $c$-th class and $N_{\max}$ is the largest instance number among all classes. In experiments, we use RMSProp optimization with a learning rate of 0.0001 and a dropout probability of 0.5. We use the weighted cross-entropy as the loss function.
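The class-weight rule can be sketched as follows (the instance counts are hypothetical, for illustration only):

```python
def class_weights(instance_counts):
    """w_c = N_max / N_c: classes with fewer training instances
    receive proportionally larger loss weights."""
    n_max = max(instance_counts.values())
    return {c: n_max / n for c, n in instance_counts.items()}

# Hypothetical per-class instance counts:
weights = class_weights({'background': 1000, 'GolfSwing': 250, 'HighJump': 500})
```

The most frequent class always receives weight 1.0, so the weighting only scales rarer classes up rather than scaling any class down.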
2.6 Data Augmentation
We introduce a video data augmentation technique to further improve the single-scale temporal sliding window method. Both the single-scale and multi-scale window methods have issues. A single-scale window cannot capture the regions that are important for representing an action if its length is set improperly, while multi-scale windows require post-processing techniques that are unsuitable for online settings. Therefore, we augment the training data by varying the lengths of videos. This augmentation achieves effects similar to multi-scale windows even though we only use single-scale windows.
We conduct augmentation in two ways: increasing and decreasing the playback speed. The former simulates a video clip being played faster by sampling frames from the original video. The latter simulates a video clip being played slower; this effect is achieved by motion interpolation using the butterflow algorithm (details and an implementation can be found at https://github.com/dthpham/butterflow). Motion interpolation renders intermediate frames between two frames based on motion. Specifically, given two consecutive frames, this technique fills the space between them with generated intermediate frames, as shown in Fig. 8.
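The two augmentations can be sketched as frame-index plans (function names are illustrative; the non-integer positions stand for intermediate frames that butterflow would render by motion interpolation):

```python
def speed_up_indices(num_frames, factor=2):
    """Simulate a faster clip by keeping every `factor`-th original frame."""
    return list(range(0, num_frames, factor))

def slow_down_positions(num_frames, factor=2):
    """Frame positions of a slower clip; fractional positions mark
    intermediate frames to be rendered by motion interpolation."""
    return [i / factor for i in range((num_frames - 1) * factor + 1)]

fast = speed_up_indices(8)      # [0, 2, 4, 6]
slow = slow_down_positions(3)   # [0.0, 0.5, 1.0, 1.5, 2.0]
```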
In general, data augmentation helps a model to better generalize [Krizhevsky_imagenet]. With the proposed data augmentation, a model sees training videos with different temporal resolutions. This augmentation mimics, to a certain extent, the effect when using a multi-scale window. As a multi-scale window would learn from having windows of different temporal scales as input during training time, with the augmented data, a single-scale window learns from temporal variations during training time in a different way.
3 Experiments
We evaluated the proposed framework on two large benchmark datasets: THUMOS'14 [THUMOS14] and ActivityNet [caba2015activitynet]. We first describe the experimental setup, then compare the performance with other methods on the two benchmark datasets and present an ablation study. Finally, we describe the limitations of the proposed framework.
3.1 Experimental Setup
Implementation details: The length of the temporal sliding window is 16 frames. During the training phase, there is no overlap between sliding windows, i.e., sliding windows at consecutive time steps do not intersect. In the test phase, we allow 50% overlap between adjacent sliding windows. At each time step, the PR and AR networks take 16 frames as input. The FG network generates 8 future frames from the input frames; these are concatenated with the past 8 input frames, and the resulting 16 frames are also passed to the PR and AR networks (Fig. 2). The detection network receives a 16,384-dimensional vector as input, which corresponds to the outputs of the PR network (4096×2: 4096 from the input frames and 4096 from the generated frames concatenated with the second half of the input frames) and the AR network (4096×2, same as the PR network).
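The assembly of the detection-network input can be sketched as a shape-only illustration (the function name is ours; the 4096-d fc outputs follow the C3D architecture described above):

```python
def detection_input(pr_fc_past, pr_fc_future, ar_fc_past, ar_fc_future):
    """Concatenate the four 4096-d fc outputs (PR/AR networks on past
    frames and on generated-future frames) into one detection input."""
    for v in (pr_fc_past, pr_fc_future, ar_fc_past, ar_fc_future):
        assert len(v) == 4096
    return pr_fc_past + pr_fc_future + ar_fc_past + ar_fc_future

# Zero vectors stand in for real fc activations:
x = detection_input([0.0] * 4096, [0.0] * 4096, [0.0] * 4096, [0.0] * 4096)
```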
The PR and AR networks were trained for 50 epochs on each of the training and validation sets of THUMOS'14. The FG network was trained for 150k epochs on the training set. Training the detection network is a nontrivial part: we first train the network for 100 epochs on the training set, then 100 epochs on the validation set, and repeat this three times. Each training batch is organized as 8 time steps, i.e., 8 sliding windows, to allow the network to learn long-term temporal relationships of input frames. In the test phase, a single sliding window is passed to the detection network. During training, we augment the data by doubling the playback speed and by halving it.
Evaluation metric: We use interpolated average precision (AP) and mean average precision (mAP) to evaluate the performance of our model, following the guidelines of the THUMOS'14 action detection task. A detection result is evaluated as a true positive when the intersection over union (IoU) between the predicted temporal range and the ground truth temporal range is larger than an overlap threshold.
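The true-positive criterion can be sketched as follows (function names are illustrative; temporal ranges are (start, end) pairs in frames or seconds):

```python
def temporal_iou(pred, gt):
    """IoU of two temporal ranges given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """A detection counts as a true positive when its IoU with the
    ground truth exceeds the overlap threshold."""
    return temporal_iou(pred, gt) >= threshold

# A prediction covering 8 of 10 ground-truth frames has IoU 0.8:
hit = is_true_positive((0, 10), (2, 10), threshold=0.5)
```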
3.2 Experimental Results on THUMOS'14
Dataset: We consider the twenty classes of the THUMOS'14 dataset for evaluation: BaseballPitch, BasketballDunk, Billiards, CleanAndJerk, CliffDiving, CricketBowling, CricketShot, Diving, FrisbeeCatch, GolfSwing, HammerThrow, HighJump, JavelinThrow, LongJump, PoleVault, Shotput, SoccerPenalty, TennisSwing, ThrowDiscus, and VolleyballSpiking. The training set consists of 2,765 trimmed videos that each contain one action. The background set consists of 2,500 untrimmed videos that include actions in undefined categories. The validation and test sets consist of 1,010 and 1,574 untrimmed videos, respectively, each containing more than one action instance, including backgrounds.
Comparison to other methods: We evaluate our model and compare its performance to various offline action detection methods: Wang et al. [wang2014action], Oneata et al. [oneata2014lear], Yeung et al. [yeung2016end], Richard et al. [richard2016temporal], Shou et al. [shou2017cdc], and Zhao et al. [zhao2017temporal] on THUMOS'14, because there are no online methods reported on this dataset. As a baseline, we implemented a framework identical to the proposed method except that the FG network is removed and the model is trained without data augmentation. We report the performance with the mAP metric (3) for thresholds ranging from 0.1 to 0.7.
Tab. 1 summarizes the performance of the methods on THUMOS'14. Note that the proposed method was tested in the online setting, where only past and current information is available at a given moment, while the other methods were tested in the offline setting, where rich temporal context is available. Taking this into consideration, the performance of our method is comparable to that of the offline methods. The proposed method significantly outperforms the baseline, with a 0.06 higher mAP on average. Fig. 10 visualizes results on some challenging test instances of THUMOS'14.
3.3 Experimental Results on ActivityNet
Dataset: We use the latest version, 1.3, of the ActivityNet dataset, which contains two hundred activity classes. The training set consists of 10,024 untrimmed videos that contain single activity instances. The validation and test sets consist of 4,926 and 5,044 untrimmed videos, respectively. In this experiment, we use the validation set for evaluation since the annotations of the test set are not available.
Comparison to other methods: We evaluate our model and compare its performance to three offline action detection methods: Heilbron et al. [Heilbron_2017_CVPR], Shou et al. [shou2017cdc], and Zhao et al. [zhao2017temporal] on ActivityNet, because there are no online methods reported on this dataset. As a baseline, we implemented a framework identical to the proposed method except that the FG network is removed and the model is trained without data augmentation. We report the performance with the mAP metric (3) at thresholds 0.5, 0.75, and 0.95, as in [Heilbron_2017_CVPR].
Tab. 2 summarizes the performance of the methods on ActivityNet. Note that the proposed method was tested in the online setting, where only past and current information is available at a given moment, while the other methods were tested in the offline setting, where rich temporal context is available. Taking this into consideration, the performance of our method is comparable to that of the offline methods. The proposed method significantly outperforms the baseline, with a 0.06 higher mAP on average. Fig. 11 visualizes results on some challenging test instances of ActivityNet.
3.4 Additional Analysis
Performance in the online setting: Since the proposed method works in the online setting, it is not straightforward to directly compare its performance with offline methods. To the best of our knowledge, there is no metric for online action detection performance. Thus, we evaluate per-frame mAP as exploited by Shou et al. [shou2017cdc]. Tab. 3 shows the performance on THUMOS'14. The proposed method outperforms Shou et al. by 0.01.
Computational complexity: We tested the proposed framework on a single NVIDIA Titan X GPU with 12GB memory. The speed of the proposed framework is around 9 frames per second (fps). If each C3D network in the proposed framework instead had three convolutional streams to deal with different temporal resolutions, similar to multi-scale methods, the speed would decrease to 7 fps, which implies that the proposed data augmentation allows lower computational cost and more efficient memory usage.
3.5 Ablation Study
We conduct additional experiments to analyze the impact of each model component by eliminating them one at a time. The experiments are conducted with six model setups: i) baseline, ii) without data augmentation (w/o Aug), iii) the AR network connected in serial after the PR network (w/ CS), iv) without the FG network (w/o FG), v) the full model (Full), and vi) ground truth as the future frame generation output (w/ FG GT). Tab. 4 summarizes the results on the THUMOS'14 dataset and Tab. 5 on ActivityNet.
Baseline: In this setup, the proposed framework consists of only the PR, AR, and detection networks. The performance is shown in the second row of Tab. 4 and Tab. 5 and is inferior to all other settings.
Data augmentation: Without data augmentation during model training, the performance decreases by 0.01 on average on both datasets (third row). This result indicates that proper data augmentation, including the frame interpolation used in the proposed framework, can further improve performance.
Proposal representation and action representation configuration: This setup studies which of the two C3D network arrangements, parallel or serial, is more useful in the proposed framework. As exploited in [shou2016temporal], in the serial arrangement the AR network only takes the input segments classified as action by the PR network. In this setup, the performance decreases, on average, by 0.02 and 0.01 on THUMOS'14 and ActivityNet, respectively (fourth row).
Future frame generation: Exploiting the future frame generation component increases the mAP, on average, by 0.05 on THUMOS'14 and 0.02 on ActivityNet (fifth row). This performance gain is the most significant among all components. The result demonstrates that the FG network allows our framework to consider more information than it could without it; thus, the limitations of the online setting are mitigated to a certain extent.
Ground truth as future frame generation output: To simulate this setup, we replace the output of the FG network with the ground truth frames. In other words, given an input sequence, the FG network generates future frames; we replace these generated frames with the actual future frames. The results are shown in the bottom row of Tab. 4 and Tab. 5. Compared to the performance of the full model (sixth row), using the ground truth as the output of the FG network (bottom row) increases the mAP, on average, by 0.01 on THUMOS'14 and 0.006 on ActivityNet. This result indicates that improving future frame generation performance leads to an increase in detection performance.
To summarize, as we argued in Sec. 1, a limited amount of information is a significant factor in the online action detection scenario for video streams; the FG network resolves this limitation by feeding predicted future frames of a short period, eight frames in this paper, to the system so that more information is considered. Augmenting data also improves detection performance, which means that making a model aware of variations in action duration matters to a certain extent. Finally, arranging the two C3D networks in parallel, instead of connecting them in serial, is more effective in the proposed framework.
We demonstrated that the proposed method shows comparable performance on two benchmark datasets. However, there are several limitations, which we summarize as follows.
Computational complexity: The proposed framework exploits four deep neural networks, which amount to roughly 174M parameters. This is due to the difficulties of the online action localization task, which requires several components to deal with the lack of available information, distinguish actions from background scenes, and accurately localize the start and end of an action. Thus, we designed the proposed framework with four deep neural networks, each dedicated to one of the issues mentioned above.
Limited backpropagation during training: This limitation comes from the limited GPU resources available to handle all four networks at the same time. As described in Sec. 3, each network of the proposed framework is trained separately, which implies that the detection error at the final LSTM network does not backpropagate to the input layers of the AR, PR, and FG networks.
Dependency on the FG network: We demonstrated that using generated future frames improves online temporal action localization performance. However, the generation quality is not satisfactory compared to the real ground truth frames. There is large room for improving the generation performance, on which the proposed method depends.
Room for further improvement: As mentioned above, the limitations of the proposed method mostly come from the hardware side. We expect that, with enough computational resources to train the proposed framework with proper end-to-end backpropagation, the performance will improve to a certain extent.
4 Conclusion and Future work
In this paper, we proposed a novel action detection framework to address the challenging problem of online action detection in untrimmed video streams. To resolve the limited-information issue, we proposed exploiting a future frame generation network. To learn temporal order using only visual information, without any temporal prior such as action duration, we reorganized each action class into two temporally ordered subclasses. To make the proposed framework generalize better, we augmented the training video data by varying the duration of actions.
We demonstrated that the performance of the proposed framework is comparable with offline methods on two benchmark datasets, THUMOS'14 and ActivityNet. Through the ablation study, we showed that the FG network gives a meaningful improvement. We believe that other time-series tasks, such as traffic flow prediction [POLSON20171] and financial market analysis [CHONG2017187], can also benefit from using a future generation network. Meanwhile, there are also several limitations: the dependency on the future frame generation network and the computational complexity of the proposed framework need to be addressed for further improvement.
As future work, we plan to design a more efficient feature extraction network so that the whole framework can learn with the same backpropagation error. We also plan to formulate action detection as a multitask learning problem.