Comparative Analysis of CNN-based Spatiotemporal Reasoning in Videos

Understanding actions and gestures in video streams requires temporal reasoning over the spatial content from different time instants, i.e., spatiotemporal (ST) modeling. In this paper, we present a comparative analysis of different ST modeling techniques. Since convolutional neural networks (CNNs) have proven to be effective feature extractors for static images, we apply ST modeling techniques to the features that CNNs extract from static images at different time instants. All techniques are trained end-to-end together with the CNN feature extraction part and evaluated on two publicly available benchmarks: the Jester and the Something-Something datasets. The Jester dataset contains various dynamic and static hand gestures, whereas the Something-Something dataset contains actions of human-object interactions. The common characteristic of these two benchmarks is that the designed architectures need to capture the full temporal content of the actions/gestures in the correct order. Contrary to expectations, experimental results show that recurrent neural network (RNN) based ST modeling techniques yield inferior results compared to other techniques such as fully convolutional architectures. Code and pretrained models of this work are publicly available.



1. Introduction

Deep learning has been successfully applied in the area of image processing, providing state-of-the-art solutions for many of its problems such as super-resolution (Ledig et al., 2017), image denoising (Liu et al., 2018), and classification (Deng et al., 2009). Due to the outstanding performance of two-dimensional (2D) convolutional neural networks (CNNs) on processing static images, many attempts have been made to generalize 2D CNN architectures to capture the spatiotemporal (ST) structure of videos (Simonyan and Zisserman, 2014; Wang et al., 2016). Until recently, 2D CNNs were the only option for video analysis tasks, since the lack of large-scale video datasets made it impossible to train 3D CNNs properly.

Figure 1. Spatio-Temporal Modeling Architecture: One input video containing an action/gesture is divided into N segments. Afterwards, equidistant frames $(m_1, m_2, \ldots, m_N)$ are selected from the segments and fed to a 2D CNN for feature extraction. Extracted features are transformed to a fixed size via one fully connected layer and fed to an ST modeling block. This block produces the final class score of the input video. In this example, the action of "taking something from somewhere" is depicted.

With the availability of large-scale video datasets such as Kinetics (Carreira and Zisserman, 2017), deeper and wider 3D CNN architectures can be successfully trained to achieve better performance compared to 2D CNNs (Hara et al., 2018). More importantly, 3D CNNs can capture the ST patterns in videos inherently without requiring additional mechanisms. However, their drawback is that the input size must remain fixed, e.g., 16 or 32 frames, which makes them unsuitable for capturing actions of varying temporal length. This is not a problem for activity recognition tasks on Kinetics (Carreira and Zisserman, 2017) or UCF-101 (Soomro et al., 2012), as videos can be successfully classified using even very small snippets of the complete video. However, there are tasks where the designed architectures need to observe the complete video at once in order to make successful decisions. For these tasks, 2D CNN based architectures are still useful, as a complete video can be sparsely sampled with a desired number of segments and the features of the selected frames can be extracted. Still, these architectures need an extra mechanism to provide ST modeling of the extracted features.

This work aims to analyze and compare various techniques for ST modeling of the features extracted by a 2D CNN from sparsely sampled frames of action/gesture videos. Fig. 1 depicts the analyzed ST modeling architecture. A complete action/gesture video is divided into a predefined number of segments. From each segment, a frame is selected (randomly in training and equidistant in testing) and fed into the 2D CNN to extract its features. In order to understand which type of action/gesture is performed, an ST modeling technique is used. In this work, we analyze multi-layer perceptron (MLP) based techniques such as simple MLP, Temporal Relation Network (TRN) and Temporal Segment Network (TSN); recurrent neural network (RNN) based techniques such as vanilla RNN, gated recurrent unit (GRU), long short-term memory (LSTM) and bidirectional LSTM (B-LSTM); and finally a fully convolutional technique.

The analyzed ST modeling techniques are evaluated on two publicly available benchmarks: (i) the Jester dataset, which contains dynamic and static hand gesture videos, and (ii) the Something-Something dataset, which contains videos of various human-object interactions. The common aspect of both datasets is that a recognition architecture needs to analyze the full content of a video in order to recognize it successfully, which makes them well-suited benchmarks for analyzing ST modeling techniques.

The rest of the paper is organized as follows. In Section 2, we present related work in 2D CNN based action/gesture recognition. Section 3 explains the details of the analyzed ST modeling techniques. Section 4 presents the experiments and results. Finally, Section 5 concludes the paper.

2. Related Work

Deep learning architectures for ST modeling have been extensively studied in recent years, particularly in the context of action and gesture recognition (Karpathy et al., 2014; Simonyan and Zisserman, 2014; Zhou et al., 2018; Wang et al., 2016). Karpathy et al. (Karpathy et al., 2014) suggest several CNN architectures that fuse information across the temporal domain and apply the resulting models to the Sports-1M classification and UCF Action Recognition data sets. To speed up the training, they propose a CNN-based multi-resolution architecture that could slightly improve the final results. Two-stream CNNs (Simonyan and Zisserman, 2014; Feichtenhofer et al., 2016) fuse a spatial network processing the video frames with a temporal network using optical flow to obtain a common class score. These methods rely on separately processing the spatial and temporal components of the video, which can be a disadvantage. 3D convolutional neural networks, on the other hand, can be used to inherently learn the spatiotemporal structure of videos (Hara et al., 2018; Köpüklü et al., 2019). Tran et al. (Tran et al., 2015) apply a 3D CNN architecture to obtain spatiotemporal feature volumes of input videos. To reduce training complexity, Sun et al. (Sun et al., 2015) propose a factorization of 3D spatiotemporal kernels into sequential 2D spatial kernels and separately handle sequence alignment. Although a sparse sampling strategy can be applied to the input volume to span a larger time duration (Köpüklü and Rigoll, 2018), all 3D architectures have the disadvantage that the input size needs to be fixed, which limits their capability of handling data sets with varying video lengths.

Recurrent neural networks are a natural choice for processing variable-length video sequences, and several modern architectures have been proposed for action recognition in videos. Donahue et al. (Donahue et al., 2015) employ an LSTM after CNN-based feature extraction on the individual frames to learn spatiotemporal components and apply the architecture to the UCF Action Recognition data set. Similarly, Baccouche et al. (Baccouche et al., 2011) use 3D convolutional neural networks together with an LSTM network. Liu et al. (Liu et al., 2016) suggest modifying the vanilla LSTM architecture to learn spatiotemporal domains. Another recurrent method is the Differential RNN (Veeriah et al., 2015), which is driven by salient motion patterns in consecutive video frames.

Newer methods like Temporal Segment Networks (Wang et al., 2016) enable processing longer videos by dividing the input video into a certain number of segments, selecting short snippets randomly from each segment and finally fusing the individual prediction scores. These prediction scores are the result of a spatial convolutional network operating on the sampled frames and a temporal convolutional network operating on optical flow components. Similarly, Temporal Relation Networks (Zhou et al., 2018) extract a number of ordered frames from the input video, which are then passed through a convolutional neural network for feature extraction.

3. Methodology

Figure 2. Illustration of Temporal Relation Networks. Features extracted from different segments of a video by a 2D CNN are fed into different frame relation modules. Only a subset of the 2-frame, 3-frame, and 4-frame relations are shown in this example (4 segments), as there are higher frame relations included according to the segment size.

In this section, we first describe the complete ST modeling architecture, which is based on a 2D CNN feature extraction part and one ST modeling block. Afterward, we investigate different ST modeling techniques in detail that can be used within this architecture. Finally, we will give the training details used in the experiments.

3.1. ST Modeling Architecture

As illustrated in Fig. 1, a video clip V that contains a complete action/gesture is divided into N segments. Each segment consists of sequential frames, each with the same spatial resolution and number of channels. Only the RGB modality is used in all training runs. Afterward, equidistant frames are selected from the segments and passed to a 2D CNN model for feature extraction. Extracted features are transformed to a fixed size of 256 (except for TSN, where features are transformed to the number of classes) via a one-layer multi-layer perceptron (MLP), i.e., a single fully connected layer.
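The sparse sampling step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and taking the centre frame of each segment at test time is an assumption about how "equidistant" selection is realized. Random selection during training follows the description in the introduction.

```python
import random


def segment_bounds(num_frames, num_segments):
    """Split frame indices [0, num_frames) into contiguous, near-equal segments."""
    edges = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [(edges[i], edges[i + 1]) for i in range(num_segments)]


def sample_frame_indices(num_frames, num_segments, training, rng=random):
    """Pick one frame index per segment.

    Training: a random frame inside each segment.
    Testing: the centre frame of each segment (assumed meaning of 'equidistant').
    """
    indices = []
    for start, end in segment_bounds(num_frames, num_segments):
        if end <= start:
            # Empty segment: the clip has fewer frames than segments.
            indices.append(min(start, num_frames - 1))
        elif training:
            indices.append(rng.randrange(start, end))
        else:
            indices.append((start + end - 1) // 2)
    return indices


# A 32-frame clip split into 8 segments of 4 frames each.
test_indices = sample_frame_indices(32, 8, training=False)
train_indices = sample_frame_indices(32, 8, training=True, rng=random.Random(0))
```

Because every index comes from its own segment, the selected frames always stay in temporal order, which is what the order-sensitive ST modeling blocks rely on.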

For feature extraction, two different CNN models are used: (i) SqueezeNet (Iandola et al., 2016) with simple bypass and (ii) Inception with Batch Normalization (BN-Inception) (Ioffe and Szegedy, 2015). These models are chosen so that the performance of the investigated ST modeling techniques can be evaluated with both a lightweight CNN feature extractor (SqueezeNet) and a more complex, heavyweight CNN feature extractor (BN-Inception). In this way, the CNN-model-agnostic performance of the evaluated techniques can be observed.

Extracted features are finally fed to an ST modeling block, which produces the final class scores of the input video clip. Next, we are going to investigate different ST modeling techniques in detail that are used in this block.

Figure 3. Simple MLP technique. Extracted features are concatenated, keeping their order the same, to form an $(N \times 256)$-dimensional vector. This vector is fed to a 2-layer MLP to get final class scores.

3.2. Multi-layer Perceptron (MLP) based Techniques

MLP-based ST modeling techniques are a simple but effective way to incorporate temporal information. These techniques make use of MLPs once or multiple times. Extracted features are fed to these MLP-based ST modeling blocks with their order kept intact. The intuition is that MLPs can capture the temporal information of the sequence inherently, without knowing that it is a sequence at all.

3.2.1. Simple MLP

As illustrated in Figure 3, extracted features are concatenated while preserving their order. Then, the concatenated one-dimensional vector is fed to a 2-layer MLP with 512 and number-of-classes neurons. Finally, the output is fed to a softmax layer to get class conditional scores.
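The following pure-Python sketch illustrates this block under stated assumptions. The layer sizes (8 segments, 256-dimensional features, 512 hidden units, 27 classes as in Jester) come from the text, while the ReLU between the two layers, the random initialization and all function names are assumptions made for illustration only.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    """Random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def simple_mlp(frame_features, layer1, layer2):
    """Concatenate per-segment features in temporal order, then apply a 2-layer MLP."""
    concatenated = [v for feature in frame_features for v in feature]
    hidden = [max(0.0, h) for h in linear(concatenated, *layer1)]  # ReLU (assumed)
    return softmax(linear(hidden, *layer2))


rng = random.Random(0)
N, FEAT, HIDDEN, NUM_CLASSES = 8, 256, 512, 27
features = [[rng.gauss(0, 1) for _ in range(FEAT)] for _ in range(N)]
layer1 = init_layer(HIDDEN, N * FEAT, rng)
layer2 = init_layer(NUM_CLASSES, HIDDEN, rng)
probs = simple_mlp(features, layer1, layer2)
```

Since each segment occupies a fixed slice of the concatenated vector, feeding the same features in a different order generally changes the output, which is how the MLP can pick up temporal order without any explicit sequence model.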

This is a simple but effective approach. Combined with other modalities such as optical flow, infrared and depth, competitive results can be achieved (Köpüklü et al., 2018).

3.2.2. Temporal Segment Network (TSN)

TSN aims to achieve long-range temporal structure modeling using sparse sampling strategy (Wang et al., 2016). When the original paper was written, TSN achieved state-of-the-art performance on two activity recognition datasets, namely the UCF-101 (Soomro et al., 2012) dataset and the HMDB (Kuehne et al., 2011) dataset.

The original TSN architecture uses optical flow and RGB modalities, as well as different consensus methods such as even averaging, maximum, and weighted averaging. Among them, even averaging achieved the best results in the original experiments. Therefore, we also use even averaging, applied to the RGB modality only.

Figure 4. Temporal Segment Network (TSN) architecture. Extracted frame features are transformed to Number-of-classes dimension and averaged. Then the resulting vector is fed to softmax layer to get class conditional scores.

The corresponding TSN approach is depicted in Figure 4. Unlike other ST modeling techniques, the extracted frame features are transformed into a fixed size of number-of-classes instead of 256. Afterward, all extracted features are averaged and fed to a softmax layer to get class conditional scores.

Although TSN achieved state-of-the-art performance on UCF-101 and HMDB benchmarks at the time, it achieves inferior performance in the Jester and Something-Something benchmarks. The reason is that averaging causes loss of temporal information. This does not create a huge problem for the UCF-101 and HMDB benchmarks as temporal order is not critical for these. Correct classification can even be achieved using only one frame of the complete video. However, the Jester and Something-Something datasets require the incorporation of the complete video in order to infer correct class scores.
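A small sketch makes the loss of temporal information concrete. The per-segment scores below are hypothetical numbers chosen for illustration; the point is only that averaging is invariant to the order of the segments, so a clip and its time-reversed version receive the same class scores.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tsn_consensus(segment_scores):
    """Average per-segment class scores, then apply softmax."""
    num_segments = len(segment_scores)
    mean_scores = [sum(column) / num_segments for column in zip(*segment_scores)]
    return softmax(mean_scores)


# Hypothetical per-segment scores for 3 classes over 4 segments.
forward_clip = [[2.0, 0.5, 0.1], [1.5, 0.7, 0.2], [0.3, 1.9, 0.4], [0.2, 2.2, 0.1]]
reversed_clip = forward_clip[::-1]  # same frames, opposite temporal order

forward_probs = tsn_consensus(forward_clip)
reversed_probs = tsn_consensus(reversed_clip)
# The two results agree up to floating-point rounding.
```

Gesture pairs that differ only in direction of motion are therefore indistinguishable for an averaging consensus whenever their individual frames look alike.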

3.2.3. Temporal Relation Network (TRN)

TRNs (Zhou et al., 2018) aim to discover possible temporal relations between observations at multiple time scales. The main inspiration for this work comes from the relational reasoning module for visual question answering (Santoro et al., 2017). The pairwise temporal relations (2-frame relations) on the observations of the video V are defined as

$$T_2(V) = h_\phi\Big(\sum_{i<j} g_\theta(f_i, f_j)\Big),$$

where the input is the features of the selected frames of the video $V = \{f_1, f_2, \ldots, f_N\}$, in which $f_i$ represents the feature of the $i^{th}$ frame segment extracted by a 2D CNN. Here, $h_\phi$ and $g_\theta$ represent the feature fusing functions, which are MLPs with parameters $\phi$ and $\theta$, respectively. For these functions, the exact same MLP block as depicted in Figure 3 is used. These two-frame temporal relation functions are further extended to higher frame relations such as three-frame relations given by

$$T_3(V) = h'_\phi\Big(\sum_{i<j<k} g'_\theta(f_i, f_j, f_k)\Big),$$

where the order of the segments is always kept the same so that temporal relations are learned inherently. Finally, all frame relations can be incorporated in order to get a single final output

$$MT_N(V) = T_2(V) + T_3(V) + \ldots + T_N(V),$$

referred to as multiscale TRN, where each $T_d$ captures temporal relationships between features of $d$ ordered frames. The overall TRN architecture is depicted in Figure 2.
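To see how many frame relations a multiscale TRN has to consider, the ordered frame tuples can be enumerated directly. The sketch below only enumerates indices; it does not implement the MLPs, and the function names are hypothetical.

```python
from itertools import combinations


def ordered_frame_tuples(num_segments, relation_size):
    """All index tuples i < j < ... of the given size, i.e. temporally ordered subsets."""
    return list(combinations(range(num_segments), relation_size))


def multiscale_relations(num_segments):
    """Map each relation size d = 2..N to its ordered frame tuples."""
    return {d: ordered_frame_tuples(num_segments, d) for d in range(2, num_segments + 1)}


# Four segments, as in the example of Figure 2.
relations = multiscale_relations(4)
counts = {d: len(tuples) for d, tuples in relations.items()}

# Eight segments, as used in most experiments.
total_for_eight = sum(len(tuples) for tuples in multiscale_relations(8).values())
```

For four segments this yields six 2-frame, four 3-frame and one 4-frame relation. The number of tuples grows quickly with the number of segments, which is one reason the multiscale variant is comparatively expensive.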

3.3. Recurrent Neural Networks (RNN) based Techniques

Recurrent neural networks (RNNs) are a special type of artificial neural network and consist of recurrently connected hidden layers which are capable of capturing temporal information. Furthermore, they allow the input and output sequences to vary in size. Consider an input sequence $x = (x_1, \ldots, x_T)$ with $x_t \in \mathbb{R}^d$. Each input is passed to a hidden layer with $n$ units. The output $h_t$ of a hidden layer depends both on the current input $x_t$ and the previous hidden state $h_{t-1}$. However, the output layer $y_t$ of the RNN depends only on the current hidden state. All in all, we can express the structure of the RNN as

$$h_t = \sigma_h(W_{xh} x_t + W_{hh} h_{t-1} + b_h),$$

$$y_t = \sigma_y(W_{hy} h_t + b_y),$$

with activation functions $\sigma_h$ and $\sigma_y$, where $W_{xh}$, $W_{hh}$ and $W_{hy}$ are the input-to-hidden, hidden-to-hidden and hidden-to-output weight matrices, respectively, and $b_h$, $b_y$ denote the hidden and output layer biases. It is important to note that the hidden layer parameters do not depend on the time step but are shared across all RNN slices. The ability to keep information from previous time steps makes the hidden layer work like a memory.
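The recurrence can be written out as a short forward pass. This is a toy sketch: the variable names (`w_xh`, `w_hh`, `w_hy`, `b_h`, `b_y`), the sizes and the random weights are assumptions for illustration, tanh is used as the hidden activation, and the output activation is left as the identity.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, w_xh, w_hh, b_h, w_hy, b_y):
    """Run a vanilla RNN over the sequence; return last hidden state and its output."""
    hidden = [0.0] * len(b_h)
    for x_t in inputs:
        pre = [a + b + c for a, b, c in zip(matvec(w_xh, x_t), matvec(w_hh, hidden), b_h)]
        hidden = [math.tanh(p) for p in pre]  # the same weights are reused at every step
    output = [a + b for a, b in zip(matvec(w_hy, hidden), b_y)]
    return hidden, output


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_IN, D_HID, D_OUT, STEPS = 4, 3, 2, 8  # toy sizes, chosen for illustration
sequence = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(STEPS)]
params = (rand_matrix(D_HID, D_IN, rng), rand_matrix(D_HID, D_HID, rng), [0.0] * D_HID,
          rand_matrix(D_OUT, D_HID, rng), [0.0] * D_OUT)
last_hidden, class_scores = rnn_forward(sequence, *params)
```

The loop accepts a sequence of any length with the same parameters, which is the property that makes RNNs attractive for clips of varying duration.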

Figure 5. M-layered architecture of Recurrent Neural Networks.

The shared parameters of an RNN can be learned by a method called backpropagation through time. Theoretically, the hidden layer allows the network to learn any relations from the past. In practice, however, it turns out that classic RNNs suffer from two problems. By recursively forming derivatives, the gradient may vanish (vanishing gradient) or become too large (exploding gradient), which significantly limits the ability of classic RNNs.
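A deliberately simplified scalar example shows why recursively formed derivatives are problematic. It assumes a linear recurrence $h_t = w \, h_{t-1}$ with a single hypothetical weight $w$, so the gradient of the last state with respect to the first is simply $w$ multiplied by itself once per time step.

```python
def gradient_through_time(recurrent_weight, num_steps):
    """d h_T / d h_0 for the linear scalar recurrence h_t = w * h_{t-1}."""
    gradient = 1.0
    for _ in range(num_steps):
        gradient *= recurrent_weight  # one chain-rule factor per time step
    return gradient


vanishing = gradient_through_time(0.5, 16)  # shrinks towards zero
exploding = gradient_through_time(1.5, 16)  # grows rapidly
```

With 16 steps, a weight of 0.5 already reduces the gradient to roughly 1.5e-5, while a weight of 1.5 inflates it to several hundred. Saturating activations such as tanh add further factors below one, which makes the vanishing case more likely.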

In our experiments, we use two different classical RNNs, based on the hyperbolic tangent activation function and the rectified linear unit (ReLU) activation function, respectively. The ReLU function reduces the problem of vanishing gradients. Generally, we feed the output of the last node to a fully connected layer to obtain a vector whose size equals the number of classes in the dataset. We proceed in the same manner for all other RNN types except for the bidirectional LSTM.

Figure 6. Representation of the internal structure of Vanilla RNN (a), LSTM (b) and GRU (c).

3.3.1. Long Short-Term Memory (LSTM)

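LSTMs (Hochreiter and Schmidhuber, 1997) are recurrent neural networks consisting of a cell, an input gate, a forget gate, and an output gate. The input gate $i_t$ decides how much the current input $x_t$ contributes to the overall output. The cell $c_t$ is responsible for remembering the previous state information, and also uses the results of the forget gate $f_t$, which decides how much of the previous cell $c_{t-1}$ flows into the current cell. As the name suggests, the forget gate can completely erase the previous state if necessary. Finally, the output gate $o_t$ determines the contribution of the current cell $c_t$. All in all, the standard LSTM can be described by the following equations

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g), \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $W_{x\ast}$, $W_{h\ast}$ and $b_\ast$ are the input-to-hidden weights, hidden-to-hidden weights and biases, respectively, $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication. LSTMs reduce the problem of vanishing gradients, because the input and forget gates have the ability to force retention of important information.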
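One LSTM time step can be sketched as follows, using the standard gate formulation. The gate names (`i`, `f`, `o`, `g`), the parameter layout and the toy sizes are assumptions made for this illustration and are not taken from the authors' code.

```python
import math
import random


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def affine(w_x, x, w_h, h, bias):
    """Compute W_x x + W_h h + b row by row."""
    return [sum(a * v for a, v in zip(rx, x)) + sum(a * v for a, v in zip(rh, h)) + b
            for rx, rh, b in zip(w_x, w_h, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to (W_x, W_h, b)."""
    def gate(name, activation):
        w_x, w_h, bias = params[name]
        return [activation(v) for v in affine(w_x, x_t, w_h, h_prev, bias)]

    i_t = gate("i", sigmoid)    # input gate
    f_t = gate("f", sigmoid)    # forget gate
    o_t = gate("o", sigmoid)    # output gate
    g_t = gate("g", math.tanh)  # candidate cell content
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


rng = random.Random(0)
D_IN, D_HID = 4, 3  # toy sizes, chosen for illustration


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


params = {name: (rand_matrix(D_HID, D_IN), rand_matrix(D_HID, D_HID), [0.0] * D_HID)
          for name in ("i", "f", "o", "g")}
h_t, c_t = [0.0] * D_HID, [0.0] * D_HID
for _ in range(8):  # one step per segment
    x_t = [rng.gauss(0, 1) for _ in range(D_IN)]
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)
```

The additive cell update is the key difference from the vanilla RNN: when the forget gate is close to one, the previous cell content passes through almost unchanged, and when it is close to zero, the previous state is erased.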

Figure 7. The Bidirectional LSTM extends regular LSTM by training it both in positive and negative time direction (Schuster and Paliwal, 1997). Each output is obtained by concatenation of the two LSTM outputs that belong to the same time value. We exploit BLSTM structure by concatenating the first and the last halved output, and finally, we apply a fully connected layer and softmax for classification.

3.3.2. Gated Recurrent Units (GRU)

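GRUs (Cho et al., 2014) are very similar to LSTMs and consist of two gates: an update gate $z_t$ and a reset gate $r_t$. However, unlike LSTMs, GRUs do not have their own memory control mechanism. Instead, the entire hidden layer information is directed to the next time step. The advantage of GRUs compared to LSTMs is their simplicity in structure, which significantly reduces the number of parameters to be learned. Mathematically, the structure of a GRU can be described by

$$
\begin{aligned}
z_t &= \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z), \\
r_t &= \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r), \\
\tilde{h}_t &= \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h), \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,
\end{aligned}
$$

where $W_{x\ast}$, $W_{h\ast}$ and $b_\ast$ are the corresponding weights and biases.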

3.3.3. Bidirectional LSTM (BLSTM)

BLSTMs are a special form of LSTMs that are trained in both time directions. The architecture is shown in Figure 7. The input to the fully connected layer is obtained by concatenating two halved outputs, namely the first output of the positive time direction and the last output of the negative time direction. We also investigate the effect of the hidden size by reducing it to half of the hidden size value we used for the other RNN structures. This allows us to make meaningful comparisons with the latter. The reduction of the hidden layer size means that the vector size remains unchanged before the last fully connected layer. Consequently, the same number of output neurons is used for the classification.
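The effect of the gate count and of halving the hidden size on the model size can be checked with a short parameter count. The sketch assumes a hidden size of 256 (the text does not state it), two bias vectors per gate as in the PyTorch convention, a 256-dimensional input and the 27 classes of Jester. The helper names are hypothetical.

```python
def recurrent_params(input_size, hidden_size, num_gates, directions=1):
    """Weights and biases of one recurrent layer.

    Each gate has an input-to-hidden matrix, a hidden-to-hidden matrix and
    two bias vectors (PyTorch convention, assumed here).
    """
    per_gate = hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    return directions * num_gates * per_gate


def classifier_params(in_features, num_classes):
    """Final fully connected layer: weights plus biases."""
    return in_features * num_classes + num_classes


INPUT, HIDDEN, CLASSES = 256, 256, 27  # hidden size 256 is an assumption
totals = {
    "RNN": recurrent_params(INPUT, HIDDEN, 1) + classifier_params(HIDDEN, CLASSES),
    "LSTM": recurrent_params(INPUT, HIDDEN, 4) + classifier_params(HIDDEN, CLASSES),
    "GRU": recurrent_params(INPUT, HIDDEN, 3) + classifier_params(HIDDEN, CLASSES),
    # B-LSTM: half the hidden size per direction, concatenated back to HIDDEN.
    "B-LSTM": (recurrent_params(INPUT, HIDDEN // 2, 4, directions=2)
               + classifier_params(HIDDEN, CLASSES)),
}
millions = {name: round(count / 1e6, 2) for name, count in totals.items()}
```

Under these assumptions the counts come out at about 0.14 M, 0.53 M, 0.40 M and 0.40 M, which agrees with the Params column reported for the RNN-based techniques in Table 2.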

3.4. Fully Convolutional Network (FCN)

| Layer / Stride | Filter size | Output size |
|---|---|---|
| Input | | 1 × N × 256 |
| Conv1 / s(1,2) | 3 × 3 | 64 × N × 128 |
| Conv2 / s(1,2) | 3 × 3 | 64 × N × 64 |
| Conv3 / s(1,2) | 3 × 3 | 128 × N × 32 |
| Conv4 / s(1,2) | 3 × 3 | 128 × N × 16 |
| Conv5 / s(1,2) | 3 × 3 | 256 × N × 8 |
| Conv10 / s(1,1) | 1 × 1 | NumCls × N × 8 |
| AvgPool / s(1,1) | N × 8 | NumCls |

Table 1. Details of the fully convolutional ST modeling architecture.

Figure 8. Histogram of video lengths for (a) Jester-V1 and (b) Something-Something-V2 datasets.

The inputs to the FCN are the concatenated feature vectors of each segment, resulting in a $1 \times N \times 256$ input volume in which each row represents the features from one segment. The input volume enters a series of 2D convolutions with stride $(1, 2)$, which keeps the temporal dimension (i.e., the number of segments) intact throughout the convolution operations. The kernel size is set to $3 \times 3$ with same padding for all convolutions. After applying five convolutions, a 2D convolution with a $1 \times 1$ kernel is applied where the number of channels equals the number of classes. Finally, average pooling with an $N \times 8$ kernel is applied to get class conditional scores. After each convolution, batch normalization and ReLU are applied. The details of the used FCN are given in Table 1.
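The output sizes listed in Table 1 follow from simple shape arithmetic, which the sketch below traces. It assumes that "same" padding yields an output length of ceil(size / stride) along each axis. The function names and the label used for the final convolution are hypothetical.

```python
import math


def same_padding_output(size, stride):
    """Output length along one axis for 'same' padding: ceil(size / stride)."""
    return math.ceil(size / stride)


def fcn_shapes(num_segments, num_classes, feature_dim=256):
    """Trace (channels, segments, features) through the layers of Table 1."""
    layers = [("Conv1", 64), ("Conv2", 64), ("Conv3", 128), ("Conv4", 128), ("Conv5", 256)]
    shape = (1, num_segments, feature_dim)
    trace = [("Input", shape)]
    for name, channels in layers:
        # 3x3 kernel with stride (1, 2): segment axis kept, feature axis halved.
        shape = (channels,
                 same_padding_output(shape[1], 1),
                 same_padding_output(shape[2], 2))
        trace.append((name, shape))
    shape = (num_classes, shape[1], shape[2])  # 1x1 convolution to class channels
    trace.append(("Conv 1x1", shape))
    trace.append(("AvgPool", (num_classes, 1, 1)))  # pooling over the N x 8 map
    return trace


trace = fcn_shapes(num_segments=8, num_classes=27)
```

For eight segments and 27 classes the feature axis shrinks from 256 to 8 over the five strided convolutions while the segment axis stays at 8, so the final average pooling acts on an 8 × 8 map per class.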

3.5. Training Details

Given the ST modeling architecture in Figure 1, the CNN architecture used to extract frame features plays a critical role in the performance of the overall architecture. In order to get CNN-model-agnostic performance of the applied ST modeling techniques, the SqueezeNet and BN-Inception models are used. For both models, features are transformed to 256-dimensional vectors (number-of-classes-dimensional vectors for TSN only) via an MLP after the global pooling layer. For all experiments, CNN models pretrained on the ImageNet dataset are used.

Learning: Stochastic gradient descent (SGD) with momentum and weight decay is applied together with the standard categorical cross-entropy loss. The learning rate is reduced twice by a constant factor, each time after the validation loss converges.

Regularization: Several regularization techniques are applied in order to reduce over-fitting and achieve better generalization. Weight decay is applied to all parameters of the architecture. A dropout layer is applied after the global pooling layer of the 2D CNN architectures. Moreover, data augmentation in the form of multiscale random cropping is applied in all training runs.


The complete ST modeling architecture is implemented and trained (end-to-end) in PyTorch. We make our code publicly available for reproducibility of the results.

4. Experiments

4.1. Datasets

The Jester-V1 dataset is currently the largest hand gesture dataset that is publicly available (1). It is an extensive collection of segmented video clips that contain humans performing pre-defined hand gestures in front of a laptop camera or webcam. The dataset consists of 148092 video clips under 27 classes, which is split into training, validation and test sets containing 118562, 14787 and 14743 videos, respectively. For the experiments, the validation set is used as the labels of the test set are not made available by dataset providers.

The Something-Something-V2 dataset is a collection of segmented video clips that show humans performing pre-defined basic actions with everyday objects (Goyal et al., 2017). It allows researchers to develop machine learning models capturing a fine-grained understanding of basic actions. The dataset consists of 220847 video clips under 174 classes, which is split into training, validation and test sets containing 168913, 24777 and 27157 videos, respectively. For the experiments, the validation set is used as the labels of the test set are not made available by dataset providers.

The histograms for the duration of video clips are given in Fig. 8 (a) and Fig. 8 (b) for the Jester and Something-Something datasets, respectively. The duration of gesture clips in the Jester dataset is concentrated between 30 and 40 frames. However, the Something-Something dataset has videos whose temporal dimension varies more widely, between 20 and 70 frames, which is the reason why 3D CNN architectures accepting fixed-size inputs are not suitable for this benchmark. In order to recognize video clips correctly, the used architectures should incorporate information coming from all parts of the videos.

| Model | MFLOPs | Params | Jester (8 seg.), SqueezeNet | Jester (8 seg.), BN-Inception | Something (8 seg.), SqueezeNet | Something (8 seg.), BN-Inception | Something (16 seg.), SqueezeNet | Something (16 seg.), BN-Inception |
|---|---|---|---|---|---|---|---|---|
| Simple-MLP | 2.13 | 1.06M | 87.28 | 92.80 | 31.89 | 46.35 | 33.96 | 47.01 |
| TSN | 0.001 | 0.00M | 72.84 | 82.74 | 20.91 | 37.28 | 22.15 | 36.22 |
| TRN-multiscale | 11.95 | 2.34M | 88.39 | 93.20 | 33.73 | 46.91 | 34.38 | 47.73 |
| RNN_tanh | 3.16 | 0.14M | 70.51 | 79.53 | 16.12 | 25.17 | 14.48 | 21.64 |
| RNN_ReLU | 3.16 | 0.14M | 78.33 | 88.15 | 21.40 | 36.01 | 15.84 | 24.88 |
| LSTM | 8.42 | 0.53M | 84.28 | 90.80 | 25.24 | 39.04 | 28.25 | 42.83 |
| GRU | 6.32 | 0.40M | 83.10 | 90.86 | 25.40 | 40.69 | 30.24 | 43.31 |
| B-LSTM | 6.33 | 0.40M | 84.87 | 91.12 | 25.04 | 39.35 | 27.88 | 42.41 |
| FCN | 39.07 | 0.56M | 88.11 | 93.64 | 27.72 | 39.17 | 29.95 | 40.59 |

Table 2. Comparison of different ST modeling techniques in terms of classification accuracy (%), number of parameters and computational complexity (i.e., number of floating-point operations, FLOPs). Methods are evaluated using 8 and 16 segments on the validation sets of the Jester-V1 and Something-Something-V2 datasets. The number of parameters and FLOPs are calculated for the ST modeling blocks only, excluding the CNN feature extractors, for the Jester dataset.

4.2. Resource Efficiency Analysis

For real-time systems, the resource efficiency of the applied ST modeling techniques is as essential as the achieved classification accuracy. Therefore, we have investigated the number of parameters and floating-point operations (FLOPs) of each technique.

Out of all ST modeling techniques, TSN comes for free since it requires no parameters and only performs an averaging operation. However, temporal information is lost due to averaging, which results in inferior performance compared to the simple-MLP or TRN-multiscale techniques.

In terms of number of parameters, TRN-multiscale requires the highest number with 2.34 M parameters, as it incorporates multi-frame relations with MLP blocks. In terms of FLOPs, FCN requires the highest number of operations with 39.07 MFLOPs. However, the resource efficiency of the feature extractors (i.e., 2D CNNs) is also important. The BN-Inception architecture contains 11.30 M parameters and requires 1894 MFLOPs to extract the features of a frame. On the other hand, the SqueezeNet architecture contains only 1.24 M parameters and requires 338 MFLOPs to extract the features of a same-sized frame.
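To put the cost of the ST modeling block into perspective, the per-clip cost can be estimated from the figures above. The sketch assumes one backbone forward pass per segment with eight segments; the dictionary and function names are hypothetical, and the MFLOPs values are those reported in Table 2 and in this section.

```python
BACKBONE_MFLOPS = {"SqueezeNet": 338.0, "BN-Inception": 1894.0}  # per frame
ST_BLOCK_MFLOPS = {"TSN": 0.001, "Simple-MLP": 2.13, "TRN-multiscale": 11.95, "FCN": 39.07}


def clip_cost(backbone, technique, num_segments=8):
    """Total MFLOPs per clip and the share spent in the ST modeling block."""
    st_cost = ST_BLOCK_MFLOPS[technique]
    total = num_segments * BACKBONE_MFLOPS[backbone] + st_cost
    return total, st_cost / total


fcn_total, fcn_share = clip_cost("SqueezeNet", "FCN")
trn_total, trn_share = clip_cost("BN-Inception", "TRN-multiscale")
```

Even for FCN, the most expensive ST block, combined with the lightweight SqueezeNet backbone, the block accounts for under 2% of the total operations per clip, so the choice of feature extractor dominates the computational budget.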

4.3. Results Using Jester Dataset

For the Jester dataset, the spatial content is the same for all classes: a hand in front of a camera performing a gesture. Therefore, a designed architecture should capture the form, position, and motion of the hand in order to recognize the correct class.

Comparative results of different ST modeling techniques for the Jester dataset can be found in Table 2. Following (Köpüklü et al., 2018), we use eight segments for this benchmark, as this setting achieves the best performance for the MFF architecture. Compared to BN-Inception, architectures with SqueezeNet have roughly 5 to 10 percentage points lower classification accuracy for the same ST modeling technique. However, the relative ranking of the techniques remains similar within the same 2D CNN backbone.
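The backbone gap can be recomputed directly from the Jester columns of Table 2. The snippet below copies those accuracies into a dictionary (the variable names are hypothetical) and derives the per-technique difference and the ranking with the BN-Inception backbone.

```python
# technique: (SqueezeNet, BN-Inception) accuracy in %, Jester, 8 segments (Table 2)
JESTER_ACCURACY = {
    "Simple-MLP": (87.28, 92.80),
    "TSN": (72.84, 82.74),
    "TRN-multiscale": (88.39, 93.20),
    "RNN_tanh": (70.51, 79.53),
    "RNN_ReLU": (78.33, 88.15),
    "LSTM": (84.28, 90.80),
    "GRU": (83.10, 90.86),
    "B-LSTM": (84.87, 91.12),
    "FCN": (88.11, 93.64),
}

# Accuracy gained by switching from SqueezeNet to BN-Inception, in percentage points.
gaps = {name: round(bn - squeeze, 2) for name, (squeeze, bn) in JESTER_ACCURACY.items()}
smallest_gap, largest_gap = min(gaps.values()), max(gaps.values())

# Techniques ordered by BN-Inception accuracy, best first.
ranking = sorted(JESTER_ACCURACY, key=lambda name: JESTER_ACCURACY[name][1], reverse=True)
```

The gaps range from 4.81 points (TRN-multiscale) to 9.90 points (TSN), and the two best techniques with BN-Inception are FCN and TRN-multiscale.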

Out of all ST modeling techniques, TRN-multiscale and FCN stand out for classification accuracy. Considering resource efficiency, the simple-MLP model can also be preferred over TRN-multiscale. Surprisingly, RNN-based methods, which first come to mind for modeling sequences, perform worse than these techniques. Within the RNN-based techniques, the gated variants (LSTM, GRU, and B-LSTM) perform similarly and clearly better than the vanilla RNNs. Among the vanilla RNNs, the RNN with tanh activation performs the worst. As expected, TSN also yields low classification accuracy, as the averaging operation causes a loss of temporal information.

4.4. Results Using Something-Something Dataset

Compared to the Jester dataset, the Something-Something dataset contains many more classes with more complex spatial content. In order to identify the correct class label, the designed architectures need to extract the spatial content and temporally link this content successfully. Therefore, the frame feature extractors (i.e., 2D CNNs) are critical for the overall performance.

Comparative results of different ST modeling techniques for the Something-Something dataset can be found in Table 2. Besides 8-segment architectures, we have also run experiments with 16-segment architectures, as the spatial complexity of the dataset is higher compared to Jester. Due to this complexity, architectures with SqueezeNet have roughly 10 to 15 percentage points lower classification accuracy compared to architectures with BN-Inception. However, similar to the Jester dataset, the relative ranking of the techniques remains similar within the same 2D CNN backbone.

Most techniques perform better with 16 segments than with 8; the vanilla RNNs, and TSN with BN-Inception, are the exceptions. Out of all ST modeling techniques, TRN-multiscale again stands out for classification accuracy. However, FCN cannot match the outstanding performance it reached on the Jester dataset and, with BN-Inception at 16 segments, performs worse than GRU and LSTM. Within the RNN-based techniques, LSTM, GRU, and B-LSTM again perform better than the vanilla RNNs. Similar to the Jester dataset, the vanilla RNN with tanh activation and TSN yield low classification accuracy.
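The effect of doubling the number of segments can be read off Table 2 with a few lines of code. The snippet uses the BN-Inception columns for Something-Something; the variable names are hypothetical.

```python
# technique: (8 segments, 16 segments) accuracy in %, BN-Inception (Table 2)
SOMETHING_BN = {
    "Simple-MLP": (46.35, 47.01),
    "TSN": (37.28, 36.22),
    "TRN-multiscale": (46.91, 47.73),
    "RNN_tanh": (25.17, 21.64),
    "RNN_ReLU": (36.01, 24.88),
    "LSTM": (39.04, 42.83),
    "GRU": (40.69, 43.31),
    "B-LSTM": (39.35, 42.41),
    "FCN": (39.17, 40.59),
}

# Change in accuracy when moving from 8 to 16 segments, in percentage points.
delta = {name: round(s16 - s8, 2) for name, (s8, s16) in SOMETHING_BN.items()}
improved = sorted(name for name, change in delta.items() if change > 0)
degraded = sorted(name for name, change in delta.items() if change < 0)
```

With this backbone, the gated RNNs gain the most from the additional segments (between about 2.6 and 3.8 points), whereas TSN and both vanilla RNNs lose accuracy, RNN_ReLU by more than 11 points.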

5. Conclusion

In this work, we have analyzed various techniques for CNN-based spatiotemporal modeling and compared them based on a consistent 2D CNN feature extraction of sparsely sampled frames. The individual methods were then evaluated on the Jester and Something-Something datasets. It has been shown that the CNN models used for feature extraction and the number of frames sampled affect the results. For the Jester dataset, the TRN and the FCN models achieved the best results using both SqueezeNet and BN-Inception. On the Something-Something dataset, on the other hand, the TRN model clearly outperformed all other models. It has also been shown that simple vanilla RNNs are unable to capture the complex spatiotemporal relationships of the data. All the more complex RNNs tested perform very similarly.

Interestingly, the TSN model, which showed state-of-the-art performance on the UCF-101 and HMDB benchmarks, performs rather poorly in our experiments, which shows the importance of maintaining the temporal information. Among the models tested, TRN requires the highest number of parameters, and FCN is the most expensive in terms of floating-point operations. While some models like TRN, LSTM, GRU, and B-LSTM can benefit from an increase in the number of segments, vanilla RNNs and the TSN model can suffer from overfitting. One possibility for future research would be to develop TRN-like models that achieve good results with fewer parameters and floating-point operations while keeping the number of required segments small.


  • [1] (2019) Note: Cited by: §4.1.
  • M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt (2011) Sequential deep learning for human action recognition. In International workshop on human behavior understanding, pp. 29–39. Cited by: §2.
  • J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In

    proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 6299–6308. Cited by: §1.
  • K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. pp. 103–111. Cited by: §3.3.2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
  • J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. pp. 2625–2634. Cited by: §2.
  • C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1933–1941. Cited by: §2.
  • R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017) The "something something" video database for learning and evaluating visual common sense. 1 (2), pp. 3. Cited by: §4.1.
  • K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6546–6555. Cited by: §1, §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.3.1.
  • F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. Cited by: §3.1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.1.
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §2.
  • O. Köpüklü, N. Kose, A. Gunduz, and G. Rigoll (2019) Resource efficient 3d convolutional neural networks. arXiv preprint arXiv:1904.02422. Cited by: §2.
  • O. Köpüklü, N. Kose, and G. Rigoll (2018) Motion fused frames: data level fusion strategy for hand gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2103–2111. Cited by: §3.2.1, §4.3.
  • O. Köpüklü and G. Rigoll (2018) Analysis on temporal dimension of inputs for 3d convolutional neural networks. In 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), pp. 79–84. Cited by: §2.
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pp. 2556–2563. Cited by: §3.2.2.
  • C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §1.
  • J. Liu, A. Shahroudy, D. Xu, and G. Wang (2016) Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision, pp. 816–833. Cited by: §2.
  • P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo (2018) Multi-level wavelet-cnn for image restoration. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §1.
  • A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967–4976. Cited by: §3.2.3.
  • M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681. Cited by: Figure 7.
  • K. Simonyan and A. Zisserman (2014) Two-Stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 568–576. Cited by: §1, §2.
  • K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §1, §3.2.2.
  • L. Sun, K. Jia, D. Yeung, and B. E. Shi (2015) Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4597–4605. Cited by: §2.
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §2.
  • V. Veeriah, N. Zhuang, and G. Qi (2015) Differential recurrent neural networks for action recognition. In Proceedings of the IEEE international conference on computer vision, pp. 4041–4049. Cited by: §2.
  • L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Computer Vision – ECCV 2016, pp. 20–36. Cited by: §1, §2, §2, §3.2.2.
  • B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 803–818. Cited by: §2, §2, §3.2.3.