
IF-TTN: Information Fused Temporal Transformation Network for Video Action Recognition

by Ke Yang, et al.

Effective spatiotemporal feature representation is crucial to video-based action recognition. Focusing on discriminative spatiotemporal feature learning, we propose the Information Fused Temporal Transformation Network (IF-TTN) for action recognition, built on top of the popular Temporal Segment Network (TSN) framework. In the network, an Information Fusion Module (IFM) fuses the appearance and motion features at multiple ConvNet levels for each video snippet, forming a short-term video descriptor. With the fused features as inputs, Temporal Transformation Networks (TTN) model the middle-term temporal transformation between neighboring snippets in sequential order. As TSN itself depicts long-term temporal structure by segmental consensus, the proposed network considers temporal features at multiple granularities. IF-TTN achieves state-of-the-art results on the two most popular action recognition datasets, UCF101 and HMDB51. Empirical investigation reveals that our architecture is robust to the quality of the input motion map: when optical flow is replaced with motion vectors from the compressed video stream, performance remains comparable to flow-based methods while testing is about 10x faster.





1 Introduction

Video action recognition has been widely studied by the computer vision community [15, 20], as it can be applied in many areas such as intelligent video surveillance and human behavior analysis. Since CNNs have achieved great success in image classification [9, 16, 10] and video action recognition can be treated as a classification task, many CNN-based action recognition methods have been proposed [20, 27, 15, 2]. Unlike image classification, video action recognition also depends critically on temporal information: appearance and dynamics are crucial and complementary aspects. Performance therefore depends heavily on how an algorithm uses the relevant temporal information together with spatial features. Many CNN-based methods classify videos according to their spatiotemporal features [15, 20, 23, 27, 2]. Among them, Two-Stream CNN [15] and C3D [20] are the most representative.

Figure 1: Class visualization of TSN and TTN models using DeepDraw on two action categories: “HighJump” and “GolfSwing”. For each category, the visualized images are shown in the bottom row, and the RGB images most similar to them are placed in the top row. The results of TSN are shown in the left column, and the results of TTN are shown in the middle and right columns, since TTN takes snippets from two adjacent segments as input. TTN captures the temporal order between snippets, while TSN mainly relies on the object and scene.

In a common Two-Stream CNN framework, appearance and dynamics are decoupled, so valuable connections between them are lost during feature learning. Intuitively, humans identify an action in a video mainly by recognizing dynamics over appearances, that is, the motion of objects, rather than by recognizing dynamics and appearances separately. C3D encodes appearance and motion information simultaneously by applying 3D convolution to multiple consecutive video frames, but its performance lags behind Two-Stream based methods, which suggests that effective fusion of spatial and temporal features remains an open problem.

Besides spatial and temporal feature fusion, temporal order modeling is also understudied. Most existing works rely on short-term temporal modeling because of their small temporal receptive windows and consecutive sampling strategies. To model long-term temporal structure, Temporal Segment Network [27] sparsely samples frames and aggregates snippet features over the whole video. However, it treats a video as a bag of snippets and does not capture the temporal order that reflects transformations between video snippets.

In this paper, we propose the Information Fused Temporal Transformation Network (IF-TTN) for video action recognition, based on the Temporal Segment Network (TSN) framework. To extract more effective spatiotemporal features, we propose an Information Fusion Module (IFM) that fuses the appearance and motion features at multiple ConvNet levels for each video snippet. The fused features depict what (captured by the spatial stream) moves in which way (captured by the temporal stream). On top of the fused features, we design a Temporal Transformation Network (TTN) to model the temporal order between neighboring snippets.

The visualization results of TTN are shown in Figure 1. TTN does learn the transformation between neighboring snippets. Taking “HighJump” in Figure 1 as an example, TTN models the transformation between a person running toward a high jump crossbar and the person falling onto the mat after clearing the crossbar. The generated image for the first snippet depicts the running person, and the generated image for the second snippet depicts the mat on which the jumper lands. Because the network is largely invariant to translation and scale, objects in the generated images, such as people and the mat, appear at different scales and spatial locations, which makes the generated images look cluttered.

In addition, these temporal transformations between snippets describe the mid-term temporal structure of a video, which is complementary to the short-term temporal descriptor and the long-term temporal structure. Therefore, our network considers temporal features at multiple granularities. In our work, the structured modeling of temporal features and the complementary fusion of multi-level spatial features reduce the dependency on the quality of the motion input. When optical flow is replaced with motion vectors from the compressed video stream, performance remains competitive with optical-flow-based methods while testing is about 10x faster. Our contributions can be summarized as follows:

  • We design IFM to fuse two stream features. This design involves appearance and motion information simultaneously and further benefits the temporal modeling.

  • We design TTN to model the temporal order of video snippets at multiple feature levels.

  • We combine IFM and TTN to form IF-TTN, which can be trained end to end. IF-TTN is robust to the quality of the motion input owing to effective feature learning, which makes it practical in real-time scenarios. IF-TTN achieves state-of-the-art results on both non-real-time and real-time action recognition tasks.

Figure 2: Overall architecture of IF-TTN. A video is divided into several segments (three in this illustration). From each segment, one video frame is randomly sampled to represent that segment. These frames are arranged in strict temporal order and passed through the Two-Stream CNN to extract appearance and temporal features at multiple network stages. The two-stream features are then fused by IFM and fed to TTN. TTN takes features from the previous and next sampled video frames as input and models the temporal order between them.

2 Related Work

Action recognition: Improved Dense Trajectory Feature (iDTF) [23, 24] long held a dominant position in the field of action recognition. More recently, 2D Convolutional Neural Networks trained on ImageNet [14] were employed to classify RGB frames, but their performance on video classification was limited because they capture only appearance information. To model motion information, Two-Stream CNN was proposed and obtained a significant boost in performance by taking both RGB images and optical flow as inputs. To model spatiotemporal features better, Tran et al. proposed a 3D CNN architecture called C3D to extract high-level spatiotemporal semantics directly from raw videos [20], and later proposed Res3D to further improve recognition performance [21]. To take advantage of both Two-Stream CNN and 3D CNN, the Two-Stream Inflated 3D CNN (I3D) was proposed, which allows initialization with ImageNet pre-trained weights [2].

Temporal Structure Modeling: Many works have been dedicated to modeling the temporal structure for action recognition [13, 7, 25, 27]. With the development of deep learning, many recent works model the temporal structure through network design. Temporal Segment Network (TSN) [27] was proposed to model temporal structure over entire videos in an end-to-end manner. However, TSN does not capture the temporal order of video frames. Zhou et al. proposed the Temporal Relation Network (TRN) [33] to learn and reason about temporal dependencies between video frames at multiple time scales. In [30] and [3], Long Short-Term Memory (LSTM) networks were used to capture long-range dynamics for action recognition.

Real-time action recognition: State-of-the-art video understanding methods rely heavily on optical flow, whose heavy computation cost prevents real-time implementation. A few works address real-time video understanding by replacing the costly optical flow with low-cost motion representations. Bilen et al. proposed the dynamic image (DI) [1] to simulate motion information, and Sun et al. proposed the Optical Flow guided Feature (OFF) [19] to model short-term temporal variation (e.g. over a temporal length of about 7 frames; this figure is calculated from the training strategy of OFF. The TSN Two-Stream CNN uses 5 stacked optical flow frames to model short-term motion, so 7 frames also count as short-term motion). The Motion Vector (MV) is a coarse representation of motion, but it can be obtained directly from compressed video streams without extra calculation. Therefore, Enhanced Motion Vectors CNN (EMV-CNN) [32] used motion vectors as the input of the temporal CNN to improve inference speed, and CoViAR [29] adopted an accumulated motion vector for real-time action recognition. Because MV lacks fine motion detail, recognition performance degraded dramatically: both EMV-CNN and CoViAR were far behind Two-Stream CNN with optical flow.

The works most similar to ours are [5, 6] and [32]. The work in [5] studied the additive fusion of the spatial and temporal features of Two-Stream CNN, and the follow-up work [6] studied multiplicative fusion. In comparison, our contribution is a more general and effective fusion module that applies additive and multiplicative interactions jointly. Moreover, we weight the interaction terms adaptively during fusion and balance their impact by learning the weight parameters. Experimental results, reported in Section 4.3, show that our fusion module performs better than the fusion types used in [5, 6].

EMV-CNN [32] was the first to use motion vectors as the motion representation for real-time action recognition. It replaced the optical flow in Two-Stream CNN with motion vectors and developed transfer techniques to enhance the MV-CNN, but its performance was much lower than that of state-of-the-art optical-flow-based methods. In this paper, we show experimentally that a motion-vector-based Two-Stream CNN can achieve performance comparable to optical-flow-based methods if it adopts effective feature learning rather than simply substituting motion vectors. We build a model with structured temporal modeling and complementary feature fusion to make the network tolerant to low-quality motion input. Experimental results show that our network is highly tolerant to the quality of the motion input thanks to the combination of short-term spatiotemporal feature fusion, sequential middle-term temporal modeling, and long-term temporal consensus.

3 Information Fused Temporal Transformation Network

In this section, we describe the Information Fusion module (IFM) that fuses the features from Two-Stream CNN, the Temporal Transformation Network (TTN) and the real-time adaption of our network.

The overall network architecture of IF-TTN is shown in Figure 2. Given a video $V$ containing $T$ frames, we first divide it into $N$ segments $\{S_1, S_2, \dots, S_N\}$. From the $n$-th segment $S_n$, we randomly sample one frame $t_n$, called a snippet. The set of sampled frames $\{t_1, t_2, \dots, t_N\}$ is fed into the deep feature extractor. Each snippet is processed by the Two-Stream CNN to extract features $A_n = \mathcal{F}_A(t_n; W_A)$ and $M_n = \mathcal{F}_M(t_n; W_M)$, where $A_n$ and $M_n$ are the features extracted by the spatial and temporal stream networks, respectively, and $\mathcal{F}_A$ and $\mathcal{F}_M$ are the functions representing the spatial stream CNN and the temporal stream CNN with parameters $W_A$ and $W_M$, respectively.

Instead of using features only from the final convolutional layer, we take features from multiple stages of the CNN to encode snippets at multiple spatial scales. Assuming that the CNN has $S$ stages, $A_n = \{A_n^s\}$ and $M_n = \{M_n^s\}$ for $s = s_0, \dots, S$, where $s_0$ represents the stage from which we start to extract features.
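The sparse sampling step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `sample_snippets`, the equal-length integer split of the frame range, and the use of Python's `random` module are all assumptions made for the example.

```python
import random


def sample_snippets(num_frames, num_segments, rng=None):
    """Return one randomly chosen frame index per segment, in temporal order.

    The frame range [0, num_frames) is split into `num_segments` contiguous,
    roughly equal-length segments (an assumed splitting rule).
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for n in range(num_segments):
        # Integer arithmetic keeps the segment boundaries exact.
        start = n * num_frames // num_segments
        end = (n + 1) * num_frames // num_segments
        indices.append(rng.randrange(start, end))
    return indices


# Example: a 70-frame video split into 7 segments of 10 frames each.
snippets = sample_snippets(70, 7, rng=random.Random(0))
print(snippets)
```

Because each index is drawn from its own segment, the returned list is always strictly increasing, which preserves the temporal order that the TTN relies on.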

3.1 Information fusion module

Figure 3: Illustration of information fusion module. (a) Attention based. (b) Adaptive fusion.

Feature fusion: We fuse the features of the spatial and temporal networks to generate an efficient and compact representation for each snippet. Given the spatial features $A_n^s$ and the temporal features $M_n^s$ of a snippet, we obtain the fused features with an Information Fusion Module (IFM):

$$F_n^s = \mathcal{H}(A_n^s, M_n^s), \tag{1}$$

where $\mathcal{H}$ represents the fusion function and $F_n^s$ denotes the fused features for the $n$-th segment at the $s$-th CNN stage.

Figure 4: Illustration of temporal structure modeling. We only show 3 segments for convenience. (a) Temporal Segment Network (TSN). (b) Temporal Relation Network (TRN) in [33]. (c) Our Temporal Transformation Network (TTN).

We investigate two implementations of fusion modules:

(1) Attention based fusion module: A common insight is that the temporal feature maps can act as attention maps for the corresponding spatial feature maps, because optical flow can locate human foreground areas and is invariant to appearance. Besides the motion patterns, the scenes and objects in the spatial stream are also important for classification, especially for actions with subtle motions. For example, recognizing the musical instrument is important for recognizing the action of playing it. Taking these two factors into account, we formulate the fusion function as:

$$\mathcal{H}(A, M) = A \odot M + A, \tag{2}$$

where $A$ and $M$ denote the spatial and temporal feature maps and $\odot$ corresponds to element-wise multiplication. The added $A$ can also be viewed as a residual term. In other words, we enhance the features of interest rather than removing the features that are not attended.

(2) Adaptive fusion module: A series of recent works [6, 5] study the fusion of the spatial and temporal streams of Two-Stream CNN. In these works, additive [5] and multiplicative [6] interactions were considered separately. We propose an adaptive fusion that covers interactions on both the additive and the multiplicative scale. The fused feature is a weighted combination of the interaction terms, and the impact of each term is balanced by learning its weight. The weights are learnable parameters that are updated together with the rest of the network. Comparing Equations 2 and 3, the attention based fusion module is a special case of the adaptive fusion module, obtained for one particular fixed choice of the weights.
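The relationship between the two fusion variants can be made concrete with a small sketch. The attention form below (element-wise product plus a spatial residual) follows the description in the text. The three-weight adaptive form is only an assumed parameterization chosen for illustration, since the exact equation is not recoverable here; the function names and the flat-list feature layout are likewise hypothetical.

```python
def attention_fusion(spatial, temporal):
    """Temporal features gate the spatial ones; the spatial map is kept as a residual."""
    return [a * m + a for a, m in zip(spatial, temporal)]


def adaptive_fusion(spatial, temporal, w_spatial, w_temporal, w_product):
    """ASSUMED form: weighted sum of two additive terms and one multiplicative term.

    In a real network the three weights would be learnable parameters;
    here they are plain floats passed in by the caller.
    """
    return [
        w_spatial * a + w_temporal * m + w_product * a * m
        for a, m in zip(spatial, temporal)
    ]


# Flattened toy feature maps (real ones would be C x H x W tensors).
spatial = [0.5, 2.0, -1.0]
temporal = [0.0, 1.0, 4.0]

fused_attention = attention_fusion(spatial, temporal)
# Under the assumed form, weights (1, 0, 1) reproduce attention fusion.
fused_adaptive = adaptive_fusion(spatial, temporal, 1.0, 0.0, 1.0)
print(fused_attention, fused_adaptive)
```

Note that where the temporal response is zero, the attention form returns the spatial feature unchanged, which is the "enhance rather than remove" behavior described above.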

After fusion, the fused features of each snippet are ready to be fed to the TTN.

3.2 Temporal transformation network

Given the sequence of fused features, one per snippet, the Temporal Transformation Network (TTN) models the pairwise temporal transformations between them. Its output is the set of TTN features over the whole video, produced by a transformation function with its own network parameters. TTN integrates the fused features of ordered snippet pairs.

We construct TTN using a standard CNN architecture and take snippet features from multiple stages as inputs, as shown in Figure 2. In this way, low-level detailed features are kept while the temporal relation is explored. To simplify the network structure, we only keep the feature pairs from adjacent segments, that is, each snippet is paired with the snippet that immediately follows it.

Between every two stages of the TTN, Temporal Transformation Modules (TTM) merge the features from adjacent segments. Figure 5 shows the data flow of the TTM. For a given stage and a pair of adjacent segments, the TTM takes the fused features of the two segments at that stage together with the output features of the preceding TTN stage, and produces the input of the next TTN stage. The core of the module is a temporal transformation operator that models the temporal order of the video segments. Image differences are commonly used to model appearance change. Inspired by this, we use the feature difference between the two adjacent segments to reflect the ordered temporal transformation. Because this operation does not change the shape of the feature map, it allows us to use pre-trained weights for the TTN.
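A toy sketch can show why a feature difference is sensitive to snippet order while an order-agnostic consensus such as averaging is not. The direction of the subtraction (next minus previous), the function names, and the flat-list feature layout are assumptions for illustration; the text only states that a feature difference is used.

```python
def feature_difference(prev_features, next_features):
    """Order-sensitive operator: element-wise (next - previous), an assumed direction."""
    return [b - a for a, b in zip(prev_features, next_features)]


def average_consensus(prev_features, next_features):
    """Order-agnostic aggregation, in the spirit of a segmental consensus."""
    return [(a + b) / 2 for a, b in zip(prev_features, next_features)]


# Toy fused features for two adjacent snippets, e.g. "running" then "falling".
running = [1.0, 0.0, 2.0]
falling = [0.0, 3.0, 2.0]

forward = feature_difference(running, falling)
backward = feature_difference(falling, running)
print(forward, backward)  # swapping the snippets flips the sign
print(average_consensus(running, falling) == average_consensus(falling, running))
```

The difference has the same length as its inputs, which mirrors the shape-preserving property that lets the TTN reuse pre-trained weights.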

Figure 5: Illustration of TTM. This figure shows the TTM between two consecutive network stages.

Similar to our TTN, Zhou et al. [33] proposed the Temporal Relation Network (TRN), built on top of TSN, to model the pairwise temporal relations between ordered frames. However, TRN only uses the features extracted from the last fully connected layer of the CNN and deploys simple multilayer perceptrons (MLP) to model the relations. As a result, the spatial and low-level detailed features are lost before the features are fed to the relation network. The differences among TSN, TRN, and TTN are illustrated in Figure 4.

3.3 Real-time adaption

Networks based on Two-Stream CNN achieve superior recognition accuracy. However, the computation cost of optical flow makes it impractical to apply these networks in real-time scenarios. One popular solution is to use alternative motion representations as the input of the temporal stream CNN, which can improve inference speed but may degrade recognition accuracy.

Motion vectors are inherently correlated with optical flow and can be extracted directly from the compressed video stream at little cost. It is therefore natural to ask whether the structured temporal modeling and the complementary feature fusion make IF-TTN tolerant to the low quality of motion vectors.

We directly replace the input of temporal stream network with motion vectors. Before training motion-vector-based IF-TTN, we first train the optical-flow-based IF-TTN, and then initialize the motion-vector-based network with optical flow pre-trained weights. In our implementation, we do not use any image preprocessing techniques to improve the quality of motion vectors.

Since the extraction overhead of motion vectors is negligible, video inference can be conducted at very high speed on a consumer-grade GPU, and the adapted IF-TTN can run in real time.

Motion-vector-based CNNs were proposed in [32], which simply replaced optical flow with motion vectors and transferred knowledge from optical-flow-based networks. Without in-depth exploration of the spatiotemporal structure, their performance was much lower than that of state-of-the-art optical-flow-based methods. Our paper shows that motion-vector-based networks can achieve performance comparable to optical-flow-based networks with effective spatiotemporal feature modeling.

3.4 Training and inference

Training: Action recognition is a multi-class classification problem. We use the standard categorical cross-entropy loss to supervise network optimization. To reduce the difficulty of training, we adopt a progressive multi-stage training strategy. First, we train a standard TSN [27] with a ResNet-50 backbone. Then, we freeze the TSN feature extractor and train the TTN with a training strategy similar to that of TSN. Finally, we fine-tune the whole network jointly.

For the sake of better initialization for temporal network and TTN, following the good practice in [27], we first train the spatial network with ImageNet pre-trained weights. Then, we initialize the temporal network and TTN with pre-trained spatial network weights. This initialization method can speed up the training process and reduce the effect of over-fitting.

Final predictions: As there are multiple classification scores produced by each segment, we first fuse the score of each stream network separately by averaging the scores of all segments. Then, we fuse the scores from Two-Stream CNN and TTN for final predictions.
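The two-step score fusion can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the function names are hypothetical, and the equal default branch weights are an assumption, since the text does not state how the Two-Stream and TTN scores are weighted.

```python
def average_over_segments(segment_scores):
    """Average a list of per-segment class-score lists into one score list."""
    if not segment_scores:
        raise ValueError("need at least one segment")
    count = len(segment_scores)
    return [sum(column) / count for column in zip(*segment_scores)]


def fuse_predictions(spatial, temporal, ttn, weights=(1.0, 1.0, 1.0)):
    """Average each branch over its segments, then combine the branches.

    `weights` are ASSUMED branch weights (spatial, temporal, TTN).
    Returns the predicted class index and the fused score list.
    """
    branches = [average_over_segments(b) for b in (spatial, temporal, ttn)]
    fused = [
        sum(w * s for w, s in zip(weights, class_scores))
        for class_scores in zip(*branches)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


# Toy example with 2 classes; the TTN branch has one score per adjacent pair.
spatial = [[0.5, 0.5], [1.0, 0.0]]
temporal = [[0.0, 1.0], [0.5, 0.5]]
ttn = [[0.0, 1.0]]
predicted_class, fused_scores = fuse_predictions(spatial, temporal, ttn)
print(predicted_class, fused_scores)
```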

4 Experiments

In this section, we first introduce the evaluation datasets and the implementation details of our approach. Then we explore the contribution of each proposed module through ablation experiments. Finally, we compare the performance of our method with state-of-the-art methods.

4.1 Dataset

We evaluate our method on two popular video action recognition datasets: UCF-101 [17] and HMDB-51 [11]. The UCF-101 dataset contains 13,320 video clips from 101 action classes, and the HMDB-51 dataset contains 6,766 video clips from 51 action categories. Our experiments follow the official evaluation scheme, which divides each dataset into three training/testing splits, and we report the average accuracy over these three splits. For optical flow extraction, we use the TVL1 algorithm [31] implemented in OpenCV with CUDA. For motion vector extraction, we use a modified ffmpeg to extract motion vectors directly from the compressed video stream without extra calculation.

4.2 Implementation details

We use ResNet-50 [9] as our TSN backbone for both the temporal and spatial streams. Our TTN is truncated from ResNet-50 and consists of three stages, namely stages 3, 4, and 5 of ResNet-50. TTN does not involve stages lower than stage 3, as fusion at the lower stages might suffer from noise and extremely large feature distances. Truncating the network also greatly reduces the computation cost and GPU memory consumption during training. We set the number of segments to 7 to model the temporal structure. The average segment interval is around 1 second, which is close to the length of an atomic action [8]. Therefore, the transformation between adjacent segments can be regarded as a mid-term motion. The Two-Stream CNN captures temporal structure over a time span of about 0.2 seconds, which can be regarded as a sub-atomic action or a short-term motion.

We use the mini-batch stochastic gradient descent (SGD) algorithm to optimize the network parameters. For the spatial network, we initialize the weights with models pre-trained on ImageNet. The batch size is set to 64 and the momentum to 0.9. The learning rate is initialized to 0.001 and multiplied by 0.1 every 30 epochs, with a maximum of 80 epochs. After training the spatial network, we initialize the temporal network and TTN with the pre-trained spatial network weights. For the temporal network, the learning rate is initialized to 0.001 and multiplied by 0.1 every 100 epochs, with a maximum of 250 epochs. For TTN, the learning rate is initialized to 0.001 and multiplied by 0.1 every 80 epochs, with a maximum of 200 epochs.
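The three learning-rate schedules above can be written as one step-decay function. This is a minimal sketch that assumes the decay is a multiplication by 0.1 at every step boundary (a standard step schedule); the function name `step_lr` and the `SCHEDULES` dictionary are hypothetical helpers, and only the numbers come from the text.

```python
def step_lr(epoch, base_lr=0.001, step_size=30, gamma=0.1):
    """Learning rate at a 0-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# (step size in epochs, maximum number of epochs) for each sub-network.
SCHEDULES = {
    "spatial": (30, 80),
    "temporal": (100, 250),
    "ttn": (80, 200),
}

for name, (step_size, max_epochs) in SCHEDULES.items():
    # Number of decays that actually happen before training stops.
    num_decays = (max_epochs - 1) // step_size
    final_lr = step_lr(max_epochs - 1, step_size=step_size)
    print(f"{name}: {num_decays} decays, final lr = {final_lr:.0e}")
```

Under this reading, each of the three networks decays its learning rate twice before training ends.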

To alleviate over-fitting, we use strong data augmentation and large dropout ratios. For data augmentation, we mainly follow [27] and apply location jittering, horizontal flipping, corner cropping, and scale jittering. The dropout ratio is set to 0.8 for the spatial network and TTN, and to 0.7 for the temporal network.

Figure 6: CAM visualization of the spatial network, the temporal network, and IF-TTN. The first sampled snippet is shown in the first row and the second in the second row. Since IF-TTN takes two snippets as input, the CAMs of the two snippets are the same.

4.3 Exploration Study

In this part, we study the contributions of each module of our approach. All exploration studies are performed on UCF-101 dataset.

Study on IFM: We propose IFM to fuse appearance and motion features for each video snippet at multiple ConvNet levels. To verify the effect of IFM, we conduct experiments under two settings: (1) the spatial and temporal streams are processed separately, and a TTN is applied to each stream; (2) IFM fuses the features from the Two-Stream CNN, and a single TTN is applied to the fused features. All other settings are identical. The results are summarized in Table 1. Both attention based fusion and adaptive fusion improve accuracy by 1.0 percentage point, from 94.0% to 95.0%. We attribute the improvement to the ability of IFM to model better spatiotemporal features of a short video snippet. Since the two types of IFM achieve equal performance, we use the attention based IFM in the following experiments for simplicity. It is worth noting that IF-TTN has only three CNNs, while the separate Two-Stream TTN has four, because both the spatial and temporal networks need their own TTN. Therefore, IF-TTN achieves higher accuracy with fewer parameters.

| Method | Accuracy (%) |
|---|---|
| Separate Two-Stream | 94.0 |
| Attention IFM Two-Stream | 95.0 |
| Adaptive IFM Two-Stream | 95.0 |

Table 1: Comparison of results with and without IFM. Experiments are conducted on UCF101 split 1.

| Method | Accuracy (%) |
|---|---|
| Additive fusion [5] | 93.8 |
| Multiplicative fusion [6] | 94.0 |
| Our IFM | 95.0 |

Table 2: Comparison of results with different fusion types. Experiments are conducted on UCF101 split 1.

| Method | Accuracy (%) |
|---|---|
| Spatial stream CNN | 84.9 |
| Temporal stream CNN | 86.9 |
| Two-stream CNN | 93.1 |
| TTN branch | 92.3 |
| Complete IF-TTN | 95.0 |

Table 3: Ablation study of IF-TTN. Experiments are conducted on UCF101 split 1.

| Method | Optical flow (%) | Motion vectors (%) |
|---|---|---|
| Spatial stream CNN | 84.9 | 84.9 |
| Temporal stream CNN | 86.8 | 82.5 |
| IF-TTN | 95.0 | 94.4 |

Table 4: Accuracy with different motion inputs. Experiments are conducted on UCF101 split 1.

We also perform comparative experiments to test whether our IFM performs better than the fusion modules in [5, 6]. The work in [5] studied the additive fusion of the spatial and temporal features of Two-Stream CNN. The same authors then showed in [6] that multiplicative fusion of the spatial and temporal streams provides a performance boost over the additive formulation. We re-implement IF-TTN with the fusion modules of [5] and [6], and the results are shown in Table 2. Our IFM reaches 95.0%, compared with 93.8% for the additive fusion of [5] and 94.0% for the multiplicative fusion of [6].

| Method | Speed (fps) | Accuracy (%) |
|---|---|---|
| Two-Stream I3D [2] | 14 | 93.4 |
| TSN (RGB+Optical flow) [27] | 14 | 94.0 |
| DIN [1] | 131 | 76.9 |
| C3D [20] | 314 | 82.3 |
| TSN (RGB) [27] | 680 | 85.5 |
| TSN (RGB+RGB Difference) [27] | 340 | 91.0 |
| RGB+EMV-CNN [32] | 390 | 86.4 |
| CoViAR [29] | 240 | 90.4 |
| OFF [19] | 206 | 93.3 |
| MV-IF-TTN | 142 | 94.5 |

Table 5: Accuracy and inference speed comparison. Inference speed is measured in frames per second (fps). Experiments are conducted on UCF101 all splits.

Study on TTN: We report the results of each network branch in Table 3. "TTN branch" means that the predictions are made without ensembling the classification scores of the Two-Stream CNN. All these experiments are carried out within the TSN framework. From Table 3, we conclude that TTN is complementary to the Two-Stream CNN: combining them improves accuracy by 1.9 percentage points, from 93.1% to 95.0%.

Does TTN really learn the order relationship? To verify this, we adopt the DeepDraw [12] toolbox to visualize our TTN models. This tool performs iterative gradient ascent on input images initialized with white noise and, after a number of iterations, outputs a class visualization based solely on the class knowledge inside the CNN model. Since our TTN takes two adjacent snippets as inputs, we adapt DeepDraw to handle two inputs. The visualization results of TSN and our TTN are shown in Figures 1 and 7. From these results, we can observe that TTN indeed learns the temporal transformations between two ordered video frames from adjacent segments. Taking “HighJump” as an example, TTN models the transformation between a person running toward a high jump crossbar and the person falling onto the mat after clearing the crossbar, while TSN mainly relies on the scene, e.g., the high jump mat.

Study on discriminative feature learning: We study whether IF-TTN has learned discriminative spatiotemporal features by visualizing the class-specific discriminative regions. These regions can be derived from a classification network using the Class Activation Maps (CAM) method [34]. We visualize the class-specific discriminative regions of the spatial network, the temporal network, and IF-TTN, and show the results in Figure 6. We can observe that the spatial network mainly focuses on scene information and the temporal network focuses on the short-term motion associated with each snippet, while IF-TTN covers the object regions and captures their motion track between the two snippets.
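For readers unfamiliar with CAM, the core computation is a weighted sum of the last convolutional feature maps, using the classifier weights of the target class. The sketch below illustrates that sum on nested Python lists; the function name and data layout are assumptions for the example, and a practical implementation would use a tensor library and upsample the map to the input resolution.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) for one target class.

    feature_maps: list of K maps, each a list of H rows with W floats.
    class_weights: K classifier weights connecting the pooled maps to the class.
    """
    if len(feature_maps) != len(class_weights):
        raise ValueError("one weight per feature map is required")
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


# Two toy 2x2 feature maps: one fires top-left, the other bottom-right.
maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
cam = class_activation_map(maps, class_weights=[0.5, 1.0])
print(cam)
```

Locations with large values in the resulting map are the regions that contribute most to the class score, which is what the heat maps in Figure 6 display.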

Study on motion representation: We replace optical flow with motion vectors as temporal stream inputs and evaluate the performance of motion vector based IF-TTN.

As shown in Table 4, IF-TTN with motion vectors loses only 0.6 percentage points of accuracy compared with flow-based IF-TTN (94.4% vs. 95.0%). The comparison with other real-time methods is provided in Table 5. DIN denotes the Dynamic Image Network proposed in [1]. TSN (RGB) and TSN (RGB+RGB Difference) are from [27]. OFF denotes the optical flow guided features in [19]. We denote our real-time IF-TTN as MV-IF-TTN. For a more convincing comparison, we include two state-of-the-art optical-flow-based Two-Stream CNNs in the table. Our inference speed is measured on a single CPU core (Intel Core i7-6850K) and a GeForce GTX 1080Ti GPU. From Table 5, when optical flow is replaced with motion vectors as the motion input, IF-TTN achieves a very competitive 94.5% on UCF-101. This is even slightly higher than the optical-flow-based Two-Stream TSN [27], while the inference speed reaches 142 fps, about 10x faster than TSN.
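The headline numbers in this paragraph follow directly from Tables 4 and 5, as the short check below shows. The dictionary names are hypothetical; the values are copied from the tables.

```python
# Inference speed in frames per second, from Table 5 (UCF-101).
FPS = {
    "Two-Stream I3D": 14,
    "TSN (RGB+Optical flow)": 14,
    "MV-IF-TTN": 142,
}

# IF-TTN accuracy (%) on UCF-101 split 1, from Table 4.
ACCURACY_SPLIT1 = {"optical flow": 95.0, "motion vectors": 94.4}


def speedup(fast_fps, slow_fps):
    """How many times faster the first method is than the second."""
    return fast_fps / slow_fps


tsn_speedup = speedup(FPS["MV-IF-TTN"], FPS["TSN (RGB+Optical flow)"])
accuracy_drop = ACCURACY_SPLIT1["optical flow"] - ACCURACY_SPLIT1["motion vectors"]

print(f"speed-up over flow-based TSN: {tsn_speedup:.1f}x")
print(f"accuracy drop with motion vectors: {accuracy_drop:.1f} points")
```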

These results show that our network is highly tolerant to the quality of the motion input thanks to the combination of short-term spatiotemporal feature fusion, sequential middle-term temporal modeling, and long-term temporal consensus. EMV-CNN and CoViAR [32, 29] also use motion vectors, but simply replacing the input without a more effective spatiotemporal representation leaves them significantly behind the optical-flow-based Two-Stream CNN.

Figure 7: Class visualization of TSN and TTN using DeepDraw on action categories: “CleanAndJerk” and “VolleyballSpiking”. The images are arranged in the same way as in Figure 1. The black image indicates that there is no obvious corresponding RGB image to the generated one.

4.4 Comparison with the state of the art

In this subsection, we compare IF-TTN with the state of the art. All results are evaluated on HMDB-51 and UCF-101 over all three splits and are shown in Table 6.

The upper part of Table 6 shows non-real-time methods, while the lower part presents real-time methods. Note that for the non-real-time setting we ensemble the scores of the optical-flow-based and motion-vector-based IF-TTN to make final predictions (denoted as Full IF-TTN).

We compare our method with both traditional approaches, such as iDT [24], and deep learning based methods, such as Two-Stream CNN [15], C3D [20], TSN [27], Temporal Deep convolutional Descriptors (TDD) [26], Long-term Temporal CNN [22], Spatiotemporal Pyramid Network [28], Spatiotemporal Multiplier Network [6], Spatiotemporal Vector of Locally Max Pooled Features (ST-VLMPF) [4], Lattice LSTM [18], Inflated 3D CNN (I3D) [2], and Optical Flow guided Features (OFF) [19]. Our full IF-TTN achieves state-of-the-art results on both datasets. It is especially worth noting that MV-IF-TTN significantly outperforms the previous real-time methods.

Method                                  UCF-101   HMDB-51
iDT [24]                                86.4      61.7
Two-Stream CNN [15]                     88.0      59.4
TDD [26]                                91.5      65.9
Long Term Convolution [22]              91.7      64.8
Spatiotemporal Pyramid Network [28]     94.6      68.9
Spatiotemporal Multiplier Network [6]   94.2      68.9
Two-Stream TSN [27]                     94.0      68.5
ST-VLMPF [4]                            93.6      69.5
Two-Stream I3D [2]                      93.4      66.4
Lattice LSTM [18]                       93.6      66.2
Full OFF [19]                           96.0      74.2
Full IF-TTN                             96.2      74.8
---------------------------------------------------------
C3D [20]                                82.3      -
TSN (RGB) [27]                          85.7      51.0
TSN (RGB+RGB Difference) [27]           91.0      -
RGB+EMV-CNN [32]                        86.4      53.0
CoViAR [29]                             90.4      59.1
Real-time OFF [19]                      93.3      -
MV-IF-TTN                               94.5      70.0
Table 6: Comparison with state-of-the-art results (accuracy, %). Experiments are conducted on UCF-101 and HMDB-51 over all three splits. '-' indicates that the paper did not report the corresponding result.
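The margins behind the real-time comparison can be read directly from Table 6. The short Python sketch below uses only the accuracies listed in the table to find the strongest previously reported real-time result in each column and its gap to MV-IF-TTN. The helper name `margin_over_best` is hypothetical and introduced purely for illustration.

```python
# Accuracy (%) copied from the real-time rows of Table 6 as
# (UCF-101, HMDB-51); None marks results the papers did not report.
real_time = {
    "C3D": (82.3, None),
    "TSN (RGB)": (85.7, 51.0),
    "TSN (RGB+RGB Difference)": (91.0, None),
    "RGB+EMV-CNN": (86.4, 53.0),
    "CoViAR": (90.4, 59.1),
    "real-time OFF": (93.3, None),
}
mv_if_ttn = (94.5, 70.0)


def margin_over_best(ours_acc, baselines, column):
    """Return (best baseline name, margin in points) for one dataset column."""
    reported = {
        name: acc[column]
        for name, acc in baselines.items()
        if acc[column] is not None  # skip unreported results
    }
    best_name = max(reported, key=reported.get)
    return best_name, round(ours_acc - reported[best_name], 1)


ucf_best, ucf_margin = margin_over_best(mv_if_ttn[0], real_time, 0)
hmdb_best, hmdb_margin = margin_over_best(mv_if_ttn[1], real_time, 1)
```

Under this reading, MV-IF-TTN leads the best previously reported real-time result by 1.2 points on UCF-101 (over real-time OFF) and by 10.9 points on HMDB-51 (over CoViAR). The HMDB-51 comparison covers only the methods that report a result on that dataset.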

5 Conclusion

In this paper, we have proposed IF-TTN to learn discriminative spatiotemporal features for video action recognition. Specifically, the IFM is designed to fuse the appearance and motion features at multiple spatial scales for each video snippet, and the TTN is employed to model the middle-term temporal transformation between neighboring snippets. Our network achieves state-of-the-art results on two of the most popular action recognition datasets. The real-time version of IF-TTN, which operates on motion vectors, achieves a significant improvement over state-of-the-art real-time methods.


References

  • [1] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In CVPR, 2016.
  • [2] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
  • [3] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [4] I. C. Duta, B. Ionescu, K. Aizawa, and N. Sebe. Spatio-temporal vector of locally max pooled features for action recognition in videos. In CVPR, 2017.
  • [5] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual network for video action recognition. In NIPS, 2016.
  • [6] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, 2017.
  • [7] A. Gaidon, Z. Harchaoui, and C. Schmid. Temporal localization of actions with actoms. IEEE TPAMI, 2013.
  • [8] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [11] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
  • [12] A. Mathias. Deep draw, 2016.
  • [13] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.
  • [14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
  • [15] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [17] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [18] L. Sun, K. Jia, K. Chen, D.-Y. Yeung, B. E. Shi, and S. Savarese. Lattice long short-term memory for human action recognition. In ICCV, 2017.
  • [19] S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang. Optical flow guided feature: A fast and robust motion representation for video action recognition. arXiv preprint arXiv:1711.11152, 2017.
  • [20] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
  • [21] D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri. Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.
  • [22] G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. IEEE TPAMI, 2018.
  • [23] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
  • [24] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
  • [25] L. Wang, Y. Qiao, and X. Tang. Video action detection with relational dynamic-poselets. In ECCV, 2014.
  • [26] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
  • [27] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: towards good practices for deep action recognition. In ECCV, 2016.
  • [28] Y. Wang, M. Long, J. Wang, and P. S. Yu. Spatiotemporal pyramid network for video action recognition. In CVPR, 2017.
  • [29] C.-Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Compressed video action recognition. arXiv preprint arXiv:1712.00636, 2017.
  • [30] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [31] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium. Springer, 2007.
  • [32] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector cnns. In CVPR, 2016.
  • [33] B. Zhou, A. Andonian, and A. Torralba. Temporal relational reasoning in videos. arXiv preprint arXiv:1711.08496, 2017.
  • [34] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.