Reversing Two-Stream Networks with Decoding Discrepancy Penalty for Robust Action Recognition

We discuss the robustness and generalization ability of action recognition models, showing that mainstream neural networks are not robust to disordered frames and diverse video environments. There are two possible reasons. First, existing models lack an appropriate method to overcome the inevitable decision discrepancy between multiple streams with different input modalities. Second, through cross-dataset experiments, we find that optical flow features are hard to transfer, which affects the generalization ability of two-stream neural networks. For robust action recognition, we present the Reversed Two-Stream Networks (Rev2Net), which has three properties: (1) It learns more transferable, robust video features by reversing the multi-modality inputs as training supervisions. It outperforms all other compared models in challenging frame shuffle experiments and cross-dataset experiments. (2) It is highlighted by an adaptive, collaborative multi-task learning approach that is applied between decoders to penalize their disagreement in the deep feature space. We name it the decoding discrepancy penalty (DDP). (3) As the decoder streams are removed at test time, Rev2Net makes recognition decisions purely based on raw video frames. Rev2Net achieves the best results in the cross-dataset settings and competitive results on classic action recognition tasks: 94.6% on UCF-101 and 71.1% on HMDB-51, even better than some methods that take extra inputs beyond raw RGB frames.



1 Introduction

Figure 1: (a) Two-Stream 2D CNNs. (b) Two-Stream 3D CNNs. (c) ActionFlowNet [12]. (d) Rev2Net. We propose a new network architecture for video action recognition that is different from all previous work. Later, we will show that reversing the two-stream networks as self-supervisions enables the model to learn more generalized video features.

Learning a robust video representation is the foundation of action recognition. It goes beyond image classification and depends on joint modeling of both spatial and temporal cues. There are two dominant, well-performing deep learning architectures in this area. The first captures video appearance patterns with standard CNNs [15, 21, 25, 9, 10] and uses optical flow data in a second CNN stream to capture motion cues [20, 33, 7, 35, 30]. This standard 2D CNN framework is not specifically designed for videos and cannot fully exploit spatiotemporal features. According to [19], the high action recognition performance of these two-stream 2D CNN models mostly relies on the invariance of the optical flow data to appearance. Another influential deep architecture is the 3D CNN [26, 11, 27], which contains 3D convolutions and 3D pooling layers operating over space and time simultaneously, thus realizing unified spatiotemporal modeling. Some more recent methods [1, 17, 32] combine 3D CNNs with the above two-stream architecture and achieve state-of-the-art results.

We believe that a good action recognition model needs to be robust to temporal disorder (in real-time applications, video frames can arrive out of order). It should be able to adaptively make decisions according to either the long-term or the short-term motion cues. We also believe that good video representations need to be more transferable: they should still support easy recognition when transferred from the source domain to the target domain. Say we train a model to recognize the action category of playing basketball with training videos in which people play basketball on outdoor courts. It would be unreasonable if the model could not recognize playing basketball on indoor courts. However, most previous work has not focused on improving the robustness of video action recognition models. In this paper, we first show that two-stream 2D CNNs and 3D CNNs break down easily with only limited disturbance of the video frame dependencies. We then introduce a video action domain adaptation experiment to show that optical flow features are NOT transferable across datasets, which severely affects the generalization ability of existing action recognition models.

From the above experiments, we find that when the streams make different decisions, existing two-stream CNNs may fail to appropriately resolve the discrepancy between the results obtained from different modalities, even though previous work attempted to address this issue with relatively simple feature fusion methods such as concatenation, convolution, or bilinear pooling [33, 29]. We conjecture that such a discrepancy may deteriorate the robustness of action recognition models.

To solve this problem, we propose a new neural network for video action recognition, named the Reversed Two-Stream Networks (Rev2Net), as shown in Figure 1(d). We show that, compared with previous methods, this architecture can learn more generalized and robust video features. More specifically, Rev2Net consists of one encoder stream with RGB inputs, one classifier for action recognition, one decoder stream for optical flow prediction, and another decoder stream for reversed RGB frame reconstruction. The decoder streams, in a reversed form of the original two-stream networks, use different modalities of video frames as self-supervision signals. During training, the encoded video features diverge into three network branches. At test time, in contrast, Rev2Net does not need pre-computed optical flow data as inputs and makes decisions purely based on raw RGB frames. The auxiliary self-supervised tasks cannot be expected to help the final task all the time, and sometimes they may pull the encoder representations away from the recognition task. We therefore penalize the feature space discrepancy between the two decoder streams. This penalty enables adaptive learning between the two prediction tasks. In other words, the two decoder streams can interact with each other through backpropagation, and may reach an agreement on which one (the optical flows or the reversed RGB images) provides more useful information for the encoder feature learning.

Technically, this paper has three contributions: (1) We analyze the robustness of mainstream video action recognition models to disturbance of motion dependencies and variations of video environments, by conducting long-term and short-term frame shuffle experiments, as well as video action domain adaptation experiments. (2) We present a novel deep network called Rev2Net with better generalization ability (more transferable) using self-supervised multi-task learning methods. The most distinctive feature of Rev2Net is the decoding discrepancy penalty (DDP). It enables adaptive, collaborative multi-task learning between different decoder streams by penalizing their disagreement in the deep feature space. (3) Rev2Net does not require pre-computed optical flows at test time. It achieves competitive results on classic action recognition tasks: 94.6% for UCF-101, 71.1% for HMDB-51 and 73.3% for Kinetics, even better than some state-of-the-art methods that take optical flows as inputs.

2 Related Work

Ever since the great impact of CNNs on image classification, many researchers have tried to reuse CNNs for video action recognition. According to an early neurological study [13], the motion cue is the primary reason that humans can recognize a range of actions. To capture the motion patterns in spatiotemporal data, researchers have explored various deep network architectures, including different connectivity methods in 2D CNNs [14], non-local networks [31], 3D CNNs that extend convolutional filters into the time domain [26, 11, 27], temporally recurrent layers that aggregate features across longer video inputs [4, 33], as well as two-stream ensemble networks with a second stream of CNNs fed with pre-computed optical flow frames [20]. Among all these architectures, the two-stream networks with optical flow inputs and the 3D or “pseudo” 3D CNNs have been most widely explored.

Two-Stream Networks.

The two-stream networks were first introduced to deep video classification by Simonyan et al. [20], who model short temporal snapshots of videos by averaging the predictions from a single RGB frame and a stack of computed optical flow frames. Noticing that RGB images could not fully exploit temporal cues, they extracted the optical flow from consecutive video frames and took it as a network input. The optical flow stream brought a significant performance gain. Since then, two-stream networks have been widely employed by many video action recognition models [2, 33, 7, 30, 29], including some 3D CNN models [1, 17, 32].

3D CNNs.

The 3D convolutions have been explored more than once [26, 11, 27]. A recent boom started with C3D [27], a 3D version of VGGNet [21] that contains 3D convolutions and 3D pooling operating over space and time simultaneously. Intuitively, C3D realizes a unified modeling of spatiotemporal features. A side effect is that 3D convolutions bring an inevitable increase in the number of network parameters, making C3D hard to train. To solve this problem, Carreira et al. [1] proposed the Inflated 3D ConvNets (I3D). As its name suggests, I3D inflates 2D convolutional filters into 3D, so that the 3D models are implicitly pre-trained on ImageNet. Other recent designs focused on reducing the memory footprint and easing the training of 3D CNNs. P3D [17] and S3D [32] both replaced the 3D convolutional filter with a spatial convolution plus a temporal convolution. T3D [3] uses a 3D DenseNet stream with multi-depth temporal pooling layers to obtain a smaller model footprint than I3D. These 3D CNN models achieved state-of-the-art performance, especially when incorporated into the two-stream architecture.

Action Recognition Models with Optical Flow Synthesis.

The optical flow is crucial for the performance of two-stream networks. To make better use of optical flow frames, some recent methods [19, 36, 23] went beyond taking them as network inputs. They showed that training to generate optical flows with deep networks, e.g. FlowNet [5] and SpyNet [18], could improve the recognition performance. Sevilla-Lara et al. [19] also tried to interpret the correlations between the optical flow and action recognition results, regarding the CNN model as a “black box”. They conducted an interesting experiment by randomly shuffling the flow frames before feeding them into a 2D CNN model called TSN [30]. They argued that the power of optical flow mainly comes from its invariance to the frame appearance, rather than from its ability to model long-term motion cues. This is potentially true for these 2D CNNs, since most of them are ensemble models across different sampling time stamps.

Another way to use optical flow, previously proposed by ActionFlowNet [12], is to estimate the optical flow with 3D CNNs. It improved action classification accuracy with the help of motion information and did not need optical flow data as inputs. But predicting the optical flow alone does not have a consistently positive effect on learning good features, so it can hardly improve the robustness and the generalization ability of the model.

3 Analysis of Robustness to Frames Disorder

Modeling temporal dynamics is essential for video classification performance. However, in some real-time applications, video frames may arrive out of order. Thus, there is a dilemma in video action recognition. On one hand, we want our model to extract useful motion cues from the video frame dependency, which is the most distinct difference from the image classification problem. On the other hand, too much concentration on coherent temporal information makes the model less robust to disordered frames in some real-time recognition settings. An ideal resolution to this dilemma is to learn a model that can adaptively base its decisions on either the short-term or the long-term video frame dependency. Correspondingly, the ideal situation is that actions can be distinguished as long as either the short-term or the long-term video order is correct. We do not discuss how to recognize totally disordered frames, as such situations are not very meaningful in the realm of video analysis.

In this section, we analyze the robustness of existing mainstream video action recognition models using frame shuffle experiments. In addition, it remains unclear whether long-term or short-term frame dependencies are successfully modeled by 3D CNNs. We extend the work of [19] from 2D CNNs to 3D CNNs and conduct frame shuffle experiments on the state-of-the-art I3D model [1]. We train this model with correctly ordered frames, evaluate it with disordered frames, and measure the resulting drop in accuracy. Three shuffle schemes are used: a long-term shuffle, a short-term shuffle and a complete shuffle. Concretely, we organize frame sequences into frame blocks of consecutive frames. The long-term shuffle scheme keeps the order inside each block and rearranges the blocks in a random order. This maintains the short-term dependency but breaks the long-term dependency. The short-term shuffle scheme behaves in the opposite way: it shuffles frames within each block but keeps the order of the blocks, which breaks the short-term dependency while maintaining the long-term one. The complete shuffle scheme randomizes the order of all frames.
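The three shuffle schemes can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the function names, the block size of 4, and the 16-frame clip are assumptions chosen for the example, since the block length is not stated here.

```python
import random

def make_blocks(frames, block_size):
    """Split a frame sequence into blocks of consecutive frames."""
    return [frames[i:i + block_size] for i in range(0, len(frames), block_size)]

def long_term_shuffle(frames, block_size, rng):
    """Keep the order inside each block, permute the blocks (breaks long-term order)."""
    blocks = make_blocks(frames, block_size)
    rng.shuffle(blocks)
    return [f for block in blocks for f in block]

def short_term_shuffle(frames, block_size, rng):
    """Permute frames inside each block, keep the block order (breaks short-term order)."""
    out = []
    for block in make_blocks(frames, block_size):
        block = list(block)
        rng.shuffle(block)
        out.extend(block)
    return out

def complete_shuffle(frames, rng):
    """Permute all frames (breaks both long-term and short-term order)."""
    frames = list(frames)
    rng.shuffle(frames)
    return frames

# Hypothetical clip: frame indices stand in for the actual images.
rng = random.Random(0)
frames = list(range(16))
long_out = long_term_shuffle(frames, 4, rng)
short_out = short_term_shuffle(frames, 4, rng)
full_out = complete_shuffle(frames, rng)
```

In the long-term output every block of four frames is still internally ordered, whereas in the short-term output block k still contains exactly the frames 4k to 4k+3, only in a random order.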

| Model | Input Modality | No Shuffle | Long-Term Shuffle | Short-Term Shuffle | Complete Shuffle |
| --- | --- | --- | --- | --- | --- |
| I3D | RGB | 84.5 | 76.8 | 77.6 | 65.4 |
| I3D | Flow | 90.6 | 73.9 | 73.2 | 51.2 |
| TSN | Flow | 86.85 [19] | - | 78.64 [19] | 59.55 [19] |

Table 1: Shuffle experiments on UCF101. A previous work [19] has investigated the properties of 2D CNN features using frame shuffle experiments. We extend this analysis approach to 3D CNNs and further discuss their feature properties.


As shown in Table 1, we can see that: (1) Under the complete shuffle scheme, the accuracy of the flow network (90.6% to 51.2%) decreases to a greater extent than that of the RGB network (84.5% to 65.4%), indicating that the I3D flow network is more dependent on underlying motion cues than the I3D RGB network. On the other hand, this also means that the flow network is less robust to the complete shuffle. (2) The other two shuffle experiments show that modeling long-term and short-term motion is almost equally important to the performance of I3D. But from the perspective of the dilemma mentioned above, 3D CNNs deteriorate easily when either long-term or short-term disorder exists. Such an action recognition algorithm may be fragile in real-time applications. For a complete comparison, we borrow the TSN shuffle results from [19], shown in the third row of Table 1. Here, we do not discuss the long-term shuffle test, because the final classification score of TSN is averaged over several short snippets sampled from the whole video clip and is not sensitive to their order. Apparently, the 2D CNN model cannot make full use of long-term motion cues. From the other side of the dilemma, it is not an ideal algorithm either, as it may perform worse in a fully controlled environment where the order of the arriving frames can be guaranteed.
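The comparison above amounts to simple arithmetic on the I3D rows of Table 1. The short sketch below recomputes the accuracy drop of each shuffle scheme relative to the unshuffled baseline; the dictionary layout and the helper name `drops` are illustrative assumptions, while the numbers are copied from the table.

```python
# Accuracies (%) of I3D on UCF101, copied from Table 1.
table1 = {
    "RGB":  {"none": 84.5, "long": 76.8, "short": 77.6, "complete": 65.4},
    "Flow": {"none": 90.6, "long": 73.9, "short": 73.2, "complete": 51.2},
}

def drops(row):
    """Accuracy drop (percentage points) of each shuffle scheme vs. no shuffle."""
    return {k: round(row["none"] - v, 1) for k, v in row.items() if k != "none"}

for name, row in table1.items():
    print(name, drops(row))
```

The flow stream loses 39.4 points under the complete shuffle against 19.1 for the RGB stream, while the long-term and short-term drops are close to each other within each stream (7.7 vs. 6.9 for RGB, 16.7 vs. 17.4 for flow).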

4 Analysis of Generalization Ability

Consider that we train a model to recognize the action of playing basketball with videos in which people do this in outdoor courts. One may wonder whether this model can easily recognize the action of playing basketball in indoor courts. We believe that good features for action recognition should be easily transferred from one dataset to another.

Figure 2: Video action domain adaptation faces diverse video environments, which illustrates the generalization ability of an action recognition model. A good action recognition model should perform robustly under cross-dataset variations.

| Model | Input Modalities | H → U | U → H |
| --- | --- | --- | --- |
| TSN | RGB | 58.5 | 40.1 |
| TSN | Flow | 33.2 | 23.0 |
| TSN | RGB + Flow | 60.9 | 40.3 |
| I3D | RGB | 56.4 | 41.0 |
| I3D | Flow | 45.0 | 31.1 |
| I3D | RGB + Flow | 57.9 | 41.5 |

Table 2: We explore the generalization ability of the mainstream video classification models using the cross-dataset action recognition experiments. U: UCF101. H: HMDB51. The left of the arrow is the source dataset for training. The right of the arrow is the target dataset for test. For all models, we apply a domain adaptation method DANN [8] to further explore the transfer ability of them.

Video Action Domain Adaptation.

Domain adaptation is an effective way to verify the generalization ability of video action recognition models. The purpose of domain adaptation is to learn a model from the distribution of the source domain, and to apply this model to the target domain with a different (but related) distribution. As video domain adaptation with neural networks is under-explored by previous work, we select related categories from UCF101 and HMDB51. Note that these two datasets have diverse data patterns: HMDB51 is mostly collected from movies, while UCF101 is collected from YouTube and appears to be closer to real life. Even for the same action category, the video appearance of these two datasets is quite different, e.g. in scene complexity and camera angles, as shown in Figure 2. The distance between the distributions of videos from the source dataset and the target dataset is the main difficulty for action knowledge transfer. To evaluate the maximum transferability of the mainstream video action recognition models (TSN for 2D CNNs and I3D for 3D CNNs), we incorporate a domain adaptation method, DANN [8], which was originally proposed for image transfer learning, into these models. DANN closes the distribution distance by matching the mean embeddings in the feature space across domains. We train TSN and I3D models with labels from the source dataset, and evaluate them on the target dataset.


Table 2 shows the cross-dataset results of TSN and I3D with one or two input data modalities. When taking RGB frames as inputs, the 2D CNNs and 3D CNNs have comparable results, even though 2D CNNs are incapable of learning motion dependencies from consecutive frames. Surprisingly, for both TSN and I3D, the network streams with RGB inputs outperform the optical flow streams in the domain adaptation settings. Consequently, the two-stream networks yield only marginal improvement over the one-stream RGB network (e.g., for TSN, 60.9% vs. 58.5% from HMDB51 to UCF101 and 40.3% vs. 40.1% from UCF101 to HMDB51). This observation contradicts our expectations about the two-stream architecture on the classic video classification task. Under the framework of traditional two-stream networks that take optical flows as inputs, the only way to improve the overall cross-dataset performance is to improve the generalization ability of the optical flow stream. In this paper, we do not intend to analyze the underlying reasons for these observations. We only conclude that neither 2D CNNs nor 3D CNNs show great generalization ability with optical flow inputs. We conjecture that optical flow inputs may inherently yield less transferable features across datasets, so using them as inputs may not be an appropriate way to improve the overall cross-dataset performance. Later, we will show that our proposed Rev2Net model improves the performance in the same cross-dataset settings (as shown in Table 4).

Decision Discrepancy in Two-Stream Networks.

The less transferable optical flow features affect the recognition accuracy of two-stream 3D CNNs across diverse video environments. The large performance gap between the RGB stream and the optical flow stream implies a large discrepancy between the two network streams. In classic video recognition settings, the optical flow stream achieves high performance, which hides this disadvantage. Thus, the final ensemble model can easily yield an accuracy boost by simply averaging the outputs from the two streams. However, in the cross-dataset settings, the discrepancy may deteriorate the robustness and the generalization ability of the entire two-stream model. We may conclude that, from the perspective of robustness, the optical flows are not used appropriately.

A Brief Summary of the Robustness Analysis Regarding Existing Models.

As we can see from the above, first, both TSN and I3D may break down when either the long-term or the short-term frame order is disrupted, showing that they may not be robust in real-time applications in which the arriving frames can be disordered. Second, in the domain adaptation experiments, both TSN and I3D suffer from severe performance degradation. Existing video action recognition models do not consider cross-domain robustness to diverse video environments. Third, the two network streams in 3D CNNs are not ideally complementary. Through the domain adaptation experiments, we find that the two-stream network architecture does not benefit the cross-dataset recognition results. The overall architecture lacks the ability to overcome the discrepancy between the optical flow stream and the RGB stream.

5 Reversed Two-Stream Networks

In this section, we first present a model called Reversed Two-Stream Networks (Rev2Net) for robust action recognition. We then discuss a corresponding method to mitigate the discrepancy of the two decoder streams in Rev2Net.

Figure 3: A schematic of the Rev2Net model with the decoding discrepancy penalty (DDP). Inc denotes the inception submodule in the inflated Inception-V1 architecture [1]. The two decoder streams in Rev2Net predict the corresponding optical flows and reconstruct the RGB inputs in a reversed order. DDP is used to overcome the discrepancy between the two decoder streams, allowing them to be trained collaboratively.

5.1 Reversing Input Signals as Self-supervisions

As mentioned above, the two-stream network architecture is widely used in video action recognition. Recent 3D CNN models also adopt this structure to improve their performance. The two network streams extract spatiotemporal representations from the RGB frames and the optical flows, respectively. The problem with this two-stream network is that it lacks the capability to resolve the discrepancy between the results obtained from the two input modalities. This problem is not severe in classic action recognition settings, but it is severe in our preliminary experiments described in the above two sections. As the optical flow stream shows limited generalization ability in cross-dataset settings, this stream severely affects the overall performance of the two-stream model. In our settings, using optical flows as inputs is not an ideal approach for learning robust video features.

Our idea is to design a robust neural network with better generalization ability by reversing both the optical flow inputs and the RGB inputs as self-supervisions, and to train this Reversed Two-Stream Network (Rev2Net) in a multi-task framework. A schematic of Rev2Net is shown in Figure 3. Rev2Net has four components: one encoder stream with RGB inputs, one classifier for action recognition, one decoder stream for optical flow prediction, and another decoder stream for reversed RGB frame reconstruction. The encoder operates on consecutive video frames. Along with the classifier, it has the same architecture as I3D. The two decoder streams are composed of 3D transpose convolutions. The flow decoder aims to emphasize learning the short-term motion features as well as the foreground appearance features by using the corresponding flow fields as supervisions. The frame decoder, on the other hand, aims to emphasize learning the long-term motion features by reconstructing the input frames in a reversed order from the encoded 3D feature volume, which can be viewed as an information bottleneck that forces the model to learn high-level video representations. These two decoder streams are only used at training time, and are removed at test time.

The two decoder streams, which respectively emphasize short-term and long-term frame dependencies, allow the Rev2Net model to capture short-term and long-term motion cues adaptively, thus improving its robustness to frame disorder in real-time applications where the order of the arriving frames cannot be guaranteed. On the other hand, by using optical flows as supervision signals, Rev2Net makes decisions without the aid of optical flow inputs, or in other words, without the disturbance of optical flows in the video action domain adaptation settings, considering that the optical flow network behaves poorly in the cross-dataset experiments. The encoder stream in Rev2Net can still learn useful information from the flow decoder at training time. We believe this is a more appropriate way to incorporate the merits of optical flows while avoiding their shortcomings. The empirical results show that the Rev2Net architecture has better robustness and generalization ability.

5.2 Decoding Discrepancy Penalty

As we have seen, traditional two-stream networks suffer from the decision discrepancy between the two network streams in cross-dataset settings. Our Rev2Net model avoids the decision discrepancy by employing only one RGB encoder and reversing the optical flow inputs as well as the RGB inputs as training supervisions, but it has to confront the discrepancy between the two decoders in their effects on feature learning. Though the optical flow prediction task and the reversed frame reconstruction task force the encoder to focus on different parts of the input frames, they may not always have positive and complementary effects on training the encoder network. To this end, we penalize the feature space distance between the two decoder streams by defining a new objective function called the decoding discrepancy penalty (DDP). We propose two forms of DDP and their corresponding network designs, and explore the effects of applying the two decoder streams with DDP to the low-level or high-level video representations.

Frobenius Norm and Low-Level Features.

As shown in Figure 3, we build the two decoder streams on the low-level feature maps of the encoder. In other words, both the decoders and the encoder have only a few convolutional layers. Correspondingly, we use the Frobenius norm to penalize the distance between the decoders in the low-level feature space:

$$\mathcal{L}_{\text{DDP}} = \sum_{l \in \mathcal{S}} \left\| \phi_l^{\text{flow}} - \phi_l^{\text{frame}} \right\|_F$$

where $\|\cdot\|_F$ is the Frobenius norm, $\mathcal{S}$ is the set of convolutional layers included in the DDP, and $\phi_l^{\text{flow}}$ and $\phi_l^{\text{frame}}$ are the low-level feature maps of the optical flow decoder stream and the reversed frame decoder stream at layer $l$.
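As a rough illustration of the Frobenius-norm penalty, the sketch below sums, over paired decoder layers, the Frobenius norm of the difference between the two decoders' feature maps. It is a plain-Python sketch under stated assumptions: feature maps are represented as nested lists, the norm is taken unsquared and unnormalized, and the names `flatten`, `frobenius_distance` and `ddp_frobenius` are hypothetical. A real implementation would operate on framework tensors.

```python
import math

def flatten(x):
    """Yield the scalar entries of an arbitrarily nested list (a toy feature map)."""
    if isinstance(x, (list, tuple)):
        for item in x:
            yield from flatten(item)
    else:
        yield float(x)

def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference of two equally sized maps."""
    fa, fb = list(flatten(a)), list(flatten(b))
    if len(fa) != len(fb):
        raise ValueError("feature maps must have the same number of entries")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(fa, fb)))

def ddp_frobenius(flow_feats, frame_feats):
    """Sum the Frobenius distance over the paired decoder layers."""
    if len(flow_feats) != len(frame_feats):
        raise ValueError("both decoders must contribute the same layers")
    return sum(frobenius_distance(a, b) for a, b in zip(flow_feats, frame_feats))

# Two paired layers with toy values (assumed, for illustration only).
flow_feats = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0]]]
frame_feats = [[[1.0, 2.0], [3.0, 1.0]], [[3.0, 4.0]]]
penalty = ddp_frobenius(flow_feats, frame_feats)  # 3.0 from layer 1 plus 5.0 from layer 2
```

The penalty is zero only when the two decoders produce identical feature maps at every included layer, which is what lets its gradient pull the two streams toward agreement.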

Network details can be found in Figure 3. There are two TransposeConv3D layers in the flow decoder, and three in the frame decoder, plus a Conv3D layer for generating the background, which is not shown specifically. These decoders are plugged into the original I3D network on top of the Conv3d_2c layer, taking its output feature volume as their input. Intuitively, we do not want the decoders to be too strong, since training a good feature encoder may require relatively weak decoders.

KL Divergence and High-Level Features.

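We can also mitigate the high-level feature learning discrepancy by allocating more layers to the encoder network and its reversed decoder counterparts. In this method, the encoder has three outputs: the features that are the inputs of the classifier; the mean $\mu_{\text{flow}}$ and the variance $\sigma^2_{\text{flow}}$ of the Gaussian distribution $\mathcal{N}(\mu_{\text{flow}}, \sigma^2_{\text{flow}})$ used for optical flow prediction; and the mean $\mu_{\text{frame}}$ and the variance $\sigma^2_{\text{frame}}$ of the Gaussian distribution $\mathcal{N}(\mu_{\text{frame}}, \sigma^2_{\text{frame}})$ used for reversed frame reconstruction. We propose to apply the KL divergence to the high-level, low-dimensional outputs of the encoder, i.e. the means and variances, to close the distance between the two Gaussian distributions:

$$\mathcal{L}_{\text{DDP}} = D_{\text{KL}}\!\left( \mathcal{N}(\mu_{\text{flow}}, \sigma^2_{\text{flow}}) \,\|\, \mathcal{N}(\mu_{\text{frame}}, \sigma^2_{\text{frame}}) \right)$$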

Along with the optical flow prediction objective function or the reversed frames reconstruction objective function, the decoders can be trained in a similar way as variational autoencoders. By applying DDP to the overall loss function in the training process, including either the Frobenius norm or the KL divergence, we allow the two decoder streams to negotiate and collaborate with each other to improve the consistency of their effects on learning a better encoder.
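For Gaussians with diagonal covariance, the KL divergence has a well-known closed form, which the sketch below evaluates per dimension. The function name, the list-based inputs, and in particular the direction of the divergence (flow distribution first) are assumptions made for illustration; the text does not state which direction is used.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians.

    Each argument is a list with one entry per latent dimension;
    variances must be strictly positive.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total

# Toy encoder outputs for the two decoder streams (assumed values).
mu_flow, var_flow = [0.0, 1.0], [1.0, 1.0]
mu_frame, var_frame = [1.0, 1.0], [1.0, 1.0]
ddp_kl = kl_diag_gaussians(mu_flow, var_flow, mu_frame, var_frame)  # 0.5
```

Because the KL divergence is asymmetric, swapping the two distributions generally changes the value whenever the variances differ, so an implementation has to fix one direction or use a symmetrized variant.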

| Model | Input Modality | No Shuffle | Long-Term Shuffle | Short-Term Shuffle | Complete Shuffle |
| --- | --- | --- | --- | --- | --- |
| I3D [1] | RGB | 84.5 | 76.8 | 77.6 | 65.4 |
| Rev2Net | RGB | 94.6 | 92.3 | 91.8 | 60.3 |

Table 3: The results of our proposed Rev2Net model and the I3D model [1] in the UCF101 frame shuffle experiments. More preliminary results can be found in Table 1. Rev2Net is robust as long as either the long-term or the short-term frame dependency still exists. When all frames are shuffled completely, the performance of Rev2Net reasonably decreases due to its dependency on the long-term or short-term frame order. This result indicates its difference from an image classification model, which would be fully robust to all three shuffling schemes.

5.3 Final Objectives

The final Rev2Net architecture with the Frobenius norm DDP is shown in Figure 3. The three tasks, i.e. action recognition, flow prediction, and frame reconstruction, along with their four loss terms including the DDP loss, can be jointly trained as follows:

$$\mathcal{L} = \mathcal{L}_{\text{ce}}(\hat{y}, y) + \lambda_{\text{flow}}\, \mathcal{L}_{\text{flow}}(\hat{F}, F) + \lambda_{\text{frame}}\, \mathcal{L}_{\text{frame}}(\hat{X}, X) + \mathcal{L}_{\text{DDP}}$$

where $\mathcal{L}_{\text{ce}}$ is the cross-entropy loss between the softmax output $\hat{y}$ of the classifier and the ground truth action label $y$. $\hat{F}$ are the generated optical flows, and $F$ are the target optical flows, which are pre-computed using TV-L1 [34]. $\lambda_{\text{flow}}$ is the loss weight for the optical flow prediction task. Similarly, $\hat{X}$ are the generated frames, and $X$ are the real input frames. $\lambda_{\text{frame}}$ is the loss weight for the frame reconstruction task.
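The way the four loss terms combine can be sketched in plain Python as follows. Several details are assumptions rather than statements from the text: mean squared error is used for both the flow prediction and the frame reconstruction terms, the DDP term enters with unit weight, the frame target is the input clip in reversed temporal order, and the weights 0.5 and 0.1 as well as all function and argument names are hypothetical.

```python
import math

def mse(pred, target):
    """Mean squared error over a list of equally sized per-frame vectors."""
    sq = [(p - t) ** 2 for pv, tv in zip(pred, target) for p, t in zip(pv, tv)]
    return sum(sq) / len(sq)

def rev2net_loss(probs, label, flow_pred, flow_target,
                 frame_pred, frames, ddp, w_flow, w_frame):
    """Cross-entropy + weighted flow loss + weighted reversed-frame loss + DDP."""
    ce = -math.log(probs[label])      # classification term
    frame_target = frames[::-1]       # input frames in reversed temporal order
    return (ce
            + w_flow * mse(flow_pred, flow_target)
            + w_frame * mse(frame_pred, frame_target)
            + ddp)

# Toy example: a two-frame clip with one value per frame (assumed values).
loss = rev2net_loss(
    probs=[0.5, 0.5], label=0,
    flow_pred=[[1.0, 1.0]], flow_target=[[0.0, 0.0]],
    frame_pred=[[0.0], [1.0]], frames=[[0.0], [1.0]],
    ddp=2.0, w_flow=0.5, w_frame=0.1,
)
```

At test time only the classification branch is evaluated, so none of the auxiliary terms, nor the optical flow targets they need, are required for inference.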

6 Experiments

This section is organized in accordance with the progress of our work. First, we introduce the datasets. Next, to demonstrate the generalization ability and transferability of Rev2Net, we conduct the cross-dataset experiments and the shuffle experiments. Finally, we compare its performance with state-of-the-art action recognition approaches. In the previous section, we discussed both the Frobenius norm and the KL divergence for the DDP loss. We evaluate the former in our experiments and leave the latter for future work.

6.1 Datasets

We train and evaluate the Rev2Net network on three standard datasets: (1) UCF101 [22] is mainly composed of YouTube videos. It has annotated video snippets from 101 action categories. (2) HMDB51 [16] contains video clips collected mostly from movies, and covers 51 action categories. The two datasets are quite different in appearance. We adopt the standard training and evaluation protocols on both of them. (3) Kinetics [28] contains annotated videos of human action classes, with each clip taken from a different YouTube video. The actions are human focused and cover a broad range of classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands.

6.2 Robustness: Long-Term or Short-Term

Table 3 compares our proposed Rev2Net model with the I3D model [1] in the UCF101 frame shuffle experiments; more preliminary results are given in Table 1. Rev2Net remains robust as long as either the long-term or the short-term dependency is preserved, indicating that it is more robust to temporal variations. When all frames are shuffled completely, the performance of Rev2Net decreases, as expected, because it depends on the long-term or short-term frame order. This distinguishes it from an image classification model, which would be fully robust to all three shuffling schemes.
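The three shuffling schemes can be sketched as follows. This is an illustrative reading rather than the exact protocol: it assumes that "long-term order preserved" means shuffling frames only inside fixed-size chunks, that "short-term order preserved" means permuting whole chunks while keeping each chunk intact, and that the chunk size is a free parameter. All function names are hypothetical.

```python
import random


def split_chunks(frames, chunk_size):
    """Split a frame list into consecutive chunks of at most chunk_size frames."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def shuffle_within_chunks(frames, chunk_size, rng):
    """Keep the long-term (chunk) order, destroy the short-term order inside each chunk."""
    shuffled = []
    for chunk in split_chunks(frames, chunk_size):
        chunk = list(chunk)
        rng.shuffle(chunk)
        shuffled.extend(chunk)
    return shuffled


def shuffle_chunk_order(frames, chunk_size, rng):
    """Keep the short-term order inside each chunk, destroy the long-term chunk order."""
    chunks = split_chunks(frames, chunk_size)
    rng.shuffle(chunks)
    return [frame for chunk in chunks for frame in chunk]


def shuffle_all(frames, rng):
    """Destroy both long-term and short-term order."""
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)              # fixed seed so the shuffles are reproducible
frame_ids = list(range(12))         # stand-in for 12 video frames
short_term_broken = shuffle_within_chunks(frame_ids, 4, rng)
long_term_broken = shuffle_chunk_order(frame_ids, 4, rng)
fully_shuffled = shuffle_all(frame_ids, rng)
```

An image classifier that scores frames independently would give the same prediction under all three functions, whereas a model that relies on temporal order should degrade most under `shuffle_all`.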

6.3 Robustness: Cross-dataset Experiments

Within the framework of traditional two-stream networks, the only way to improve overall cross-dataset performance is to improve the generalization ability of the optical flow stream. However, as shown above, two-stream networks do not perform well in cross-dataset experiments because optical flow features are hard to transfer. The Rev2Net model resolves this problem in a different way. As Table 4 shows, Rev2Net outperforms both the two-stream TSN and I3D models in this task, demonstrating better generalization ability. The results also show the effectiveness of reversing optical flow data into self-supervision signals, and of exploiting the decoding discrepancy penalty to bridge the features learned by the optical flow decoder and the reversed frame decoder.

Model       Input Modality   H → U   U → H
TSN [30]    RGB              58.5    40.1
I3D [1]     RGB              56.4    41.0
Rev2Net     RGB              63.1    46.8
Table 4: Further cross-dataset action recognition results. As in Table 2, U denotes UCF101 and H denotes HMDB51. We also apply DANN [8] to Rev2Net to explore its maximum generalization ability.
Model                        Input   UCF101   HMDB51
CNN + LSTM [4]               RGB     68.2     -
TSSN [20]                    RGB     73.0     40.5
CPSN [7]                     RGB     82.6     -
SSR [6]                      RGB     82.3     43.4
Spatial TDD [4]              RGB     82.8     50.0
TSN [30]                     RGB     86.4     53.7
ActionFlownet [12]           RGB     83.9     56.4
Sun-OFF [24]                 RGB     93.3     -
I3D [1]                      RGB     84.5     49.8
TSN [30]                     R + F   94.0     69.4
Sun-OFF [24]                 R + F   96.0     74.2
I3D [1]                      R + F   93.4     66.4
I3D* [1]                     R + F   98.0     80.7
Rev2Net w/o Rev. RGB Dec.    RGB     93.8     69.6
Rev2Net w/o Flow Dec.        RGB     93.1     67.5
Rev2Net w/o DDP              RGB     93.3     69.7
Rev2Net                      RGB     94.6     71.1
Rev2Net                      R + F   97.1     78.0
Table 5: Recognition results of Rev2Net and other state-of-the-art approaches on the UCF101 and HMDB51 datasets. R + F denotes RGB and optical flow inputs. I3D* denotes an I3D model pre-trained on Kinetics. All other models are pre-trained on ImageNet.
Model              Input Modalities   Kinetics
I3D [1]            RGB                71.1
I3D [1]            RGB + Flow         74.2
Rev2Net w/o DDP    RGB                72.7
Rev2Net w/o DDP    RGB + Flow         74.1
Rev2Net            RGB                73.3
Rev2Net            RGB + Flow         74.8
Table 6: Recognition results of Rev2Net and other state-of-the-art approaches on the Kinetics dataset. All models are pre-trained on ImageNet. With or without optical flow inputs, Rev2Net outperforms the other models that use the same input modalities, and the decoding discrepancy penalty consistently improves the results.

6.4 Final Results

Comparison with the State-of-the-Art.

First, we compare our final architecture with state-of-the-art approaches on the UCF101 and HMDB51 datasets. To include more methods in the comparison, we use an ImageNet pre-trained model as our base network, the same one adopted by the compared models. Results are shown in Table 5. Among the models pre-trained on ImageNet, Rev2Net achieves the best results within each input modality on both UCF101 and HMDB51; only I3D*, which is pre-trained on Kinetics, scores higher. Initialized with the same pre-trained model, Rev2Net with RGB inputs alone achieves even better results than the two-stream I3D model with both RGB and optical flow inputs (94.6% vs. 93.4% on UCF101). Furthermore, Rev2Net with RGB inputs alone uses less memory at test time.

On the large-scale Kinetics dataset, our model also achieves competitive results, as shown in Table 6. With the same type of input data, Rev2Net consistently outperforms I3D. It is also worth noting that, at test time, our network has the same architecture as the I3D model. Thus, predicting action categories at test time incurs no extra computation or memory footprint compared with I3D.
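The reason test-time cost matches the base network can be illustrated with a small structural sketch. It assumes that both decoders branch off the encoder features and serve only as training supervision; the class `Rev2NetSketch`, its method names, and the stand-in callables are hypothetical and do not reproduce the actual implementation.

```python
class Rev2NetSketch:
    """Hypothetical wrapper: one encoder, one classifier, two training-only decoders."""

    def __init__(self, encoder, classifier, flow_decoder, frame_decoder):
        self.encoder = encoder
        self.classifier = classifier
        self.flow_decoder = flow_decoder
        self.frame_decoder = frame_decoder

    def training_outputs(self, frames):
        # Training needs all three heads so that every loss term can be computed.
        features = self.encoder(frames)
        return {
            "logits": self.classifier(features),
            "flows": self.flow_decoder(features),
            "frames": self.frame_decoder(features),
        }

    def drop_decoders(self):
        # The decoders only provide supervision, so they can be discarded after training.
        self.flow_decoder = None
        self.frame_decoder = None

    def predict(self, frames):
        # Inference touches only the encoder and the classifier.
        return self.classifier(self.encoder(frames))


model = Rev2NetSketch(
    encoder=lambda frames: sum(frames),        # stand-in for the 3D CNN encoder
    classifier=lambda features: features % 3,  # stand-in for the classifier head
    flow_decoder=lambda features: features * 2,
    frame_decoder=lambda features: features + 1,
)
model.drop_decoders()
prediction = model.predict([1, 2, 3])
```

Because `predict` never calls a decoder, removing the decoders changes neither the prediction nor the inference cost of the encoder and classifier.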

We clarify that we do not dispute the value of optical flow inputs for the classic action recognition task. Indeed, adding a flow stream, as traditional two-stream models do, helps classification performance on all three datasets, so long as the two-stream models are trained and evaluated on the same dataset. These results do not conflict with the previous cross-dataset results. Still, our method with RGB inputs alone not only achieves satisfying performance on a single dataset, but also shows greater robustness and generalization ability across datasets.

Ablation Study.

We evaluate different Rev2Net models trained with two types of objective functions and three types of decoder-stream architectures. The results in Table 5 suggest that both the flow decoder for the optical flow prediction task and the frame decoder for the frame reconstruction task contribute to recognition performance. Also, on UCF101, a Rev2Net model trained without the decoding discrepancy penalty (DDP) performs even worse than the single-decoder Rev2Net without the reversed RGB decoder (93.3% vs. 93.8%). This result indicates that a single decoder stream in Rev2Net may not always perform well unless the DDP method is applied to restrict the feature learning discrepancy between the two decoder streams. The final Rev2Net (RGB) trained with DDP outperforms both Rev2Net trained without DDP (94.6% vs. 93.3% on UCF101) and its one-decoder counterparts (94.6% vs. 93.8% and 93.1%).
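The ablation rows of Table 5 can be turned into explicit gains of the full RGB model over each ablated variant. The numbers below are copied from Table 5; the dictionary layout and the helper name `gains_over` are assumptions made for this illustration.

```python
# (UCF101, HMDB51) accuracies in percent for the RGB-only Rev2Net rows of Table 5.
TABLE5_RGB = {
    "Rev2Net w/o Rev. RGB Dec.": (93.8, 69.6),
    "Rev2Net w/o Flow Dec.": (93.1, 67.5),
    "Rev2Net w/o DDP": (93.3, 69.7),
    "Rev2Net": (94.6, 71.1),
}


def gains_over(baseline, full="Rev2Net"):
    """Accuracy gain (in points) of the full model over an ablated baseline."""
    ucf_full, hmdb_full = TABLE5_RGB[full]
    ucf_base, hmdb_base = TABLE5_RGB[baseline]
    return round(ucf_full - ucf_base, 1), round(hmdb_full - hmdb_base, 1)


for name in TABLE5_RGB:
    if name != "Rev2Net":
        print(name, gains_over(name))
```

Every ablated variant trails the full model on both datasets, with the largest gap for the variant without the flow decoder.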

7 Conclusions

In this paper, we designed two experiments to investigate the feature properties of mainstream video action recognition models, including 2D CNNs and 3D CNNs. First, we used the frame shuffle experiment to study how much these models rely on motion dependencies. Second, we used the video action domain adaptation experiment to evaluate the generalization ability of these models. We found that taking optical flow data as inputs may harm generalization. We therefore proposed a new network architecture based on 3D CNNs for robust and transferable video action recognition. We reversed the roles of the optical flow data and the RGB data, using them as the supervision for two correlated decoders, and accordingly named our model Reversed Two-Stream Networks (Rev2Net). Furthermore, we proposed a decoding discrepancy penalty (DDP) as a training approach tailored to the Rev2Net architecture. DDP mitigates the feature learning discrepancy between the two decoder streams, forcing them to benefit the classification task consistently and adaptively. Finally, we showed that our Rev2Net model trained with the DDP objective outperforms comparably pre-trained state-of-the-art methods on three datasets: UCF101, HMDB51, and Kinetics. We also showed that our model is more robust in both the shuffle experiments and the cross-dataset video domain adaptation experiments.