Learning a robust video representation is the foundation of action recognition. It goes beyond image classification and depends on jointly modeling spatial and temporal cues. Two deep learning architectures dominate this area. The first captures video appearance patterns with standard CNNs [15, 21, 25, 9, 10] and feeds optical flow data into a second CNN stream to capture motion cues [20, 33, 7, 35, 30]. This standard 2D CNN framework is not specifically designed for videos and cannot fully exploit spatiotemporal features. According to Sevilla-Lara et al., the high action recognition performance of these two-stream 2D CNN models mostly relies on the appearance invariance of the optical flow data. The other influential architecture is the 3D CNN [26, 11, 27], which contains 3D convolutions and 3D pooling layers that operate over space and time simultaneously and thus realizes unified spatiotemporal modeling. Some more recent methods [1, 17, 32] combine 3D CNNs with the above two-stream architecture and achieve state-of-the-art results.
We believe that a good action recognition model needs to be robust to temporal disorder, since in real-time applications video frames can arrive out of order. It should be able to adaptively base its decisions on either the long-term or the short-term motion cues. We also believe that good video representations need to be more transferable: they should still support easy recognition when transferred from the source domain to the target domain. Say we train a model to recognize the action category of playing basketball using training videos in which people play on outdoor courts. It would be unreasonable if the model could not recognize playing basketball on indoor courts. However, most previous work did not focus on improving the robustness of video action recognition models. In this paper, we first show that two-stream 2D CNNs and 3D CNNs break down easily under only limited disturbance of the video frame dependencies. We then introduce a video action domain adaptation experiment to show that optical flow features are NOT transferable across datasets, which severely affects the generalization ability of existing action recognition models.
From the above experiments, we find that when the two modalities lead to different decisions, the existing two-stream CNNs may fail to resolve the discrepancy appropriately, even though previous work attempted to address this issue with relatively simple feature fusion methods such as concatenation, convolution, or bilinear pooling [33, 29]. We conjecture that such a discrepancy may deteriorate the robustness of action recognition models.
To solve this problem, we propose a new neural network for video action recognition, named the Reversed Two-Stream Networks (Rev2Net), as shown in Figure 1(d). We show that, compared with previous methods, this architecture learns more generalized and robust video features. More specifically, Rev2Net consists of one encoder stream with RGB inputs, one classifier for action recognition, one decoder stream for optical flow prediction, and another decoder stream for reversed RGB frame reconstruction. The decoder streams, in a reversed form of the original two-stream networks, use different modalities of video frames as self-supervision signals. During training, the encoded video features diverge into three network branches. At test time, in contrast, Rev2Net does not need pre-computed optical flow data as inputs and makes decisions purely from raw RGB frames. The auxiliary self-supervised tasks cannot be expected to help the final task all the time, and they may sometimes pull the encoder representations away from the recognition task. We therefore penalize the feature space discrepancy between the two decoder streams. This penalty enables adaptive learning between the two prediction tasks. In other words, the two decoder streams can interact with each other through backpropagation and may reach an agreement on which signal (the optical flows or the reversed RGB images) provides more useful information for the encoder feature learning.
Technically, this paper has three contributions: (1) We analyze the robustness of mainstream video action recognition models to disturbance of motion dependencies and variations of video environments, by conducting long-term and short-term frame shuffle experiments as well as video action domain adaptation experiments. (2) We present a novel deep network called Rev2Net with better generalization ability (more transferable features) using self-supervised multi-task learning. The most distinctive feature of Rev2Net is the decoding discrepancy penalty (DDP). It enables adaptive, collaborative multi-task learning between different decoder streams by penalizing their disagreement in the deep feature space. (3) Rev2Net does not require pre-computed optical flows at test time. It achieves competitive results on classic action recognition tasks: 94.6% for UCF-101, 71.1% for HMDB-51 and 73.3% for Kinetics, even better than some state-of-the-art methods that take optical flows as inputs.
2 Related Work
Ever since CNNs made a great impact on image classification, many researchers have tried to reuse them for video action recognition. According to an early neurological study [13], the motion cue is the primary reason that humans can recognize a range of actions. To capture the motion patterns in spatiotemporal data, researchers have explored various deep network architectures, including trying different connectivity methods in 2D CNNs [14], using non-local networks, proposing 3D CNNs by extending convolutional filters into the time domain [26, 11, 27], using temporally recurrent layers to aggregate features across longer video inputs [4, 33], and training a two-stream ensemble network with a second CNN stream fed with pre-computed optical flow frames. Among all these architectures, the two-stream networks with optical flow inputs and the 3D or “pseudo” 3D CNNs have been most widely explored.
The two-stream networks were first introduced to deep video classification by Simonyan et al. They model short temporal snapshots of videos by averaging the predictions from a single RGB frame and a stack of computed optical flow frames. Noticing that RGB images could not fully exploit temporal cues, they extracted the optical flow from consecutive video frames and took it as a network input. The optical flow stream brought a significant performance gain. Since then, two-stream networks have been widely employed by many video action recognition models [2, 33, 7, 30, 29], including some 3D CNN models [1, 17, 32].
The 3D convolutions have been explored more than once [26, 11, 27]. A recent boom started with C3D, a 3D version of VGGNet that contains 3D convolutions and 3D pooling operating over space and time simultaneously. Intuitively, C3D realizes a unified modeling of spatiotemporal features. As a side effect, the 3D convolutions inevitably increase the number of network parameters, making C3D hard to train. To solve this problem, Carreira et al. [1] proposed the Inflated 3D ConvNets (I3D). As its name suggests, I3D inflates 2D convolutional filters into 3D, so that the 3D models are implicitly pre-trained on ImageNet. Other recent designs focus on reducing the memory footprint and easing the training of 3D CNNs. P3D and S3D both replace the 3D convolutional filter with a spatial convolution plus a temporal convolution. T3D [3] uses a 3D DenseNet with multi-depth temporal pooling layers to obtain a smaller model footprint than I3D. These 3D CNN models achieve state-of-the-art performance, especially when incorporated into the two-stream architecture.
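The inflation step can be illustrated with a minimal NumPy sketch. It assumes the common recipe of repeating a pre-trained 2D kernel along a new temporal axis and rescaling by the temporal depth, so that a video of identical frames yields the same response as the 2D filter on a single frame. The function name `inflate_2d_kernel` and the `(out_ch, in_ch, kh, kw)` layout are illustrative choices, not taken from the I3D code.

```python
import numpy as np

def inflate_2d_kernel(kernel_2d: np.ndarray, time_depth: int) -> np.ndarray:
    """Inflate a 2D kernel (out_ch, in_ch, kh, kw) to 3D (out_ch, in_ch, t, kh, kw)."""
    if time_depth < 1:
        raise ValueError("time_depth must be a positive integer")
    # Copy the 2D weights along a new temporal axis.
    kernel_3d = np.repeat(kernel_2d[:, :, np.newaxis, :, :], time_depth, axis=2)
    # Rescale so that summing over time recovers the original 2D weights.
    return kernel_3d / time_depth

rng = np.random.default_rng(0)
w2d = rng.standard_normal((4, 3, 3, 3))      # hypothetical pre-trained 2D filter bank
w3d = inflate_2d_kernel(w2d, time_depth=5)   # inflated 3D filter bank
```

Summing `w3d` over its temporal axis returns `w2d`, which is why a static clip produces the same activation as the original image model.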
Action Recognition Models with Optical Flow Synthesis.
The optical flow is crucial for the performance of two-stream networks. To make better use of optical flow frames, some recent methods [19, 36, 23] went beyond taking them as network inputs. They showed that training deep networks to generate optical flows, e.g. FlowNet [5] and SpyNet, could improve the recognition performance. Sevilla-Lara et al. also tried to interpret the correlations between the optical flow and action recognition results, regarding the CNN model as a “black box”. They conducted an interesting experiment by randomly shuffling the flow frames before feeding them into a 2D CNN model called TSN. They argued that the power of optical flow mainly came from its invariance to the frame appearance rather than from its capability to model long-term motion cues. This is potentially true for these 2D CNNs, since most of them are ensemble models across different sampling time stamps.
Another way to use optical flow, previously proposed by ActionFlowNet [12], is to estimate optical flow with 3D CNNs. It improved action classification accuracy with the help of motion information and did not need much optical flow data as inputs. However, predicting the optical flow alone does not have a consistently positive effect on learning good features, so it can hardly improve the robustness and the generalization ability of the model.
3 Analysis of Robustness to Frames Disorder
Modeling temporal dynamics is essential for video classification performance. However, in some real-time applications, video frames may arrive out of order. Thus, there is a dilemma in video action recognition. On one hand, we want our model to extract useful motion cues from the video frame dependencies, which is the most distinct difference from the image classification problem. On the other hand, too much concentration on coherent temporal information makes the model less robust to disordered frames in some real-time recognition settings. An ideal resolution to this dilemma is to learn a model that adaptively bases its decisions on the short-term or the long-term video frame dependency. Correspondingly, the ideal situation is that actions can be distinguished as long as either the short-term or the long-term video order is correct. We do not discuss how to recognize totally disordered frames, as such situations are not very meaningful in the realm of video analysis.
In this section, we analyze the robustness of the existing mainstream video action recognition models using frame shuffle experiments. In addition, it remains unclear whether long-term or short-term frame dependencies are successfully modeled by 3D CNNs. We extend the work of Sevilla-Lara et al. from 2D CNNs to 3D CNNs and conduct frame shuffle experiments on the state-of-the-art I3D model [1]. We train this model with correctly ordered frames, evaluate it with disordered frames, and measure the resulting drop in accuracy. Three shuffle schemes are used: a long-term shuffle, a short-term shuffle and a complete shuffle. Concretely, we organize frame sequences into frame blocks by putting consecutive frames in each block. The long-term shuffle scheme keeps the order inside each block and rearranges the blocks in a random order. This maintains the short-term dependency but breaks the long-term dependency. The short-term shuffle scheme behaves in the opposite way. It shuffles the frames within each block but keeps the order of the blocks, which breaks the short-term dependency while maintaining the long-term dependency. The complete shuffle scheme randomizes the order of all frames.
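The three shuffle schemes can be sketched as index permutations over a clip. This is a minimal illustration, assuming a fixed block size and a clip length divisible by it; the block size used in the experiments is not stated here, and all function names are hypothetical.

```python
import numpy as np

def _blocks(num_frames: int, block_size: int) -> list:
    """Split frame indices 0..num_frames-1 into consecutive blocks."""
    if block_size < 1:
        raise ValueError("block_size must be a positive integer")
    return [np.arange(start, min(start + block_size, num_frames))
            for start in range(0, num_frames, block_size)]

def long_term_shuffle(num_frames: int, block_size: int, rng) -> np.ndarray:
    """Keep the order inside each block, permute the order of the blocks."""
    blocks = _blocks(num_frames, block_size)
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[i] for i in order])

def short_term_shuffle(num_frames: int, block_size: int, rng) -> np.ndarray:
    """Permute the frames inside each block, keep the order of the blocks."""
    return np.concatenate([rng.permutation(b) for b in _blocks(num_frames, block_size)])

def complete_shuffle(num_frames: int, rng) -> np.ndarray:
    """Permute all frame indices, destroying both kinds of dependency."""
    return rng.permutation(num_frames)

rng = np.random.default_rng(0)
lt = long_term_shuffle(16, 4, rng)
st = short_term_shuffle(16, 4, rng)
cs = complete_shuffle(16, rng)
```

Given a clip stored as an array with time on the first axis, `clip[lt]`, `clip[st]` and `clip[cs]` would give the three disordered test inputs.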
|Model||Input Modalities||No Shuffle||Long-Term Shuffle||Short-Term Shuffle||Complete Shuffle|
|TSN||Flow||86.85||-||78.64||59.55|
As shown in Table 1, we can see that: (1) Under the complete shuffle scheme, the performance of the flow network decreases to a greater extent than that of the RGB network, indicating that the I3D flow network depends more on some underlying motion cues than the I3D RGB network. On the other hand, this also means that the flow network is less robust to the complete shuffle. (2) The other two shuffle experiments show that modeling long-term and short-term motion are almost equally important to the performance of I3D. From the perspective of the dilemma mentioned above, however, the 3D CNNs deteriorate easily when either long-term or short-term disorder exists. Such an action recognition algorithm may be fragile in real-time applications. For a complete comparison, we borrow the TSN shuffle experiment results from Sevilla-Lara et al., which are shown in the third line of Table 1. Here, we do not discuss the long-term shuffle test, because the final classification score of TSN is averaged over several short snippets sampled from the whole video clip and is not sensitive to their order. Apparently, the 2D CNN model cannot make full use of long-term motion cues. From another perspective of the dilemma mentioned above, it is not an ideal algorithm either, as it may perform worse in a fully controlled environment where the order of the arriving frames can be guaranteed.
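The insensitivity of a snippet-averaging model to long-term order follows directly from the averaging step. The sketch below assumes a simple mean over per-snippet class scores as the consensus function; the array shapes and the name `consensus_score` are illustrative only.

```python
import numpy as np

def consensus_score(snippet_scores: np.ndarray) -> np.ndarray:
    """Average per-snippet class scores of shape (num_snippets, num_classes)."""
    return snippet_scores.mean(axis=0)

rng = np.random.default_rng(0)
scores = rng.random((3, 5))                  # 3 snippets, 5 hypothetical classes
reordered = scores[rng.permutation(3)]       # same snippets, different order
same_prediction = np.allclose(consensus_score(scores), consensus_score(reordered))
```

Because the mean is permutation invariant, reordering whole snippets cannot change the video-level score, which is why only the short-term and complete shuffles are informative for such a model.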
4 Analysis of Generalization Ability
Consider that we train a model to recognize the action of playing basketball with videos in which people do this in outdoor courts. One may wonder whether this model can easily recognize the action of playing basketball in indoor courts. We believe that good features for action recognition should be easily transferred from one dataset to another.
|Model||Input Modalities||H → U||U → H|
|TSN||RGB + Flow||60.9||40.3|
|I3D||RGB + Flow||57.9||41.5|
Video Action Domain Adaptation.
Domain adaptation is an effective way to verify the generalization ability of video action recognition models. Its purpose is to learn a model from the distribution of the source domain and to apply this model to a target domain with a different (but related) distribution. As video domain adaptation is under-explored by previous work using neural networks, we select related categories from UCF101 and HMDB51. Note that these two datasets have diverse data patterns: HMDB51 is mostly collected from movies, while UCF101 is collected from YouTube and appears to be closer to real life. Even for the same action category, the video appearance of these two datasets is quite different, e.g. in scene complexity and camera angles, as shown in Figure 2. The distance between the distributions of videos from the source dataset and the target dataset causes the main difficulty for action knowledge transfer. To evaluate the maximum transferable ability of the mainstream video action recognition models (TSN for 2D CNNs and I3D for 3D CNNs), we incorporate a domain adaptation method, DANN [8], which was originally proposed for image transfer learning, into these models. DANN reduces the distribution distance by training the feature extractor adversarially against a domain classifier through a gradient reversal layer, so that the learned features become hard to distinguish across domains. We train TSN and I3D models with labels from the source dataset, and evaluate them on the target dataset.
Table 2 shows the cross-dataset results of TSN and I3D with one or two input data modalities. When taking RGB frames as inputs, the 2D CNNs and 3D CNNs have comparable results, even though 2D CNNs are incapable of learning motion dependencies from consecutive frames. Surprisingly, for both TSN and I3D, the network streams with RGB inputs outperform the optical flow streams in the domain adaptation settings. Consequently, the overall recognition accuracy of the two-stream networks shows no further improvement over the one-stream RGB network. This observation contradicts our expectations about the two-stream architecture on the classic video classification task. Under the framework of the traditional two-stream networks that take optical flows as inputs, the only way to improve the overall cross-dataset performance is to improve the generalization ability of the optical flow stream. In this paper, we do not intend to analyze the underlying reasons for these observations. We only conclude that neither 2D CNNs nor 3D CNNs show great generalization ability with optical flow inputs. We conjecture, however, that optical flow inputs may inherently have less transferable features across datasets, so using them as inputs may not be an appropriate way to improve the overall cross-dataset performance. Later, we will show that our proposed Rev2Net model improves the performance in the same cross-dataset settings (as shown in Table 4).
Decision Discrepancy in Two-Stream Networks.
The less transferable optical flow features affect the recognition accuracy of two-stream 3D CNNs across diverse video environments. The large performance gap between the RGB stream and the optical flow stream implies a large discrepancy between the two network streams. In classic video recognition settings, the optical flow stream achieves high performance, which masks this disadvantage. Thus, the final ensemble model can easily obtain an accuracy boost by simply averaging the outputs of the two streams. In the cross-dataset settings, however, the discrepancy may deteriorate the robustness and the generalization ability of the entire two-stream model. We may conclude that, from the perspective of robustness, the optical flows are not used appropriately.
A Brief Summary of the Robustness Analysis Regarding Existing Models.
As we can see from the above, first, both TSN and I3D may break down when either the long-term or the short-term frame order is disrupted, showing that they may not be robust in real-time applications in which the arriving frames can be disordered. Second, in the domain adaptation experiments, both TSN and I3D suffer from severe performance degradation. Existing video action recognition models do not consider cross-domain robustness to diverse video environments. Third, the two network streams in 3D CNNs are not ideally complementary. Through the domain adaptation experiments, we find that the two-stream network architecture does not benefit the cross-dataset recognition results. The overall architecture lacks the ability to overcome the discrepancy between the optical flow stream and the RGB stream.
5 Reversed Two-Stream Networks
In this section, we first present a model called Reversed Two-Stream Networks (Rev2Net) for robust action recognition. We then discuss a corresponding method to mitigate the discrepancy of the two decoder streams in Rev2Net.
5.1 Reversing Input Signals as Self-supervisions
As mentioned above, the two-stream network architecture is widely used in video action recognition. Recent 3D CNN models also adopt this structure to improve their performance. The two network streams extract spatiotemporal representations from the RGB frames and the optical flows, respectively. One problem of this two-stream design is that it lacks the capability to resolve the discrepancy between the results obtained from the two input modalities. This problem is not severe in classic action recognition settings, but it is severe in our preliminary experiments described in the above two sections. As the optical flow stream shows limited generalization ability in cross-dataset settings, it severely affects the overall performance of the two-stream model. In our settings, using optical flows as inputs is not an ideal approach for learning robust video features.
Our idea is to design a robust neural network with better generalization ability by reversing both the optical flow inputs and the RGB inputs into self-supervision signals, and to train this Reversed Two-Stream Network (Rev2Net) in a multi-task framework. A schematic of Rev2Net is shown in Figure 3. Rev2Net has four components: one encoder stream with RGB inputs, one classifier for action recognition, one decoder stream for optical flow prediction, and another decoder stream for reversed RGB frame reconstruction. The encoder operates on consecutive video frames. Along with the classifier, it has the same architecture as I3D. The two decoder streams are composed of 3D transpose convolutions. The flow decoder aims to emphasize learning the short-term motion features as well as the foreground appearance features by using the corresponding flow fields as supervision. The frame decoder, on the other hand, aims to emphasize learning the long-term motion features by reconstructing the input frames in reversed order from the encoded 3D feature volume, which can be viewed as an information bottleneck that forces the model to learn high-level video representations. These two decoder streams are only used at training time and are removed at test time.
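The two self-supervision targets can be sketched as follows. This is only an illustration of how the training signals relate to an input clip: the flow target is assumed to be a pre-computed array with one field per consecutive frame pair, the frame target is the input clip reversed along the time axis, and the name `make_self_supervision_targets` and the array layouts are hypothetical.

```python
import numpy as np

def make_self_supervision_targets(clip: np.ndarray, flow: np.ndarray) -> dict:
    """Build decoder targets from an RGB clip (T, H, W, 3) and its flow (T-1, H, W, 2)."""
    if flow.shape[0] != clip.shape[0] - 1:
        raise ValueError("expected one flow field per consecutive frame pair")
    return {
        "flow_target": flow,                 # supervises the flow decoder
        "frame_target": clip[::-1].copy(),   # input frames in reversed temporal order
    }

rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16, 3))            # toy clip with 8 frames
flow = rng.random((7, 16, 16, 2))            # toy pre-computed flow fields
targets = make_self_supervision_targets(clip, flow)
```

At test time neither target is needed, since only the encoder and the classifier are kept.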
The two decoder streams, which emphasize short-term and long-term frame dependencies respectively, allow the Rev2Net model to capture short-term and long-term motion cues adaptively, thus improving its robustness to frame disorder in real-time applications where the order of the arriving frames cannot be guaranteed. In addition, by using optical flow as a supervision signal, Rev2Net makes decisions without optical flow inputs. In other words, it is not disturbed by optical flows in the video action domain adaptation settings, where the optical flow network behaves poorly. The encoder stream in Rev2Net can nevertheless learn useful information from the flow decoder at training time. We believe this is a more appropriate way to incorporate the merits of optical flow while avoiding its shortcomings. The empirical results show that the Rev2Net architecture has better robustness and generalization ability.
5.2 Decoding Discrepancy Penalty
As we have seen, traditional two-stream networks suffer from the decision discrepancy between the two network streams in cross-dataset settings. Our Rev2Net model avoids the decision discrepancy by employing only one RGB encoder and reversing the optical flow inputs as well as the RGB inputs into training supervision, but it has to confront the discrepancy between the feature learning effects of the two decoders. Although the optical flow prediction task and the reversed frame reconstruction task force the encoder to focus on different parts of the input frames, they may not have uniformly positive and complementary effects on training the encoder network. To this end, we penalize the feature space distance between the two decoder streams by defining a new objective function called the decoding discrepancy penalty (DDP). We propose two forms of DDP and their corresponding network designs, and explore the effects of applying the two decoder streams with DDP to the low-level or high-level video representations.
Frobenius Norm and Low-Level Features.
As shown in Figure 3, we build the two decoder streams on the low-level feature maps of the encoder. In other words, both the decoders and the encoder have only a few convolutional layers. Correspondingly, we use the Frobenius norm to penalize the distance between the decoders in the low-level feature space:

$$\mathcal{L}_{\text{DDP}} = \sum_{l \in \mathcal{S}} \left\| F^{\text{flow}}_{l} - F^{\text{frame}}_{l} \right\|_{F},$$

where $\|\cdot\|_{F}$ is the Frobenius norm, $\mathcal{S}$ is the set of convolutional layers included in the DDP, and $F^{\text{flow}}_{l}$ and $F^{\text{frame}}_{l}$ are the low-level feature maps of the optical flow decoder stream and the reversed frame decoder stream at layer $l$.
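A minimal NumPy sketch of such a Frobenius-norm penalty is given below. It assumes that the paired decoder feature maps have identical shapes at each penalized layer and that the per-layer norms are simply summed without squaring or weighting; both points are assumptions, and `ddp_frobenius` is a hypothetical name.

```python
import numpy as np

def ddp_frobenius(flow_feats: list, frame_feats: list) -> float:
    """Sum over layers of the Frobenius norm of the feature-map difference."""
    if len(flow_feats) != len(frame_feats):
        raise ValueError("both decoders must contribute the same number of layers")
    total = 0.0
    for f, g in zip(flow_feats, frame_feats):
        if f.shape != g.shape:
            raise ValueError("paired feature maps must have the same shape")
        # Frobenius norm: square root of the sum of squared entries.
        total += float(np.sqrt(np.sum((f - g) ** 2)))
    return total

zeros = [np.zeros((2, 2, 2))]
ones = [np.ones((2, 2, 2))]
penalty = ddp_frobenius(zeros, ones)         # sqrt(8) for eight unit differences
```

The penalty is zero exactly when the two decoders produce identical feature maps at every penalized layer, and it grows as their features drift apart.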
Network details can be found in Figure 3. There are two TransposeConv3D layers in the flow decoder, and three in the frame decoder plus a Conv3D layer for generating the background, which is not shown specifically. These decoders are plugged into the original I3D network on top of the Conv3d_2c layer and take its feature volume as their input. Intuitively, we do not want the decoders to be too strong, since training a good feature encoder may require relatively weak decoders.
KL Divergence and High-Level Features.
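We can also mitigate the high-level feature learning discrepancy by allocating more layers to the encoder network and its reversed decoder counterparts. In this method, the encoder has three groups of outputs: the features that are the inputs of the classifier, the mean $\mu_{\text{flow}}$ and the variance $\sigma^{2}_{\text{flow}}$ of the Gaussian distribution $\mathcal{N}(\mu_{\text{flow}}, \sigma^{2}_{\text{flow}})$ that are used for optical flow prediction, and the mean $\mu_{\text{frame}}$ and the variance $\sigma^{2}_{\text{frame}}$ of the Gaussian distribution $\mathcal{N}(\mu_{\text{frame}}, \sigma^{2}_{\text{frame}})$ that are used for reversed frame reconstruction. We propose to apply KL divergence to the high-level, low-dimensional outputs of the encoder, i.e. $(\mu_{\text{flow}}, \sigma^{2}_{\text{flow}})$ and $(\mu_{\text{frame}}, \sigma^{2}_{\text{frame}})$, to close the distance between the two Gaussian distributions:

$$\mathcal{L}_{\text{DDP}} = D_{\text{KL}}\!\left( \mathcal{N}(\mu_{\text{flow}}, \sigma^{2}_{\text{flow}}) \,\|\, \mathcal{N}(\mu_{\text{frame}}, \sigma^{2}_{\text{frame}}) \right).$$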
Along with the optical flow prediction objective function or the reversed frames reconstruction objective function, the decoders can be trained in a similar way as variational autoencoders. By applying DDP to the overall loss function in the training process, including either the Frobenius norm or the KL divergence, we allow the two decoder streams to negotiate and collaborate with each other to improve the consistency of their effects on learning a better encoder.
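For diagonal Gaussians, the KL divergence has a closed form, which the following sketch evaluates. It assumes independent dimensions and a particular direction of the divergence (first distribution relative to the second); neither the direction nor the parameterization used in the paper is known here, and `kl_diag_gaussians` is a hypothetical name.

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q) -> float:
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) in closed form."""
    mu_p, var_p, mu_q, var_q = (np.asarray(a, dtype=float)
                                for a in (mu_p, var_p, mu_q, var_q))
    # Per-dimension term: log(var_q/var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1
    per_dim = np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    return float(0.5 * np.sum(per_dim))

kl_same = kl_diag_gaussians([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
kl_shift = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])   # unit mean shift, unit variance
```

Note that the KL divergence is asymmetric, so swapping the two decoder distributions generally changes the value of the penalty.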
|Model||Input Modality||No Shuffle||Long-term Shuffle||Short-Term Shuffle||Complete Shuffle|
5.3 Final Objectives
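The final Rev2Net architecture with the Frobenius norm DDP is shown in Figure 3. The three tasks, i.e. action recognition, flow prediction and frame reconstruction, along with their four loss terms including the DDP loss, can be jointly trained as follows:

$$\mathcal{L} = \mathcal{L}_{\text{cls}}(\hat{y}, y) + \lambda_{\text{flow}}\, \mathcal{L}_{\text{flow}}(\hat{O}, O) + \lambda_{\text{frame}}\, \mathcal{L}_{\text{frame}}(\hat{X}, X) + \mathcal{L}_{\text{DDP}},$$

where $\mathcal{L}_{\text{cls}}$ is the cross-entropy loss between the softmax output of the classifier $\hat{y}$ and the ground truth action label $y$, and $\mathcal{L}_{\text{flow}}$ and $\mathcal{L}_{\text{frame}}$ denote the flow prediction and frame reconstruction losses. $\hat{O}$ are the generated optical flows, and $O$ are the target optical flows, which are pre-computed using TV-L1. $\lambda_{\text{flow}}$ is the loss weight for the optical flow prediction task. Similarly, $\hat{X}$ are the generated frames, and $X$ are the real input frames. $\lambda_{\text{frame}}$ is the loss weight for the frame reconstruction task.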
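How the four loss terms combine can be sketched in NumPy. The sketch assumes mean squared error for both the flow prediction and the frame reconstruction terms and an unweighted DDP term; the actual distance functions and weights are not recoverable from the text above, so these choices and all names are illustrative.

```python
import numpy as np

def softmax_cross_entropy(logits: np.ndarray, label: int) -> float:
    """Cross-entropy between softmax(logits) and a one-hot label."""
    shifted = logits - logits.max()                     # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def total_loss(logits, label, flow_pred, flow_target, frame_pred, frame_target,
               ddp, w_flow=1.0, w_frame=1.0) -> float:
    """Classification loss plus weighted decoder losses plus the DDP term."""
    cls = softmax_cross_entropy(logits, label)
    flow = float(np.mean((flow_pred - flow_target) ** 2))      # assumed MSE
    frame = float(np.mean((frame_pred - frame_target) ** 2))   # assumed MSE
    return cls + w_flow * flow + w_frame * frame + ddp

flow = np.ones((7, 4, 4, 2))
frames = np.ones((8, 4, 4, 3))
loss = total_loss(np.zeros(4), 2, flow, flow, frames, frames, ddp=0.5)
```

With uniform logits over four classes and perfect decoder outputs, the loss reduces to the cross-entropy `log(4)` plus the DDP value.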
6 Experiments
This section is organized in accordance with the progress of our work. First, we introduce the datasets. Next, to demonstrate the generalization and transfer ability of Rev2Net, we present the cross-dataset experiments and the shuffle experiments. Finally, we compare its performance with state-of-the-art action recognition approaches. In the previous section, we discussed both the Frobenius norm and the KL divergence for the DDP loss. We evaluate the former in the experiments and leave the latter for future work.
6.1 Datasets
We train and evaluate the Rev2Net network on three standard datasets: (1) UCF101 is mainly composed of YouTube videos. It has annotated video snippets from 101 action categories. Each snippet lasts - seconds and consists of - frames. (2) HMDB51 contains video clips collected mostly from movies and covers 51 action categories. The two datasets are quite different in appearance. We adopt the standard training and evaluation protocols on both of them. (3) Kinetics contains annotated videos from 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10 seconds and is taken from a different YouTube video. The actions are human focused and cover a broad range of classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands.
6.2 Robustness: Long-Term or Short-Term
Table 3 compares the results of our proposed Rev2Net model and the I3D model [1] in the UCF101 frame shuffle experiments. More preliminary results can be found in Table 1. Rev2Net remains robust as long as either the long-term dependency or the short-term dependency is preserved, indicating that it is more robust to temporal variations. When all frames are shuffled completely, the performance of Rev2Net decreases, as expected, because it depends on the long-term or short-term frame order. This distinguishes it from an image classification model, which would be fully robust to all three shuffling schemes.
6.3 Robustness: Cross-dataset Experiments
Under the framework of the traditional two-stream networks, the only way to improve the overall cross-dataset performance is to improve the generalization ability of the optical flow stream. As shown above, however, two-stream networks cannot perform well in cross-dataset experiments because the optical flow features are hard to transfer. The Rev2Net model resolves this problem in a distinct way. As we can see in Table 4, Rev2Net outperforms both the two-stream TSN and I3D models in this task, showing better generalization ability. The results also show the effectiveness of reversing optical flow data into self-supervision, and of exploiting the decoding discrepancy penalty to bridge the features learned by the optical flow decoder and the reversed frame decoder.
|Model||Input Modality||H → U||U → H|
|Model||Input Modality||UCF101||HMDB51|
|CNN + LSTM||RGB||68.2||-|
|Spatial TDD||RGB||82.8||50.0|
|TSN||R + F||94.0||69.4|
|Sun-OFF||R + F||96.0||74.2|
|I3D||R + F||93.4||66.4|
|I3D*||R + F||98.0||80.7|
|Rev2Net w/o Rev. RGB Dec.||RGB||93.8||69.6|
|Rev2Net w/o Flow Dec.||RGB||93.1||67.5|
|Rev2Net w/o DDP||RGB||93.3||69.7|
|Rev2Net||R + F||97.1||78.0|
|Model||Input Modality||Kinetics|
|I3D||RGB + Flow||74.2|
|Rev2Net w/o DDP||RGB||72.7|
|Rev2Net w/o DDP||RGB + Flow||74.1|
|Rev2Net||RGB + Flow||74.8|
6.4 Final Results
Comparison with the State-of-the-Art.
First, we compare the performance of our final architecture with the state-of-the-art approaches on the UCF101 and HMDB51 datasets. To include more methods in the comparison, we use an ImageNet pre-trained model as our base network, which is the same as that adopted by the compared models. Results are shown in Table 5. The Rev2Net model outperforms all compared models on both UCF101 and HMDB51. Initialized with the same pre-trained model, Rev2Net with only RGB inputs achieves even better results than the two-stream I3D model with both RGB and optical flow inputs (94.6% vs. 93.4%). Furthermore, Rev2Net with only RGB inputs has a smaller memory footprint at test time.
On the large-scale Kinetics dataset, our model also achieves competitive results, as shown in Table 6. With the same type of input data, Rev2Net consistently outperforms I3D. It is also worth noting that, at test time, our network has the same architecture as the I3D model. Thus, it requires no extra computation or memory compared with I3D when predicting the action categories at test time.
We should clarify that we acknowledge the value of optical flow inputs on the classic action recognition task. Adding a flow stream, as in the traditional two-stream models, indeed helps the classification performance on all three datasets, so long as the two-stream models are trained and evaluated on the same dataset. There is no conflict between these results and the previous cross-dataset results. Still, our method with RGB inputs not only achieves satisfying performance on a single dataset, but also shows more robustness and better generalization ability across datasets.
We evaluate different Rev2Net models that are trained with two types of objective functions and three types of architectures for the decoder streams. Results in Table 5 suggest that both the flow decoder for the optical flow prediction task and the frame decoder for the frame reconstruction task contribute to the recognition performance. Also, a Rev2Net model trained without the decoding discrepancy penalty (DDP) performs even worse than a Rev2Net model with only the frame decoder stream. This result indicates that a single decoder stream in Rev2Net may not always perform well if we do not apply the DDP to restrict the feature learning discrepancy between the two decoder streams. The final Rev2Net (RGB) trained with DDP outperforms both Rev2Net trained without DDP and its one-decoder counterparts.
7 Conclusion
In this paper, we designed two experiments to investigate the feature properties of the mainstream video action recognition models, including 2D CNNs and 3D CNNs. First, we used the frame shuffle experiment to study how much these models rely on motion dependencies. Second, we used the video action domain adaptation experiment to evaluate the generalization ability of these models. We found that taking optical flow data as inputs may deteriorate the generalization ability. Thus, we proposed a new network architecture based on 3D CNNs for robust and transferable video action recognition. We reversed the optical flow data and the RGB data into the supervision signals of two correlated decoders, and therefore named our model Reversed Two-Stream Networks (Rev2Net). Furthermore, we proposed a decoding discrepancy penalty (DDP) as a corresponding training approach for the Rev2Net architecture. DDP is used to mitigate the feature learning discrepancy between the two decoder streams, forcing them to benefit the classification task consistently and adaptively. Finally, we showed that our Rev2Net model trained with the DDP objective outperforms the state-of-the-art methods on three datasets: UCF101, HMDB51, and Kinetics. We also showed that our model is more robust in both the shuffle experiments and the cross-dataset video domain adaptation experiments.
-  J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 4724–4733. IEEE, 2017.
-  G. Chéron, I. Laptev, and C. Schmid. P-cnn: Pose-based cnn features for action recognition. In CVPR, pages 3218–3226, 2015.
-  A. Diba, M. Fayyaz, V. Sharma, A. H. Karami, M. Mahdi Arzani, R. Yousefzadeh, and L. Van Gool. Temporal 3d convnets: New architecture and transfer learning for video classification. In CVPR. IEEE, 2017.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
-  A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In ICCV, pages 2758–2766, 2015.
-  C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, pages 3468–3476, 2016.
-  C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
-  Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013.
-  J. Y.-H. Ng, J. Choi, J. Neumann, and L. S. Davis. Actionflownet: Learning motion representation for action recognition. In WACV, 2018.
-  G. Johansson. Visual perception of biological motion and a model for its analysis. Perception & psychophysics, 14(2):201–211, 1973.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
-  H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In ICCV, pages 2556–2563. IEEE, 2011.
-  Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, pages 5534–5542. IEEE, 2017.
-  A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In CVPR, volume 2, 2017.
-  L. Sevilla-Lara, Y. Liao, F. Guney, V. Jampani, A. Geiger, and M. J. Black. On the integration of optical flow and action recognition. arXiv preprint arXiv:1712.08416, 2017.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
-  S. Sun, Z. Kuang, W. Ouyang, L. Sheng, and W. Zhang. Optical flow guided feature: A fast and robust motion representation for video action recognition. arXiv preprint arXiv:1711.11152, 2017.
-  S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang. Optical flow guided feature: A fast and robust motion representation for video action recognition. In CVPR, June 2018.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
-  G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV, pages 140–153. Springer, 2010.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497. IEEE, 2015.
-  W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. 2017.
-  L. Wang, W. Li, W. Li, and L. Van Gool. Appearance-and-relation networks for video classification. arXiv preprint arXiv:1711.09125, 2017.
-  L. Wang, Y. Xiong, et al. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
-  X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
-  S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851, 2017.
-  J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694–4702, 2015.
-  C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime tv-l1 optical flow. In Joint Pattern Recognition Symposium, pages 214–223. Springer, 2007.
-  B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector cnns. In CVPR, 2016.
-  X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei. Flow-guided feature aggregation for video object detection. arXiv preprint arXiv:1703.10025, 2017.