Two-Stream Oriented Video Super-Resolution for Action Recognition

03/13/2019 ∙ by Haochen Zhang, et al. ∙ USTC

We study the video super-resolution (SR) problem not for visual quality, but for facilitating video analytics tasks, e.g. action recognition. The popular action recognition methods based on convolutional networks, exemplified by two-stream networks, are not directly applicable on videos of different spatial resolutions. This can be remedied by performing video SR prior to recognition, which motivates us to improve the SR procedure for recognition accuracy. Tailored for two-stream action recognition networks, we propose two video SR methods for the spatial and temporal streams respectively. On the one hand, we observe that the added details by image SR methods can be either helpful or harmful for recognition, and we propose an optical-flow guided weighted mean-squared-error loss for our spatial-oriented SR (SoSR) network. On the other hand, we observe that existing video SR methods incur temporal discontinuity between frames, which also worsens the recognition accuracy, and we propose a siamese network for our temporal-oriented SR (ToSR) that emphasizes the temporal continuity between consecutive frames. We perform experiments using two state-of-the-art action recognition networks and two well-known datasets--UCF101 and HMDB51. Results demonstrate the effectiveness of our proposed SoSR and ToSR in improving recognition accuracy.




1 Introduction

In recent years, convolutional neural networks (CNNs) have been applied to the action recognition task and have achieved state-of-the-art performance, surpassing traditional methods. However, their wide application is hindered by video resolution: relatively low resolution makes the task more difficult. Firstly, most of the datasets used for studying action recognition have a fixed resolution, e.g. UCF101 (about 320×240), HMDB51 (about 340×256), Sports-1M (about 640×360) and so on [11]. But the resolution in the real world usually varies among different sources of video capturing, and is inevitably low in scenarios such as surveillance. There are also situations where the region of interest (ROI) is quite small within relatively high-resolution (HR) videos. Secondly, current recognition networks are not scale invariant due to the fully-connected layers in their architecture. In other words, low-resolution (LR) videos cannot be directly fed into these well-trained CNNs.

Re-training a new classifier for LR videos is a straightforward solution. However, several issues limit its feasibility. Firstly, such a method needs well-labeled, large-scale training data of similar quality. Secondly, it is very time consuming and laborious to train a video recognition network well. Moreover, this approach requires training many models for different input resolutions. Another intuitive solution is to simply re-scale the input frames, for example with bicubic interpolation. Going further, super-resolution (SR) is an advanced alternative to simple re-scaling, which can benefit the classifier by adding details to LR inputs. Using SR as preprocessing eliminates the need to retrain classifiers. Previous work has verified that image SR is generally helpful for computer vision tasks when dealing with LR input images. [23] proposes to use some traditional SR methods for LR video action recognition, and [40] investigates a CNN-based SR method for action recognition.

Almost all existing works on image or video SR are concerned with the visual quality of the super-resolved image or video. During training, mean-squared-error (MSE) is extensively used, which corresponds to signal fidelity, i.e. PSNR. As it has been claimed that PSNR is not a good representative of visual quality, perceptual losses have been introduced in addition to MSE. Nonetheless, it is still not clear whether visual quality determines the quality of visual analytics results, e.g. action recognition accuracy. Intuitively, people may assume that visual quality and "recognition quality" are consistent. But we argue that, since the analytics tasks are performed by computers instead of humans, there can be inconsistency. In addition, it has been shown that for inverse problems like SR, even signal fidelity and perceptual quality can be contradictory [1], which further challenges the intuitive assumption.

In this paper, we study the video SR problem not for visual quality but for recognition quality, i.e. we use SR as a preprocessing step before feeding an LR video into a trained action recognition network. Since we focus on SR, we regard action recognition networks as "black boxes" and do not adjust them for LR video. In other words, we want to investigate an SR method whose performance is evaluated by a computer algorithm rather than by a human. This work can also be viewed as bridging low-level and high-level vision tasks.

Oriented to the popular two-stream action recognition framework [26] which learns two separate networks, one for spatial color information and the other for temporal motion information, we propose two video SR methods for these two streams respectively. For the spatial stream which can be regarded as image classification, we observe that the moving object is more related to the recognition and should be paid more attention during SR enhancement. Thus, our Spatial-oriented SR (SoSR) takes weighted mean squared error guided by optical flow as loss to emphasize moving objects. For the temporal stream, we observe that video SR can result in the temporal discontinuity between consecutive video frames which may harm the quality of optical flow and incur drop in recognition accuracy. Thus, in our Temporal-oriented SR (ToSR), we enhance the consecutive frames together to ensure the temporal consistency in a video clip.

Our contributions can be summarized as follows.

  • We investigate SR methods to facilitate action recognition, assuming well-trained two-stream networks as “black boxes.”

  • For the spatial stream, we propose an optical flow guided weighted MSE loss to guide our SoSR to pay more attention to regions with action.

  • For the temporal stream, we propose ToSR which enhances the consecutive frames together to achieve temporal consistency.

To verify the effectiveness of our methods, we perform experiments with two state-of-the-art recognition networks on two widely used datasets–UCF101 [27] and HMDB51 [20]. Comprehensive experimental results show that our SoSR and ToSR indeed improve the recognition accuracy significantly. Especially, our SoSR is implemented upon a single frame SR network, but outperforms advanced multi-frame SR methods on recognition accuracy in the spatial stream; our ToSR obtains an accuracy of 61.24% and 58.73% on the HMDB51 dataset in the temporal stream, using two recognition networks respectively, which are quite close to the performance of HR videos: 62.16% and 59.41%. Our code will be released.

2 Related Work

We review related works from two aspects: action recognition and image/video SR. In both fields, CNNs have become the mainstream and outperform the traditional methods significantly. Thus we only mention several CNN-based approaches that are highly related to our work.

Figure 1: Our proposed two-stream oriented video SR for action recognition.

CNN for action recognition. In CNN-based action recognition, a key problem is how to properly incorporate spatial and temporal information in CNN architectures. Solutions can be divided into three categories: 3D convolution, RNN/LSTM, and two-stream. The 3D CNN, which learns spatio-temporal features, was first presented in [14]. Later on, C3D features and 3D CNN architectures [33, 34, 36, 5] appeared. There were also several works [29, 24, 42] focusing on improvements of 3D CNNs. RNN/LSTM is believed to cope with temporal information better, and thus [6, 38, 37] attempted to incorporate LSTMs to deal with action recognition. The two-stream CNN architecture was first proposed in [26]. This architecture consists of two separate networks, one for exploiting spatial information from individual frames, and the other for using temporal information from optical flow; the outputs of the two networks are then combined by late fusion. Several improvements were presented for two-stream networks [8, 9, 35]. In this paper, we design SR methods specifically for two-stream networks for two reasons. First, the two-stream approach seems to lead to the best performance for action recognition on several benchmarks. Second, 3D convolution and RNN/LSTM networks are not easily decomposed, whereas two-stream networks have a clear decomposition, which facilitates the investigation of SR. Specifically, we use two state-of-the-art methods known as Temporal Segment Network (TSN) [35] and Spatio-Temporal Residual Network (ST-Resnet) [8] in our experiments.

CNN for image SR. Almost all of the existing image SR methods are designed to enhance the visual quality by adding more image details. In earlier years, PSNR was evaluated as a surrogate of visual quality and thus mean-squared-error was extensively used as the loss function [7, 19, 18, 25, 30, 31, 22, 41]. More recently, visual quality is considered directly and several different kinds of loss functions have been proposed, such as perceptual loss [16] and loss defined by a generative adversarial network (GAN) [10]. For example, Ledig et al. [21] proposed SRGAN, which combined GAN loss and perceptual loss. It is also worth noting that PSNR and visual quality can even be contradictory [1].

CNN for video SR. Compared to single image SR, the temporal dimension provides much more information in video SR, and various methods have been proposed to exploit the temporal information. A majority of these methods have an explicit motion compensation module to align different frames. For example, Kappeler et al. [17] slightly modified SRCNN [7] and extracted features from frames that were aligned by optical flow. Caballero et al. [2] proposed an end-to-end SR network to learn motions between input LR frames and generate SR frames in real time. Tao et al. [32] introduced a new sub-pixel motion compensation (SPMC) layer to perform motion compensation and up-sampling jointly. Also several methods try to avoid the explicit motion compensation. For example, Jo et al. [15] proposed a network that used dynamic up-sampling filters. All the aforementioned works are pursuing higher PSNR for video SR. But in this paper, we consider video SR to improve action recognition accuracy. We focus on the loss functions instead of the network structures.

3 Analyses and Methods

Figure 1 depicts the pipeline of using SR for action recognition by two-stream networks. Given an LR video, we split it into frames on which we perform SR enhancement. We propose Spatial stream-oriented SR (SoSR) and Temporal stream-oriented SR (ToSR) for the two streams respectively. In other words, we enhance the LR video twice. We then calculate optical flow from the ToSR resulting video, and feed the optical flow together with frames from the SoSR resulting video into the following recognition module.

3.1 Action Recognition Module

In this paper, our SR methods are specifically designed for two-stream action recognition networks. Two-stream is a popular framework because of its simple but effective structure to incorporate both spatial and temporal information. Specifically, we use TSN [35] and ST-Resnet [8] in our experiments. There are minor differences between the two networks: TSN uses a weighted average of the classification scores predicted from the two streams, while ST-Resnet trains a fusion sub-network together with the two streams in an end-to-end fashion. Nonetheless, we focus on the SR part and we directly use the well-trained models provided by the authors, without any tuning.
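The TSN-style late fusion mentioned above can be illustrated with a minimal sketch. The function names `fuse_scores` and `predict` and the default weights are assumptions chosen for illustration, not values taken from this paper; the sketch only shows how a weighted average of per-class scores from the two streams yields a fused prediction.

```python
def fuse_scores(spatial, temporal, w_spatial=1.0, w_temporal=1.5):
    """Weighted average of per-class scores from the two streams.

    The default weights are illustrative assumptions, not values from the paper.
    """
    if len(spatial) != len(temporal):
        raise ValueError("score vectors must have the same length")
    total = w_spatial + w_temporal
    return [(w_spatial * s + w_temporal * t) / total
            for s, t in zip(spatial, temporal)]


def predict(scores):
    """Index of the class with the highest fused score."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    fused = fuse_scores([0.6, 0.4], [0.2, 0.8])
    print(fused, predict(fused))
```

Because the fusion only touches the output scores, the SR front end for each stream can be swapped without modifying the recognition networks, which matches the black-box setting used here.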

3.2 Spatial-oriented SR

3.2.1 Analysis

According to the two-stream architecture, the spatial stream performs recognition from individual frames by recognizing objects. That is to say, the spatial stream is equivalent to image classification. Inspired by previous work [4], we expect that SR can enhance the LR frames and add more image details, which in turn help recognition. Thus, we experiment with a representative image SR method, namely VDSR [18]. However, we observed counterexamples in experiments. We calculate recognition accuracy for individual classes, and observe that VDSR sometimes performs even worse than simple bicubic interpolation; more interestingly, the original HR frames can be even worse than super-resolved or interpolated frames. Such examples are summarized in Table 1. We distinguish several cases according to the relative accuracy of different methods: HR > VDSR > Bicubic for case (a), Bicubic > HR > VDSR for case (b), and VDSR > HR > Bicubic for case (c). LR frames lose details compared to HR frames, bicubic interpolation simply up-scales frames without adding details, and SR methods usually enhance interpolated frames with many more image details. We therefore conjecture that image details can be either helpful or harmful for action recognition, especially in specific classes.

                          Recognition Accuracy (%)
Case  Class               HR      Bicubic  VDSR
a     Archery             82.93   36.59    70.73
a     PlayingFlute        97.92   72.92    79.17
b     JumpRope            39.47   42.11    7.89
b     SalsaSpin           79.07   83.72    53.49
c     FrontCrawl          64.86   32.43    78.38
c     HandstandWalking    35.29   29.41    41.18
Table 1: We observed different cases in recognition accuracy for the classes in UCF101 using the TSN network. In case (a), HR > VDSR > Bicubic. In case (b), Bicubic > HR > VDSR. In case (c), VDSR > HR > Bicubic. Some representative classes are shown. See Figure 2 for visual inspection.

In Figure 2, we visually analyze some frames to confirm our conjecture. In (a), which corresponds to HR > VDSR > Bicubic, we indeed observe that many details about the bow and arrow are present in the HR frame but missing in the Bicubic frame; the SR frame adds some details on the bow (shown in the blue box), which is helpful for recognition since the bow is directly related to the class Archery. In (b), which corresponds to Bicubic > HR > VDSR, we observe that the SR frame contains more details than the Bicubic frame, but mostly on the background (shown in the blue box) rather than on the key object (shown in the red box); the added details are harmful for action recognition. In (c), which corresponds to VDSR > HR > Bicubic, the SR frame has more details on the human (which is directly related to the action) but fewer details on the background (due to the LR input), so the recognition accuracy is boosted even over HR.

(a) 10-th frame of Archery_g01_c07
(b) 152-nd frame of JumpRope_g02_c02
(c) 39-th frame of HandstandWalking_g06_c01
Figure 2: Examples showing how image details added by SR can influence the recognition accuracy. In (a), SR adds details on the bow, and since the bow directly relates to archery, SR improves recognition over Bicubic. In (b), SR adds details on the background but not on the object, resulting in even lower accuracy than Bicubic. In (c), SR adds details on the walking woman but not on the background, resulting in even higher accuracy than HR. See Table 1 for the accuracy values.

3.2.2 Method

Based on the observation, we propose an SR method to selectively enhance the image regions that are related to action recognition. These regions usually correspond to high-motion regions, such as the bow in Figure 2 (a), the rotating rope in Figure 2 (b), and the walking woman in Figure 2 (c). In this paper, we select these regions according to the optical flow between frames since optical flow is a commonly chosen representation for motion information.

Most SR networks use mean squared error (MSE) as the loss function, which assumes equal importance of every pixel. In contrast, we propose to use a weighted MSE (WMSE) based on optical flow to emphasize pixels that are more important than others. In short, the loss function we use here is

$$L_{WMSE} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \sqrt{u_{ij}^{2} + v_{ij}^{2}} \, \left( I^{HR}_{ij} - I^{SR}_{ij} \right)^{2}$$

where $I^{HR}$ and $I^{SR}$ are HR and SR frames respectively. $u$ and $v$ represent the magnitude of optical flow in the horizontal and vertical directions respectively. Here, the optical flow is calculated offline from the HR video using FlowNet 2.0 [13], which we observed is slightly better than using TVL1 [39]. $H$ and $W$ are the height and width of frames, and $i$, $j$ are pixel indexes. In this way, the loss guides the network in a pixel-wise manner: pixels with larger motion have larger optical flow magnitudes, which correspond to larger loss weights, and thus receive more attention during SR enhancement.
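To make the pixel-wise weighting concrete, here is a minimal sketch of a flow-weighted MSE in plain Python. It assumes the weight of each pixel is the flow magnitude sqrt(u^2 + v^2) and that frames and flow components are given as 2-D lists of equal size; the function name `weighted_mse` and this exact weighting are illustrative assumptions, not the authors' released code.

```python
import math


def weighted_mse(hr, sr, flow_u, flow_v):
    """Flow-weighted MSE between an HR frame and an SR frame.

    All inputs are 2-D lists of the same height and width. The per-pixel
    weight is assumed to be the optical-flow magnitude sqrt(u^2 + v^2).
    """
    height, width = len(hr), len(hr[0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            weight = math.hypot(flow_u[i][j], flow_v[i][j])
            total += weight * (hr[i][j] - sr[i][j]) ** 2
    return total / (height * width)


if __name__ == "__main__":
    hr = [[1.0, 2.0], [3.0, 4.0]]
    sr = [[0.0, 2.0], [3.0, 2.0]]
    u = [[3.0, 0.0], [0.0, 0.0]]
    v = [[4.0, 0.0], [0.0, 1.0]]
    print(weighted_mse(hr, sr, u, v))
```

With this weighting, errors in static regions (zero flow) contribute nothing to the loss, so a practical implementation might add a small constant to the weight; that choice is left open in this sketch.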

In addition to WMSE, we further adopt perceptual loss, which has been widely used in recent SR methods for improving visual quality [16, 21]. Using perceptual loss is indeed to minimize the difference of high-level image features between SR image and HR image. Since high-level features are closer to classification than low-level ones, it is also quite suitable for our task. In this paper, we use the outputs of ‘conv3_3’ of the VGG-16 network to calculate perceptual loss, similar to [16].

As mentioned before, the spatial stream is equivalent to image classification. We anticipate that single-frame SR can perform well for the spatial stream and also has lower complexity than video SR. Here, we adopt the network structure of VDSR [18] for our proposed SoSR. We retrain the network with our training data (details in Section 4) and the following loss function:

$$L_{SoSR} = L_{WMSE} + \lambda \, L_{perceptual}$$

where $L_{WMSE}$ is the weighted MSE loss, $L_{perceptual}$ is the perceptual loss, and $\lambda$ is a weight.

We conduct an ablation study about the proposed loss function. As shown in Table 2, the proposed WMSE performs much better than the usual MSE in terms of the final recognition accuracy, and perceptual loss further improves for SoSR.

Dataset SoSR
MSE WMSE WMSE+Perceptual
UCF101 67.09% 74.10% 76.88%
HMDB51 46.60% 47.91% 49.48%
Table 2: Ablation study of SoSR with different loss functions.

3.3 Temporal-oriented SR

3.3.1 Analysis

We now switch to the temporal stream. As described in the two-stream architecture, the temporal stream takes optical flow as input to utilize temporal information. The effectiveness of this design is verified by many action recognition networks [35, 3]. Thus, we want to investigate how SR affects the quality of optical flow. We again experiment with the representative image SR method–VDSR [18]. Figure 3 shows the optical flow maps calculated from HR video, SR video, and bicubic interpolated video, respectively. Here the optical flow is calculated by the TVL1 method [39]. From Figure 3, we can find that the optical flow from bicubic video has a lot of artifacts, and VDSR even worsens the optical flow. Thus, the traditional SR methods incur less appealing results of recognition accuracy.

Figure 3: Optical flow maps calculated from HR, SR, and bicubic videos, respectively. Artifacts can be found in the circled regions. Zooming-in inspection shows that SR video has more artifacts than bicubic one. In this example, SR incurs lower recognition accuracy than bicubic.

Indeed, VDSR is an image SR network that enhances video frames individually and can cause temporal inconsistency. For high-quality optical flow, we need to ensure the temporal consistency between frames, which has also been studied in previous video SR works. For example, [2, 15] discussed the temporal consistency and its relation to visible flickering artifacts when displaying SR video. In Figure 4, we adopt the visualization method known as temporal profiles [2, 15] to display the flickering artifacts. As seen, VDSR indeed incurs more temporal discontinuity.

(a) A video clip (TaiChi_g01_c04)
(b) Temporal profiles of different results
Figure 4: (a) An example video clip. We sample one row at the same location (indicated by the red dot line) from each frame and concatenate the rows to produce (b) the temporal profiles. Obviously, bicubic video has the least image details, SR video has some details but displays temporal discontinuity that will cause flickering artifacts. In this example, SR incurs lower recognition accuracy than bicubic.
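The temporal-profile visualization described in the caption of Figure 4 can be sketched as follows. The helper `temporal_profile` follows the caption (one row taken at the same location from every frame, then stacked); `mean_temporal_change` is a hypothetical extra, not from the paper, that gives a rough numeric indication of flicker along the time axis.

```python
def temporal_profile(frames, row):
    """Stack the same row from each frame; one output line per frame."""
    return [list(frame[row]) for frame in frames]


def mean_temporal_change(profile):
    """Average absolute change between consecutive lines of a profile.

    Hypothetical flicker indicator: larger values suggest less temporal
    continuity at the sampled row.
    """
    diffs = [abs(a - b)
             for prev, cur in zip(profile, profile[1:])
             for a, b in zip(prev, cur)]
    return sum(diffs) / len(diffs) if diffs else 0.0


if __name__ == "__main__":
    # Three tiny 2x2 grayscale "frames"; row 1 is sampled from each.
    frames = [[[0, 0], [1, 2]], [[0, 0], [1, 4]], [[0, 0], [3, 4]]]
    profile = temporal_profile(frames, row=1)
    print(profile, mean_temporal_change(profile))
```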

3.3.2 Method

Figure 5: Our proposed ToSR uses a siamese network for training. We jointly consider two consecutive frames and design four loss terms to ensure the quality of individual frames as well as the temporal consistency between them.

Based on the observation, we assume there is a relation between optical flow-based recognition accuracy and the temporal consistency in the SR video. Note that the existing video SR schemes usually perform SR frame by frame, which cannot guarantee the consistency between SR frames. Thus in this paper, we consider a siamese network for training video SR network.

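The siamese network for training ToSR is shown in Figure 5. We use two copies of an SR network to enhance two consecutive frames respectively. First of all, we want the SR frames to have high quality, and we use two MSE losses for the two frames respectively, i.e.

$$L_{MSE\text{-}1} = \frac{1}{HW} \left\| I^{SR}_{t-1} - I^{HR}_{t-1} \right\|_2^2, \qquad L_{MSE\text{-}2} = \frac{1}{HW} \left\| I^{SR}_{t} - I^{HR}_{t} \right\|_2^2.$$

Moreover, we want to ensure the temporal continuity between SR frames. As our objective is to achieve high-quality optical flow, it would be straightforward to calculate the optical flow between SR frames and compare it with that between HR frames. However, this would require an optical flow estimation network to support end-to-end training, and recent optical flow networks [13, 12, 28] are too deep to be trained efficiently (they hinder error back-propagation). In this paper, we take another approach to estimate the temporal continuity. We adopt the optical flow from the HR video, which can be calculated beforehand, to perform warping between two SR frames. Let the optical flow be $F_{t-1 \to t}$; we use the relation $I_{t-1}(x) = I_{t}(x + F_{t-1 \to t}(x))$ to warp the SR frame $I^{SR}_{t}$. The warped result $\hat{I}^{SR}_{t-1}$ is compared against both the SR and HR frames of the previous timestamp. Accordingly, we define two losses:

$$L_{warp\text{-}SR} = \frac{1}{HW} \left\| \hat{I}^{SR}_{t-1} - I^{SR}_{t-1} \right\|_2^2, \qquad L_{warp\text{-}HR} = \frac{1}{HW} \left\| \hat{I}^{SR}_{t-1} - I^{HR}_{t-1} \right\|_2^2.$$

In summary, the loss function for ToSR is

$$L_{ToSR} = \alpha \left( L_{MSE\text{-}1} + L_{MSE\text{-}2} \right) + \beta \, L_{warp\text{-}SR} + \gamma \, L_{warp\text{-}HR}$$

where $\alpha$, $\beta$, $\gamma$ are weights.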
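The four loss terms used to train ToSR can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: frames are 2-D lists, warping is nearest-neighbour with clamped borders (a trainable version would use differentiable bilinear sampling), and the grouping of the three weights over the four terms in `tosr_loss` is an assumption. All function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized 2-D lists."""
    h, w = len(a), len(a[0])
    return sum((a[i][j] - b[i][j]) ** 2
               for i in range(h) for j in range(w)) / (h * w)


def warp(frame, flow_u, flow_v):
    """Backward-warp `frame` (time t) towards time t-1 using a dense flow.

    out[i][j] = frame[i + v][j + u], nearest neighbour, borders clamped.
    """
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            src_i = min(max(int(round(i + flow_v[i][j])), 0), h - 1)
            src_j = min(max(int(round(j + flow_u[i][j])), 0), w - 1)
            out[i][j] = frame[src_i][src_j]
    return out


def tosr_loss(sr_prev, sr_cur, hr_prev, hr_cur, flow_u, flow_v,
              alpha=1.0, beta=1.0, gamma=1.0):
    """Two per-frame MSE terms plus two warping terms (weights are assumed)."""
    warped = warp(sr_cur, flow_u, flow_v)  # flow comes from the HR video
    quality = mse(sr_prev, hr_prev) + mse(sr_cur, hr_cur)
    warp_sr = mse(warped, sr_prev)  # consistency with the previous SR frame
    warp_hr = mse(warped, hr_prev)  # consistency with the previous HR frame
    return alpha * quality + beta * warp_sr + gamma * warp_hr


if __name__ == "__main__":
    hr_prev = [[1.0, 2.0, 3.0, 3.0]]
    hr_cur = [[0.0, 1.0, 2.0, 3.0]]   # content shifted right by one pixel
    u = [[1.0, 1.0, 1.0, 1.0]]
    v = [[0.0, 0.0, 0.0, 0.0]]
    print(tosr_loss(hr_prev, hr_cur, hr_prev, hr_cur, u, v))
```

Because the flow is precomputed from the HR video, no optical flow network needs to sit inside the training graph, which is the motivation given in the text for using warping instead of a flow-comparison loss.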

Any existing image or video SR network can implement ToSR. In this paper, we investigate two possibilities. The first (ToSR1) is based on the VDSR network [18], which performs SR for frames individually. The second (ToSR2) is based on the VSR-DUF with 16 layers [15], which utilizes multiple LR frames for SR.

4 Experiments

4.1 Datasets

We perform experiments using three datasets: one natural video dataset, CDVL-134, and two action recognition datasets, UCF101 and HMDB51. For the video SR task, there is no commonly used dataset. CDVL-134 is a dataset that we collected from CDVL, and it contains 134 natural videos with various content, including landscapes, animals, activities and so on. Because the resolution of these videos varies from 480×360 to 1920×1080, we resize them to around 320×240 with bicubic interpolation while maintaining their aspect ratios. UCF101 and HMDB51 are popular action recognition datasets. The former contains 13,320 video clips belonging to 101 action categories, and the latter is composed of 51 action categories and 6,766 video clips. Both datasets provide three training/testing splits, and here we only use the first split as a representative. For more details, please refer to [27] and [20] respectively.

4.2 Implementation Details

Spatial-oriented SR. All LR video clips are generated by 4× down-sampling with bicubic interpolation, and we use bicubic interpolation again to generate the interpolated frames. FlowNet 2.0 [13] is applied to the HR frames to calculate optical flow, which is then processed into weight maps. The HR frames, interpolated LR frames and weight maps are cut into aligned 128×128 patches to produce training samples. In particular, we randomly select 120 frames from each video of the CDVL-134 dataset and choose the top 10 crops with the largest area of motion. After excluding some obviously low-quality patches, there are in total 136,950 patches for training and 9,830 patches for validation. Besides, we use a VDSR model well trained on natural images as initialization and adopt a pre-trained VGG-16 model. We use the deep learning framework Caffe to perform experiments. The learning rates of the convolutional filters and of the biases are initialized separately and decrease to 1/10 every quarter of the maximum number of iterations. Batch size is set to 50. The weight of the perceptual loss is chosen so that the resulting WMSE and perceptual loss are at the same scale.
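The selection of crops with the largest area of motion can be sketched as below. This is an illustration under assumptions that are not stated in the text: crops are taken on a non-overlapping grid, a pixel counts as moving when its flow magnitude exceeds a hypothetical `threshold`, and the function names are made up for this example.

```python
import math


def motion_area(flow_u, flow_v, top, left, size, threshold):
    """Number of pixels in a size x size crop whose flow magnitude exceeds threshold."""
    return sum(
        1
        for i in range(top, top + size)
        for j in range(left, left + size)
        if math.hypot(flow_u[i][j], flow_v[i][j]) > threshold
    )


def top_motion_patches(flow_u, flow_v, size=128, k=10, threshold=1.0):
    """Top-left corners of the k grid crops with the largest motion area."""
    h, w = len(flow_u), len(flow_u[0])
    corners = [(t, l)
               for t in range(0, h - size + 1, size)
               for l in range(0, w - size + 1, size)]
    corners.sort(
        key=lambda c: motion_area(flow_u, flow_v, c[0], c[1], size, threshold),
        reverse=True,
    )
    return corners[:k]


if __name__ == "__main__":
    # 4x4 flow field with motion only in the bottom-right 2x2 block.
    u = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 2, 2], [0, 0, 2, 2]]
    v = [[0] * 4 for _ in range(4)]
    print(top_motion_patches(u, v, size=2, k=1))
```

The same corners would then be used to cut the HR frame, the interpolated LR frame and the weight map, so that the three patches stay aligned.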

Temporal-oriented SR. All the LR video clips and patches are prepared in the same way as for SoSR, except that the TVL1 [39] predictor is applied to HR frames to calculate optical flow for warping, and we have 143,250 patch pairs for training and 10,386 pairs for validation. ToSR1 is implemented in Caffe with the same configuration as SoSR (except that the batch size is 60). ToSR2 is implemented in TensorFlow; the initial learning rate is 0.01 and is decreased to 1/10 every 10 epochs, as recommended in [15]. We use batch size 16 and fine-tune from the model provided by the authors of [15]. The loss weights are set separately for ToSR1 and ToSR2.

4.3 Results

Baseline methods are VDSR [18], SPMC [32] and VSR-DUF [15] trained with MSE. Particularly, they are one single image SR method and two video SR methods. We train our SoSR and ToSRs on CDVL-134 and apply them on UCF101 and HMDB51 test split. Then TSN [35] and ST-Resnet [8] are used to obtain the recognition accuracy, shown in Table 3 and Table 4 respectively.

              UCF101                       HMDB51
Method        Spatial  Temporal  Fusion    Spatial  Temporal  Fusion
Bicubic       71.25    81.08     87.87     42.81    56.54     63.53
VDSR          67.09    79.81     86.84     46.6     55.1      63.59
SoSR          76.88    80.71     89.12     49.48    54.51     65.1
ToSR1         63.77    83.88     87.72     44.58    58.76     64.38
SPMC          70.42    80.19     87.15     48.95    56.41     64.31
VSR-DUF-16    68.56    84.89     89.36     48.37    59.48     66.08
VSR-DUF-52    70.54    85.09     89.85     48.5     60.52     66.86
ToSR2         64.32    85.22     87.95     46.99    61.24     65.62
SoSR+ToSR2    /        /         91.18     /        /         67.12
HR            86.02    87.63     93.49     54.58    62.16     69.28
Table 3: Recognition accuracy (%) of different methods using the TSN network. The number after VSR-DUF [15] indicates the number of layers.

              UCF101                       HMDB51
Method        Spatial  Temporal  Fusion    Spatial  Temporal  Fusion
Bicubic       72.01    78.28     84.62     43.59    53.76     59.48
VDSR          72.27    79.43     84.48     49.18    54.44     60.2
SoSR          80.12    78.87     87.32     53.4     53.24     63.33
ToSR1         67.45    80.35     83        47.45    56.44     62.09
SPMC          74.45    77.44     84.09     53.14    53.53     63.66
VSR-DUF-16    72.11    80.06     83.9      50.62    55.07     61.11
VSR-DUF-52    74.49    80.16     84.88     52.84    57.61     65.23
ToSR2         69.69    80.37     83.43     51.47    58.73     64.31
SoSR+ToSR2    /        /         87.58     /        /         65.82
HR            88.01    85.71     92.94     56.01    59.41     68.1
Table 4: Recognition accuracy (%) of different methods using the ST-Resnet.

On the spatial stream, we find Bicubic < VDSR < video SR in most cases. This result is quite intuitive, because a more advanced SR method generates SR frames with more details and is more helpful to recognition on average. We analyze further as follows. Bicubic interpolation is independent of the input and does not require any training, so it can be regarded as a stable reference among the baseline methods. We draw scatter plots to compare the per-class accuracy of TSN between different SR methods and Bicubic. From Figure 6, we find that for SR networks trained with MSE, the points disperse throughout the plane, which means these SR methods are unstable for the classifier. In other words, optimizing SR networks with MSE does not consistently improve recognition performance. Referring to Table 5, SR methods perform consistently well in some classes and consistently worse in others. However, the situation improves in the plot for SoSR. According to the distribution and dispersion of the points, our SoSR generates SR frames that are easier to classify and has more stable performance across classes. From Table 5, we find that in easy classes our SoSR achieves better recognition performance than VDSR, the architecture it is based on, and in hard classes it improves recognition performance significantly and even beats Bicubic.

Figure 6: Scatter plots of recognition accuracy values of the 101 classes (UCF101 with TSN network) obtained by different SR methods and by bicubic. Referring to the equal-accuracy line (shown as dash line), our SoSR is clearly better than bicubic but the other SR methods do not outperform bicubic consistently. We further analyze several classes indicated by colorful symbols, see Table 5 for details.
Class             Bicubic  VDSR    SPMC    VSR-DUF-52  SoSR
PoleVault         72.5     27.5    45      45          87.5
Swing             78.57    21.43   35.71   30.95       83.33
TennisSwing       77.55    20.41   18.73   18.73       48.98
WalkingWithDog    77.78    22.22   19.44   16.67       47.22
WallPushups       8.57     45.71   71.43   71.43       65.71
TaiChi            46.43    85.71   92.86   92.86       89.29
FrontCrawl        32.43    78.38   91.89   94.59       89.19
Table 5: Class and recognition accuracy (%) of several classes highlighted in Figure 6 using TSN. Note that our SoSR has the same structure as VDSR, but improves much in accuracy especially for hard classes.

Switching to the temporal stream, there is an obvious SPMC < VSR-DUF ordering among the multi-frame SR methods: even VSR-DUF-16 beats SPMC by a large margin. This difference may be attributed to the design of the network structure. SPMC performs explicit warping with optical flow estimated from LR frames, which may introduce errors and undermine temporal consistency. VSR-DUF instead uses 3D convolution operating directly on 7 consecutive LR frames to predict dynamic filters, which are then used to up-sample the central LR frame. This structure without explicit motion compensation may be the key for VSR-DUF to maintain good performance. However, adding our optical flow constraint improves the performance even further. Comparing the accuracy of VDSR and our ToSR1 verifies that our training strategy can refine single-frame SR networks by maintaining temporal consistency, but its performance is limited by the lack of information from preceding and following frames. Breaking this limitation, our ToSR2, an SR network based on the 16-layer VSR-DUF, outperforms VSR-DUF-52 and approaches the performance of HR frames.

Finally, we consider both streams in Tables 3 and 4. Comparing rows with similar architectures (VDSR, SoSR and ToSR1; VSR-DUF and ToSR2), we find a trade-off between the accuracy of the two streams. Besides, the fusion results also verify the advantage of performing SR separately for the two streams.

4.4 Visual Inspection

To obtain video-level accuracy, we sample 25 frames from each video clip and perform recognition as recommended in [35]. Figure 7 shows the video-level accuracy of TSN and the visual quality of example frames. Generally, comparing the accuracy and visual quality of Bicubic, SoSR and HR frames, the recognition accuracy increases as the visual quality increases. (Zooming-in inspection of SoSR results shows visible blocking artifacts, which we have confirmed are caused by the joint effect of the perceptual loss and the (even invisible) blocking artifacts in the input video; refer to the supplementary material for more details.) But the situation is slightly different among SR methods. This may result from the imperfection of the recognition CNN.

Figure 7: Visual quality comparison and the numbers indicate video-level recognition accuracy using TSN. In general recognition accuracy correlates positively to visual quality, but cases are different when comparing the SR methods. This can be attributed to the imperfectness of action recognition networks.
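The video-level evaluation described above (25 sampled frames per clip) can be sketched as follows. Evenly spaced sampling and averaging of per-frame class scores are assumptions made for illustration; the text only states that 25 frames are sampled as recommended in [35]. Both function names are hypothetical.

```python
def sample_indices(num_frames, num_samples=25):
    """Evenly spaced frame indices across a clip (assumed sampling scheme)."""
    if num_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    step = num_frames / num_samples
    return [min(int(step * k), num_frames - 1) for k in range(num_samples)]


def video_level_prediction(frame_scores):
    """Average per-frame class scores and return the index of the best class."""
    num_classes = len(frame_scores[0])
    mean = [sum(scores[c] for scores in frame_scores) / len(frame_scores)
            for c in range(num_classes)]
    return max(range(num_classes), key=mean.__getitem__)


if __name__ == "__main__":
    print(sample_indices(100)[:5])
    print(video_level_prediction([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]))
```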

Switching to the temporal stream, Figure 8 shows TVL1 optical flow calculated from different frames, from which we find that the optical flow of both VSR-DUF and our ToSR2 has fewer artifacts, while our ToSR2 is more similar to HR than VSR-DUF, especially in the highlighted regions. From the viewpoint of video SR, temporal profiles are shown in Figure 9. We observe that Bicubic and VDSR frames do not reconstruct enough image details, and that SPMC and VSR-DUF are sharp but do not restore the handrail in each frame. In contrast, our ToSR2 has the best temporal consistency. Please refer to the supplementary materials for more visualization.

Figure 8: Optical flow maps calculated from different results. Our ToSR and VSR-DUF clearly outperform the other methods. A zoomed-in inspection of the circled regions shows that our ToSR outperforms VSR-DUF-52.
Figure 9: Temporal profiles of different results. Bicubic and VDSR fail to reconstruct vivid image details. SPMC and VSR-DUF incur obvious temporal discontinuity.
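For readers unfamiliar with temporal profiles, the sketch below shows one common way to build them: a fixed pixel row (or column) is taken from every frame and the slices are stacked along the time axis, so that temporally consistent content appears as smooth streaks. The function name `temporal_profile` and its arguments are hypothetical, and the exact row or column used for Figure 9 is not specified here.

```python
import numpy as np

def temporal_profile(frames, index, axis="row"):
    """Stack one fixed row or column from every frame along the time axis.

    `frames` has shape (T, H, W) or (T, H, W, C). The result has shape
    (T, W[, C]) for a row profile and (T, H[, C]) for a column profile.
    """
    frames = np.asarray(frames)
    if frames.ndim not in (3, 4):
        raise ValueError("expected frames of shape (T, H, W) or (T, H, W, C)")
    if axis == "row":
        return frames[:, index]
    if axis == "column":
        return frames[:, :, index]
    raise ValueError("axis must be 'row' or 'column'")

# Toy clip: 4 grayscale frames of size 3x5 whose middle row brightens over time.
clip = np.zeros((4, 3, 5))
clip[:, 1, :] = np.arange(4)[:, None]
row_profile = temporal_profile(clip, 1, axis="row")
column_profile = temporal_profile(clip, 2, axis="column")
```

Flicker between consecutive frames shows up in such a profile as jagged or broken streaks, which is the kind of temporal discontinuity noted for SPMC and VSR-DUF in Figure 9.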

5 Conclusion

In this paper, we consider the video SR problem not for visual quality, but for improving action recognition accuracy. Tailored for two-stream action recognition networks, we propose SoSR with an optical-flow guided weighted MSE loss and ToSR with a siamese network that emphasizes temporal consistency, for the spatial and temporal streams respectively. Experimental results demonstrate the advantages of the proposed SoSR and ToSR methods. In the future, we plan to test our methods on real-world videos and to consider video SR and action recognition in a holistic network. The trade-off between PSNR or perceptual quality and recognition accuracy is also worth studying.

References

  • [1] Y. Blau and T. Michaeli. The perception-distortion tradeoff. In CVPR, pages 6228–6237, 2018.
  • [2] J. Caballero, C. Ledig, A. P. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In CVPR, volume 1, pages 4778–4787, 2017.
  • [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, pages 4724–4733, 2017.
  • [4] D. Dai, Y. Wang, Y. Chen, and L. Van Gool. Is image super-resolution helpful for other vision tasks? In WACV, pages 1–9, 2016.
  • [5] A. Diba, V. Sharma, and L. Van Gool. Deep temporal linear encoding networks. In CVPR, pages 2329–2338, 2017.
  • [6] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
  • [7] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
  • [8] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, pages 3468–3476, 2016.
  • [9] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941, 2016.
  • [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
  • [11] S. Herath, M. Harandi, and F. Porikli. Going deeper into action recognition: A survey. Image and Vision Computing, 60:4–21, 2017.
  • [12] T.-W. Hui, X. Tang, and C. C. Loy. LiteFlowNet: A lightweight convolutional neural network for optical flow estimation. In CVPR, pages 8981–8989, 2018.
  • [13] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, volume 2, pages 2462–2470, 2017.
  • [14] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
  • [15] Y. Jo, S. W. Oh, J. Kang, and S. J. Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In CVPR, pages 3224–3232, 2018.
  • [16] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711, 2016.
  • [17] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
  • [18] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, pages 1646–1654, 2016.
  • [19] J. Kim, J. K. Lee, and K. M. Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, pages 1637–1645, 2016.
  • [20] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, pages 2556–2563, 2011.
  • [21] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, number 3, pages 4681–4690, 2017.
  • [22] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In CVPRW, number 2, pages 136–144, 2017.
  • [23] K. Nasrollahi, S. Escalera, P. Rasti, G. Anbarjafari, X. Baro, H. J. Escalante, and T. B. Moeslund. Deep learning based super-resolution for improved action recognition. In Image Processing Theory, Tools and Applications, pages 67–72, 2015.
  • [24] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, pages 5534–5542, 2017.
  • [25] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, pages 1874–1883, 2016.
  • [26] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
  • [27] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [28] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, pages 8934–8943, 2018.
  • [29] L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In ICCV, pages 4597–4605, 2015.
  • [30] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In CVPR, number 2, pages 3147–3155, 2017.
  • [31] Y. Tai, J. Yang, X. Liu, and C. Xu. MemNet: A persistent memory network for image restoration. In CVPR, pages 4539–4547, 2017.
  • [32] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia. Detail-revealing deep video super-resolution. In ICCV, pages 22–29, 2017.
  • [33] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.
  • [34] G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1510–1517, 2018.
  • [35] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, pages 20–36, 2016.
  • [36] X. Wang, L. Gao, P. Wang, X. Sun, and X. Liu. Two-stream 3D convnet fusion for action recognition in videos with arbitrary size and length. IEEE Transactions on Multimedia, 20(3):634–644, 2018.
  • [37] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In ACM MM, pages 461–470, 2015.
  • [38] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694–4702, 2015.
  • [39] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, pages 214–223, 2007.
  • [40] H. Zhang, D. Liu, and Z. Xiong. Convolutional neural network-based video super-resolution for action recognition. In FG, pages 746–750, 2018.
  • [41] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu. Image super-resolution using very deep residual channel attention networks. In ECCV, pages 1–16, 2018.
  • [42] Y. Zhou, X. Sun, Z.-J. Zha, and W. Zeng. MiCT: Mixed 3D/2D convolutional tube for human action recognition. In CVPR, pages 449–458, 2018.