Self-Conditioned Probabilistic Learning of Video Rescaling

Bicubic downscaling is a prevalent technique used to reduce the video storage burden or to accelerate downstream processing. However, the inverse upscaling step is non-trivial, and the downscaled video may also degrade the performance of downstream tasks. In this paper, we propose a self-conditioned probabilistic framework for video rescaling that learns the paired downscaling and upscaling procedures simultaneously. During training, we decrease the entropy of the information lost in downscaling by maximizing its probability conditioned on the strong spatial-temporal prior information within the downscaled video. After optimization, the video downscaled by our framework preserves more meaningful information, which benefits both the upscaling step and downstream tasks, e.g., video action recognition. We further extend the framework to a lossy video compression system, in which a gradient estimator for non-differentiable industrial lossy codecs is proposed for end-to-end training of the whole system. Extensive experimental results demonstrate the superiority of our approach on video rescaling, video compression, and efficient action recognition tasks.



1 Introduction

High-resolution videos are widely used in various computer vision tasks. However, because of the increased storage burden and the high computational cost, it is usually necessary to first downscale the high-resolution videos. We can then either compress the resulting low-resolution videos to save storage or feed them to downstream tasks to reduce the computational cost. Although this paradigm is prevalent, it has two disadvantages. First, it is non-trivial to restore the original high-resolution videos from the (compressed) low-resolution videos, even if we use the latest super-resolution methods [30, 65, 48, 59]. Second, it is also challenging for downstream tasks to achieve high performance on these low-resolution videos. This raises the question of whether the downscaling operation can facilitate the reconstruction of the high-resolution videos and also preserve the most meaningful information for downstream tasks.

Recently, this question has been partially studied as a single image rescaling problem [24, 27, 47, 62], in which the image downscaling and upscaling operators are learned jointly. However, how to adapt these methods from the image domain to the video domain and how to leverage the rich temporal information within videos are still open problems. More importantly, modeling the information lost during downscaling is non-trivial. Current methods either ignore the lost information [24, 27, 47] or assume that it follows an independent distribution in the latent space [62], neglecting the internal relationship between the downscaled image and the lost information. Besides, none of the literature mentioned above has explored how to apply the rescaling technique to lossy image or video compression.

In this paper, we focus on building a video rescaling framework and propose a self-conditioned probabilistic learning approach to learn a pair of video downscaling and upscaling operators by exploiting the information dependency within the video itself. Specifically, we first design a learnable frequency analyzer to decompose the original high-resolution video into its downscaled version and the corresponding high-frequency component. Then, a Gaussian mixture distribution is leveraged to model the high-frequency component by conditioning on the downscaled video. For accurate estimation of the distribution parameters, we further introduce the local and global temporal aggregation modules to fuse the spatial information from adjacent downscaled video frames. Finally, the original video can be restored by a frequency synthesizer from the downscaled video and the high-frequency component sampled from the distribution. We integrate the components above as a novel self-conditioned video rescaling framework termed SelfC and optimize it by minimizing the negative log-likelihood for the distribution.

Furthermore, we apply our proposed SelfC to two practical applications, i.e., lossy video compression and video action recognition. In particular, to integrate our framework with existing non-differentiable video codecs (e.g., H.264 [61] and H.265 [46]), we propose an efficient and effective one-pass optimization strategy based on the control variates method. It approximates the gradients of traditional codecs in the back-propagation procedure, which yields an end-to-end optimization system.

Experimental results demonstrate that the proposed framework achieves state-of-the-art performance on the video rescaling task. More importantly, we further demonstrate the effectiveness of the framework in practical applications. For the lossy video compression task, compared with directly compressing the high-resolution videos, the video compression system based on our SelfC framework cuts the storage cost significantly (up to 30% reduction). For the video action recognition task, our framework reduces the computational complexity by more than 60% with negligible performance degradation.

In summary, our main contributions are:

  • We propose a probabilistic learning framework dubbed SelfC for the video rescaling task, which models the lost information during downscaling as a dynamic distribution conditioned on the downscaled video.

  • Our approach exploits rich temporal information in downscaled videos for an accurate estimation of the distribution parameters by introducing the specified local and global temporal aggregation modules.

  • We propose a gradient estimation method for non-differentiable lossy codecs based on the control variates method and the Monte Carlo sampling technique, extending the framework to a video compression system.

2 Related Work

Video Upscaling after Downscaling. Traditional video downscaling approaches subsample the input high-resolution (HR) videos with a handcrafted kernel, such as Bilinear or Bicubic. For restoration, video super-resolution (SR) methods are utilized. Since the SR task is inherently ill-posed, previous SR works [30, 65, 48, 59, 16, 23] mainly leverage a heavy neural network to hallucinate the lost details, and achieve only unsatisfactory results. Taking the video downscaling method into consideration may help mitigate the ill-posedness of the video upscaling procedure.

There are already a few works on the single image rescaling task in a similar spirit, which consider the downscaling and the upscaling of the image simultaneously. For example, Kim et al. [24] proposed a task-aware downscaling model based on an auto-encoder framework. Later, Li et al. [27] proposed to use a convolutional neural network (CNN) to estimate the downscaled low-resolution images for a given super-resolution method. For the stereo matching task, Yang et al. [67] proposed a superpixel-based downsampling/upsampling scheme to effectively preserve object boundaries and fine details. More recently, Xiao et al. [62] proposed to leverage an invertible neural network (INN) to model the two reciprocal steps, which relies on a very deep INN to map the complex distribution of the lost information to an independent and fixed normal distribution.

However, these methods neither leverage the temporal information between adjacent frames, which is important for video related tasks, nor consider the fact that the components of different frequencies in natural images or videos are conditionally dependent [56, 45, 38, 55].

Video Compression. Several traditional video compression algorithms have been proposed and widely deployed, such as H.264 [61] and H.265 [46]. Most of them follow the predictive coding architecture and rely on sophisticated hand-crafted transformations to analyze the redundancy within the videos. Recently, fully end-to-end video codecs [14][5][68][1][29][31][19][34] such as DVC [32] have been proposed by considering the rate-distortion trade-off of the whole compression system. They demonstrate promising performance and may be further improved by training on more videos in the wild. However, they have not been widely adopted in industry and lack hardware implementations. In contrast, our framework can be readily integrated with the best traditional video codecs and further reduces the storage space of the compressed video significantly.

Video Action Recognition. Simonyan et al. [44] first proposed the two-stream framework. Feichtenhofer et al. [8] then improved it. Later, Wang et al. [58] proposed a new sparse frame sampling strategy. Recently, 3D networks [51][4][15][52][39][7][49] have also shown promising performance. Our work can accelerate off-the-shelf action CNNs by 3-4 times while preserving comparable performance. We mainly conduct experiments on light-weight 2D action CNNs (e.g., TSM [28]) based on 2D-ResNet50 [17] for efficiency.

Figure 1: Overview of the proposed framework. We exploit the conditional relationship between different frequency components within the video for better video rescaling; the components are disentangled by the frequency analyzer. During downscaling, the low-frequency (LF) component is quantized to produce the downscaled low-resolution (LR) video. During upscaling, the probability density of the high-frequency (HF) component is predicted by a spatial-temporal prior network (STP-Net). Then, the LR video and an HF component sampled from this distribution are reconstructed into the high-resolution video by the frequency synthesizer. The storage medium can be lossless or lossy for different applications. The figure also marks the channel concatenation operation. TConv and Conv represent the 1D convolution with kernel size 1 and the 3D convolution, respectively. The dimensions of some tensors are also indicated in terms of the width W, the height H and the length T of the video.

3 Proposed Method

An overview of our proposed SelfC framework is shown in Fig. 1 (a). During the downscaling procedure, given a high-resolution (HR) video, a frequency analyzer (FA) (Section 3.1) first converts it into video features whose first channels form the low-frequency (LF) component and whose last channels form the high-frequency (HF) component, with the channel split depending on the downscaling ratio. Then, the LF component is quantized to an LR video for storage. The HF component is discarded in this procedure.

During the upscaling procedure, given the LR video, the spatial-temporal prior network (STP-Net) (Section 3.3) predicts the probability density function of the HF component conditioned on that LR video (Eq. (1)). We model this density as a continuous mixture of parametric Gaussian distributions (Section 3.2). Then, an instance of the HF component associated with the LR video is drawn from the distribution. Finally, we reconstruct the HR video from the concatenation of the sampled HF component and the LR video by the frequency synthesizer (FS).

Figure 2: Dense2D-T block. SConv denotes the spatial convolution, i.e., a 3D convolution whose kernel spans only the spatial dimensions. TConv denotes the temporal convolution, i.e., a 3D convolution whose kernel spans only the temporal dimension. LReLU denotes the Leaky ReLU non-linearity.


3.1 Frequency Analyzer and Synthesizer

As shown in Fig. 1 (b), we first decompose the HR input video into a low-frequency component, obtained by bicubic downscaling with the given scaling ratio, and a residual high-frequency component. The residual is what remains after the low-frequency component is bicubically upscaled back and subtracted from the input. It is then rearranged by the inverse of the pixel shuffling operation proposed in [43], using the same scaling ratio, so that both components share the downscaled spatial size while the video length is unchanged. Then, we use a learnable transformation to map the channel concatenation of the two components to the output features, which again consist of an LF component and an HF component. The network architecture for this transformation is very flexible in our framework, and by default we implement it by stacking multiple Dense2D-T blocks. The architecture of the Dense2D-T block is shown in Fig. 2, where we extend the vanilla Dense2D block [20] with temporal modeling ability.

The architecture of the frequency synthesizer is symmetric with that of the analyzer, as shown in Fig. 1 (c). Specifically, we use channel splitting, bicubic upscaling and pixel shuffling operations to synthesize the final high-resolution video from the reconstructed video feature.
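The fixed (non-learned) part of this decomposition can be illustrated with a minimal sketch on a single-channel frame. It is only an illustration under stated assumptions: average pooling and nearest-neighbour upscaling stand in for the bicubic operators, the learnable transformation is omitted, and all function names are hypothetical.

```python
def downscale(frame, k):
    """Average-pool a 2D frame by factor k (stand-in for bicubic downscaling)."""
    h, w = len(frame), len(frame[0])
    return [[sum(frame[i * k + a][j * k + b] for a in range(k) for b in range(k)) / (k * k)
             for j in range(w // k)] for i in range(h // k)]


def upscale(frame, k):
    """Nearest-neighbour upscaling by factor k (stand-in for bicubic upscaling)."""
    return [[frame[i // k][j // k] for j in range(len(frame[0]) * k)]
            for i in range(len(frame) * k)]


def pixel_unshuffle(frame, k):
    """Rearrange an (H, W) frame into k*k channels of size (H/k, W/k)."""
    h, w = len(frame), len(frame[0])
    return [[[frame[i * k + a][j * k + b] for j in range(w // k)] for i in range(h // k)]
            for a in range(k) for b in range(k)]


def pixel_shuffle(channels, k):
    """Inverse of pixel_unshuffle: k*k channels of (h, w) back to (h*k, w*k)."""
    h, w = len(channels[0]), len(channels[0][0])
    return [[channels[(i % k) * k + (j % k)][i // k][j // k] for j in range(w * k)]
            for i in range(h * k)]


def analyze(frame, k):
    """Split a frame into a low-frequency part and an unshuffled residual."""
    low = downscale(frame, k)
    up = upscale(low, k)
    residual = [[frame[i][j] - up[i][j] for j in range(len(frame[0]))]
                for i in range(len(frame))]
    return low, pixel_unshuffle(residual, k)


def synthesize(low, high, k):
    """Mirror of analyze: upscale the low part and add the shuffled residual."""
    up = upscale(low, k)
    residual = pixel_shuffle(high, k)
    return [[up[i][j] + residual[i][j] for j in range(len(up[0]))]
            for i in range(len(up))]


frame = [[float(4 * i + j) for j in range(4)] for i in range(4)]
low, high = analyze(frame, 2)
restored = synthesize(low, high, 2)
```

With a scaling ratio of 2, the 4×4 frame yields a 2×2 low-frequency part and four 2×2 residual channels, and the synthesis step recovers the input exactly. This is the sense in which the residual carries the information that downscaling would otherwise discard.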

3.2 A Self-conditioned Probabilistic Model

Directly optimizing the density in Eq. (1), which is conditioned on the quantized LR video, through gradient descent is unstable due to the non-smooth gradient [24] of the quantization module. Thus, during training we instead optimize the density conditioned on the LF component before quantization. Specifically, we represent the high-frequency component as a continuous multi-modal probability distribution conditioned on the low-frequency component and defined at every spatial-temporal location (Eq. (2)).

We use a continuous Gaussian Mixture Model (GMM) [40] with a fixed number of components to approximate this distribution. The distributions are defined by the learnable mixture weights, means and log variances. With these parameters, the distribution at each spatial-temporal location is fully determined.
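For reference, a standard way to write such a mixture, consistent with the description above, is sketched below. The symbols are assumptions chosen for this sketch and are not necessarily the notation of the original equations: $f_h(o)$ is the HF value at spatial-temporal location $o$, $f_l$ is the LF component, $K$ is the number of components, and $\alpha_k$, $\mu_k$, $s_k$ are the predicted mixture weights, means and log variances.

```latex
p\big(f_h(o) \mid f_l\big)
  = \sum_{k=1}^{K} \alpha_k(o)\,
    \mathcal{N}\big(f_h(o);\ \mu_k(o),\ e^{s_k(o)}\big),
\qquad \sum_{k=1}^{K} \alpha_k(o) = 1,

\mathcal{N}\big(f;\ \mu,\ e^{s}\big)
  = \frac{1}{\sqrt{2\pi e^{s}}}
    \exp\!\left(-\frac{(f-\mu)^2}{2 e^{s}}\right).
```

Predicting the log variance $s$ rather than the variance itself keeps the variance $e^{s}$ positive without any explicit constraint on the network output.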

3.3 Spatial-temporal Prior Network (STP-Net)

As shown in Fig. 1 (d), to estimate the parameters of the distribution above, we propose the STP-Net to model both the local and global temporal information. We first utilize the Dense2D-T block to extract short-term spatial-temporal features for each input frame. In this stage, only information from local frames, i.e., the previous or the next frames, is aggregated into the current frame, while the temporally long-range dependencies in videos are neglected. Therefore, we further introduce the attention mechanism for modeling the global temporal information. More specifically, the spatial dimension of the short-term spatial-temporal features is first reduced by a spatial aggregator, which is implemented as an average pooling operation followed by a fully-connected (FC) layer. The output scale of the pooling operation is 32×32. Then we use the dot-product operation to generate the attention map, which represents the similarity scores between every two frames. Finally, we refine the local spatial-temporal features based on the similarity scores. We repeat the above procedure twice to extract better video features. After that, a 3-layer multi-layer perceptron (MLP) is used to estimate the parameters of the GMM distribution, where the linear layers are implemented as 3D convolutions.
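The global aggregation step can be sketched as follows. This is a minimal illustration, not the STP-Net itself: it assumes each frame has already been reduced to a pooled descriptor vector, it normalizes the dot-product scores with a softmax (an assumption, since the normalization is not specified above), and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(descriptors, features):
    """Refine per-frame features with frame-to-frame similarity scores.

    descriptors[t]: pooled descriptor vector of frame t (after spatial aggregation).
    features[t]:    flattened local spatial-temporal feature of frame t.
    """
    refined = []
    for query in descriptors:
        # Dot product between this frame and every frame gives the similarity scores.
        scores = [sum(q * k for q, k in zip(query, key)) for key in descriptors]
        weights = softmax(scores)
        # Each refined feature is a similarity-weighted sum over all frames.
        refined.append([sum(w * f[i] for w, f in zip(weights, features))
                        for i in range(len(features[0]))])
    return refined


# Three frames with identical descriptors attend to each other uniformly.
descriptors = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
features = [[0.0], [3.0], [6.0]]
refined = temporal_attention(descriptors, features)
```

Because every frame attends to every other frame, the refined feature of a frame can draw on temporally distant frames, which the local Dense2D-T stage alone cannot reach.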


3.4 Quantization and Storage Medium

We use the rounding operation as the quantization module, and store the output LR videos in a lossless format, i.e., H.265 lossless mode. The gradient of the module is calculated by the Straight-Through Estimator [2]. We also discuss how to adapt the framework to more practical lossy video formats such as H.264 and H.265 in Section 3.6.
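The behaviour of rounding with a straight-through gradient can be made explicit with a hand-written forward and backward pass. This is a minimal sketch with hypothetical function names; it does not use any automatic differentiation library.

```python
def quantize_forward(values):
    """Round each value to the nearest integer (the quantization step).

    Python's round() sends exact .5 ties to the nearest even integer.
    """
    return [float(round(v)) for v in values]


def quantize_backward(grad_output):
    """Straight-through estimator: treat rounding as identity in the backward pass."""
    return list(grad_output)


features = [0.2, 1.7, 254.6]
quantized = quantize_forward(features)

# Gradient of a squared-error loss with respect to the quantized values ...
targets = [0.0, 1.0, 255.0]
grad_quantized = [2.0 * (q - t) for q, t in zip(quantized, targets)]
# ... is passed unchanged to the unquantized features.
grad_features = quantize_backward(grad_quantized)
```

The true derivative of rounding is zero almost everywhere, which would block training. The straight-through estimator replaces it with the identity so that the downscaler still receives a learning signal.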

3.5 Training Strategy

Building a learned video rescaling framework is non-trivial, especially since the generated low-resolution videos are expected to benefit both the upscaling procedure and the downstream tasks. We consider the following objectives.

Self-conditioned Probability Learning. First, to make sure the STP-Net can obtain an accurate estimation for the HF component, we directly minimize the negative log-likelihood of the HF component under the distribution in Eq. (2), summed over all training samples (Eq. (5)).
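A minimal sketch of this objective for scalar values is given below, assuming the Gaussian mixture parameterization of Section 3.2 (mixture weights, means and log variances). The function names are hypothetical, and the log-sum-exp step is a standard numerical safeguard rather than something stated above.

```python
import math


def gmm_log_density(value, weights, means, log_vars):
    """Log-density of a scalar under a 1D Gaussian mixture, computed via log-sum-exp."""
    terms = []
    for w, mu, s in zip(weights, means, log_vars):
        log_gauss = -0.5 * (math.log(2.0 * math.pi) + s + (value - mu) ** 2 / math.exp(s))
        terms.append(math.log(w) + log_gauss)
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def negative_log_likelihood(values, params):
    """Sum of negative log-densities; params[i] = (weights, means, log_vars) for values[i]."""
    return -sum(gmm_log_density(v, *p) for v, p in zip(values, params))


observed = [0.1, -0.2]
good = [([0.5, 0.5], [0.0, 0.1], [0.0, 0.0])] * 2   # components centred near the data
bad = [([0.5, 0.5], [5.0, 6.0], [0.0, 0.0])] * 2    # components far from the data
nll_good = negative_log_likelihood(observed, good)
nll_bad = negative_log_likelihood(observed, bad)
```

Parameters that place probability mass near the observed HF values give a lower loss, which is what drives the STP-Net towards accurate predictions.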

Mimicking Bicubic downscaling. The downscaled video is preferred to be similar to the original video, which makes its deployment for downstream tasks easier. Therefore, we regularize the downscaled video before quantization, i.e., the LF component, to mimic the bicubically downsampled input video.

Penalizing the learnable transformation. Without any extra constraint, Eq. (5) can be easily minimized by tuning the HF component to one constant tensor for any input video. Thus, to avoid this trivial solution, the CNN parts of the frequency analyzer and synthesizer are penalized by the photometric loss between the video directly reconstructed from the analyzer features and the original input video.
Minimizing reconstruction difference. Finally, the expected difference between the reconstructed video sampled from the model and the original video should be minimized. This difference is measured by a photometric loss between the original video and the output of the frequency synthesizer, whose input is the channel-wise concatenation of the LR video and the sampled HF component. In each training iteration, the HF component is sampled from the distribution constructed from the parameters output by the STP-Net, conditioned on the LR video. To enable end-to-end optimization, we apply the “reparametrization trick” [26, 41, 13] to make the sampling procedure differentiable. More details are provided in the supplementary material.
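The reparametrization idea can be sketched for a single Gaussian component as below. This is an assumption-laden illustration: how the discrete choice of mixture component is handled is deferred to the supplementary material and is not covered here, and the function name is hypothetical.

```python
import math
import random


def sample_gaussian_reparam(mean, log_var, rng):
    """Draw mean + std * eps with eps ~ N(0, 1).

    Once eps is fixed, the sample is a deterministic, differentiable function
    of (mean, log_var), so gradients can flow back to the predicted parameters.
    """
    eps = rng.gauss(0.0, 1.0)
    std = math.exp(0.5 * log_var)
    return mean + std * eps, eps


rng = random.Random(0)
sample, eps = sample_gaussian_reparam(mean=2.0, log_var=math.log(0.25), rng=rng)

# With eps held fixed: d(sample)/d(mean) = 1 and d(sample)/d(log_var) = 0.5 * std * eps.
grad_mean = 1.0
grad_log_var = 0.5 * math.exp(0.5 * math.log(0.25)) * eps
```

The randomness is moved into `eps`, which does not depend on the network, so the reconstruction loss can be back-propagated through the sampled HF component.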

The total loss is then a weighted combination of the four losses above, where each loss has its own balancing parameter and the weighting also involves the scaling ratio. Although the loss function of our framework may seem somewhat complicated, the performance of our framework is not sensitive to these hyper-parameters, and directly setting all of them to 1 already achieves reasonable performance.

3.6 Application I: Video Compression

In this section, we extend the proposed SelfC framework to a lossy video compression system and aim to demonstrate the effectiveness of our approach in reducing the storage size. The whole system is shown in Fig. 3. Specifically, we first use the SelfC framework to generate the downscaled video, which is then compressed by an existing codec, e.g., H.265. At the decoder side, the compressed video is decompressed and upscaled to the full-resolution video.

Since traditional video codecs are non-differentiable, we further propose a novel optimization strategy. Specifically, we introduce a differentiable surrogate video perturbator, which is implemented as a deep neural network (DNN) consisting of 6 Dense2D-T blocks. During the back-propagation stage, the gradient of the codec is approximated by that of the surrogate, which is tractable. During the test stage, the surrogate DNN is removed and we directly use the H.265 codec for compression and decompression.

According to the control variates theory [10, 12], the surrogate can be a low-variance gradient estimator for the video codec when (1) the differences between the outputs of the two functions are minimized and (2) the correlation coefficients of the two output distributions are maximized.

Therefore, we introduce these two constraints into the optimization procedure of the proposed SelfC-based video compression system. The loss function for the surrogate video perturbator combines the two terms, where the balancing weight is set to a small value, i.e., 0.001, and the correlation coefficient is estimated within each batch by Monte Carlo sampling. Finally, the total loss function for the SelfC-based video compression system combines this surrogate loss with the rescaling objectives of Section 3.5.

Figure 3: We introduce a surrogate DNN to calculate the gradient of the non-differentiable codec. We take the H.265 codec as an example.
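The two constraints on the surrogate can be sketched as a single scalar loss. This is only one plausible reading and not the exact loss of the method: the mean squared difference, the Pearson correlation, and the assumption that the small weight of 0.001 multiplies the correlation term are all choices made for this illustration, and the function names are hypothetical.

```python
import math


def batch_correlation(a, b):
    """Pearson correlation between two equally long lists of scalars."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    # The small constant guards against division by zero for constant inputs.
    return cov / math.sqrt(var_a * var_b + 1e-12)


def surrogate_loss(codec_out, surrogate_out, weight=0.001):
    """Small output difference and high correlation both lower the loss."""
    n = len(codec_out)
    diff = sum((c - s) ** 2 for c, s in zip(codec_out, surrogate_out)) / n
    return diff - weight * batch_correlation(codec_out, surrogate_out)


codec_out = [1.0, 2.0, 3.0, 4.0]           # stand-in for decoded codec outputs in a batch
matching = [1.0, 2.0, 3.0, 4.0]            # surrogate that agrees with the codec
anti_correlated = [4.0, 3.0, 2.0, 1.0]     # surrogate that disagrees with the codec
loss_matching = surrogate_loss(codec_out, matching)
loss_anti = surrogate_loss(codec_out, anti_correlated)
```

In a real system the lists would be replaced by batches of decoded frames, and the codec outputs would be treated as constants, since no gradient flows through the codec.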

3.7 Application II: Efficient Action Recognition

We further apply the proposed SelfC framework to the video action recognition task. Specifically, we adopt the LR videos downscaled by our framework as the input of action recognition CNNs for efficient action recognition. Since the downscaler of our approach can preserve meaningful information for downstream tasks and its own complexity can be rather low, inserting the downscaler before off-the-shelf action CNNs can greatly reduce their computational complexity with a negligible performance drop. Moreover, the light weight of the rescaling framework makes the joint optimization tractable. In fact, compared with the bicubic downscaling operation, the downscaler in our SelfC framework can still generate more informative low-resolution videos for the action recognition task even without the joint training procedure. Please see Section 4.5 for more experimental results.

4 Experiments

4.1 Dataset

We use Vimeo90K dataset [65] as our training data, which is also adopted by the recent video super resolution methods [21, 22, 59] and video compression methods [32, 18, 31]. For video rescaling task, the evaluation datasets are the test set of Vimeo90K (denoted by Vimeo90K-T), the widely-used Vid4 benchmark [30] and SPMCs-30 dataset [48]. For video compression task, the evaluation datasets include UVG [53], MCL-JCV [57], VTL [54] and HEVC Class B [46]. For video recognition task, we train and evaluate it on two large scale datasets requiring temporal relation reasoning, i.e., Something V1&V2 [11][36].

Figure 4: Qualitative comparison on the reconstruction of the 4× downscaled Calendar clip. Best viewed by zooming in.

| Downscaling | Upscaling | #Frame | FLOPs | #Param. | Calendar (Y) | City (Y) | Foliage (Y) | Walk (Y) | Vid4-avg (Y) | Vid4-avg (RGB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bicubic | Bicubic | 1 | N/A | N/A | 18.83/0.4936 | 23.84/0.5234 | 21.52/0.4438 | 23.01/0.7096 | 21.80/0.5426 | 20.37/0.5106 |
| Bicubic | SPMC [48] | 3 | - | - | -/- | -/- | -/- | -/- | 25.52/0.76 | -/- |
| Bicubic | Liu et al. [30] | 5 | - | - | 21.61/- | 26.29/- | 24.99/- | 28.06/- | 25.23/- | -/- |
| Bicubic | TOFlow [65] | 7 | 0.81T | 1.41M | 22.29/0.7273 | 26.79/0.7446 | 25.31/0.7118 | 29.02/0.8799 | 25.85/0.7659 | 24.39/0.7438 |
| Bicubic | FRVSR [42] | 2 | 0.14T | 5.05M | 22.67/0.7844 | 27.70/0.8063 | 25.83/0.7541 | 29.72/0.8971 | 26.48/0.8104 | 25.01/0.7917 |
| Bicubic | DUF-52L [23] | 7 | 0.62T | 5.82M | 24.17/0.8161 | 28.05/0.8235 | 26.42/0.7758 | 30.91/0.9165 | 27.38/0.8329 | 25.91/0.8166 |
| Bicubic | RBPN [16] | 7 | 9.30T | 12.2M | 24.02/0.8088 | 27.83/0.8045 | 26.21/0.7579 | 30.62/0.9111 | 27.17/0.8205 | 25.65/0.7997 |
| Bicubic | EDVR-L [59] | 7 | 0.93T | 20.6M | 24.05/0.8147 | 28.00/0.8122 | 26.34/0.7635 | 31.02/0.9152 | 27.35/0.8264 | 25.83/0.8077 |
| Bicubic | PFNL [69] | 7 | 0.70T | 3.00M | 23.56/0.8232 | 28.11/0.8366 | 26.42/0.7761 | 30.55/0.9103 | 27.16/0.8365 | 25.67/0.8189 |
| Bicubic | RLSP [9] | 3 | 0.09T | 4.21M | 24.36/0.8235 | 28.22/0.8362 | 26.66/0.7821 | 30.71/0.9134 | 27.48/0.8388 | 25.69/0.8153 |
| Bicubic | TGA [22] | 7 | 0.23T | 5.87M | 24.50/0.8285 | 28.50/0.8442 | 26.59/0.7795 | 30.96/0.9171 | 27.63/0.8423 | 26.14/0.8258 |
| Bicubic | RSDN 9-128 [21] | 2 | 0.13T | 6.19M | 24.60/0.8355 | 29.20/0.8527 | 26.84/0.7931 | 31.04/0.9210 | 27.92/0.8505 | 26.43/0.8349 |
| IRN | IRN [62] | 1 | 0.20T | 4.36M | 25.05/0.8424 | 29.84/0.8825 | 28.79/0.8604 | 34.24/0.9584 | 29.48/0.8859 | 27.58/0.8618 |
| SelfC-small | SelfC-small | 7 | 0.026T | 0.68M | 26.02/0.8675 | 30.26/0.8953 | 29.73/0.8829 | 34.87/0.9747 | 30.28/0.9021 | 27.98/0.8763 |
| SelfC-large | SelfC-large | 7 | 0.093T | 2.65M | 27.12/0.9018 | 31.16/0.9171 | 30.59/0.9123 | 35.56/0.9730 | 31.11/0.9261 | 29.06/0.9065 |

Table 1: Quantitative comparison (PSNR (dB) / SSIM) on Vid4 for video rescaling. Y indicates the luminance channel. FLOPs (MAC) are calculated on an HR frame of size 720×480.

| Downscaling | Bicubic | Bicubic | Bicubic | Bicubic | Bicubic | Bicubic | Bicubic | IRN | SelfC-small | SelfC-large |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Upscaling | Bicubic | TOFlow [65] | FRVSR [42] | DUF-52L [23] | RBPN [16] | PFNL [69] | RSDN 9-128 [21] | IRN [62] | SelfC-small | SelfC-large |
| SPMCs-30 (Y) | 23.29/0.6385 | 27.86/0.8237 | 28.16/0.8421 | 29.63/0.8719 | 29.73/0.8663 | 29.74/0.8792 | -/- | 34.45/0.9358 | 34.51/0.9372 | 37.86/0.9710 |
| SPMCs-30 (RGB) | 21.83/0.6133 | 26.38/0.8072 | 26.68/0.8271 | 28.10/0.8582 | 28.23/0.8561 | 27.24/0.8495 | -/- | 32.26/0.9185 | 32.33/0.9201 | 35.19/0.9567 |
| Vimeo90K-T (Y) | 31.30/0.8687 | 34.62/0.9212 | 35.64/0.9319 | 36.87/0.9447 | 37.20/0.9458 | -/- | 37.23/0.9471 | 37.47/0.9686 | 37.18/0.9701 | 38.08/0.9762 |
| Vimeo90K-T (RGB) | 29.77/0.8490 | 32.78/0.9040 | 33.96/0.9192 | 34.96/0.9313 | 35.39/0.9340 | -/- | 35.32/0.9344 | 35.14/0.9535 | 35.02/0.9542 | 35.55/0.9627 |

Table 2: Quantitative comparison (PSNR (dB) / SSIM) on SPMCs-30 and Vimeo90K-T for video rescaling.

4.2 Implementation Details

(1) Video rescaling: The four balancing parameters of the total loss are set to 0.1, 1, 1 and 1, respectively. Each training clip consists of 7 RGB patches of size 256×256. The batch size is set to 32. We augment the training data with random horizontal flips and 90° rotations. We train our model with the Adam optimizer [25], setting β1 to 0.9 and β2 to 0.99. The total number of training iterations is about 240,000, and the learning rate is divided by 10 every 100,000 iterations. We implement the models with the PyTorch framework [37] and train them on a server with 8 NVIDIA 2080Ti GPUs. We draw 5 times from the generated distribution for each evaluation and report the averaged performance. We leverage the invertible neural network (INN) architecture to implement the CNN parts of the paired frequency analyzer and synthesizer, for a fair comparison with IRN in terms of parameter count, because the INN cuts the number of parameters by 50%. We propose the following two models: SelfC-small and SelfC-large, which consist of 2 and 8 invertible Dense2D-T blocks, respectively. The detailed architecture of this block is in the supplementary material and roughly follows [62]. Training the SelfC-large model takes about 90 hours.
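The step schedule described above can be written as a small helper. The initial learning rate is not reproduced here, so `base_lr` is a placeholder argument, and the function name is hypothetical.

```python
def step_learning_rate(base_lr, iteration, step=100_000, factor=0.1):
    """Learning rate after dividing by 10 every `step` iterations."""
    return base_lr * factor ** (iteration // step)


# Relative learning rate (base_lr = 1.0) at a few points of a 240,000-iteration run.
schedule = {it: step_learning_rate(1.0, it) for it in (0, 99_999, 100_000, 239_999)}
```

Over about 240,000 iterations this schedule decays the learning rate twice, at iterations 100,000 and 200,000, so the final phase runs at roughly one hundredth of the initial rate.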

(2) Video compression: The rescaling ratio of SelfC is set to 2. We use H.265 as our default codec in the experiments. The weight of the bicubic-mimicking loss is set to 100 so that the statistical distribution of the downscaled videos is closer to that of natural images, which stabilizes the performance of the whole system. The other details follow those of the video rescaling task. The models are initialized from the SelfC-large model, but the number of invertible Dense2D-T blocks is reduced to 4. The surrogate CNN is randomly initialized and is jointly optimized with the video rescaler.

(3) Action recognition: We insert the downscaler of our framework before the action recognition CNN (i.e., TSM [28]). The data augmentation pipeline also follows TSM. The downscaling ratio is 2. At inference time, we use just 1 clip per video, and each clip contains 8 frames. We adopt 2 plain Dense2D-T blocks with an intermediate channel number of 12 as the CNN part of the frequency analyzer. Note that the downscaler is first pretrained on the Vimeo90K dataset with the video rescaling task.

4.3 Results of Video Rescaling

As shown in Tab. 1 and Tab. 2, our method outperforms the recent state-of-the-art video super-resolution methods on Vid4, SPMCs-30 and Vimeo90K-T by a large margin in terms of both PSNR and SSIM. For example, the average PSNR(Y) on the Vid4 dataset for our SelfC-large is 31.11dB, while the corresponding result for the state-of-the-art video super-resolution approach RSDN is only 27.92dB. Furthermore, we also provide the results of an image rescaling method, i.e., IRN, in Tab. 1 and Tab. 2. Our method outperforms IRN on Vid4 and SPMCs-30 while also reducing the computational complexity by about 2 times (SelfC-large) or 8 times (SelfC-small). This result demonstrates that it is necessary to exploit the temporal relationship for the video rescaling task, a cue that existing image rescaling methods like IRN ignore. We also show the qualitative comparison with other methods on Vid4 in Fig. 4. Please refer to the supplementary material for more qualitative results. Our method produces much better details and sharper images than both the video super-resolution methods and the image rescaling method, which supports the superiority of the video rescaling paradigm.

Figure 5: Comparison between the proposed method with H.265, H.264 and the learning based video codec DVC [32].

4.4 Results of Video Compression

For a fair comparison, both the standard H.265 codec and the codec embedded in our framework follow the same setting as in [32] and use FFmpeg with the very fast mode. The evaluation metrics are PSNR and MS-SSIM.

Fig. 5 shows the experimental results on the UVG and MCL-JCV datasets. Our method outperforms both the traditional methods and the learning-based method on the video compression task by a large margin. On the UVG dataset, the proposed method achieves about 0.8dB gain at the same Bpp level in comparison with H.265. Although our method is optimized only with the photometric loss, it demonstrates strong performance in terms of both PSNR and MS-SSIM metrics.

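| Dataset | UVG | VTL | MCL-JCV | Average |
| --- | --- | --- | --- | --- |
| BDBR (%) at the same PSNR | -39.86 | -27.45 | -16.75 | -28.02 |
| BDBR (%) at the same MS-SSIM | -29.01 | -14.42 | -26.59 | -23.34 |

Table 3: BDBR results using H.265 as the anchor. The lower the value, the more bit cost is saved.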

We also evaluate the Bjøntegaard Delta Bit-Rate (BDBR) [3] by using H.265 as the anchor method, and calculate the average bit-rate difference at the same PSNR or MS-SSIM, which indicates the storage burden reduction quantitatively. As shown in Tab. 3, our method saves over 25% of the bit cost on average under the same PSNR and over 20% under the same MS-SSIM. Notably, we reduce the bit cost by about 40% on the UVG dataset. This shows that the video rescaling technique is a novel and effective way to improve video compression performance without having to deal with the complicated details of the codecs, especially for industrial lossy codecs.

We perform more analysis to verify the effectiveness of the “Video rescaling+Codec” paradigm and the proposed gradient estimation method. As shown in Fig. 6, it is observed that using Bicubic as the downscaler and upscaler in the video compression system (i.e., H.265+Bicubic) leads to much inferior result than the baseline. We also try to improve the result by using a state-of-the-art video super resolution method, i.e., TGA [22]. The performance is indeed improved though still lower than the baseline method H.265. Considering the network parameters of TGA are 5.87M while ours are only 2.65M, this result further demonstrates the effectiveness of our SelfC framework. Finally, we provide experimental results (i.e., Ours W/O Gradient) when directly using the biased Straight-Through Estimator [2] for H.265 codec. The results show that the proposed gradient estimation method in Section 3.6 can bring nearly 0.3dB improvements.

Figure 6: Comparison of our “Video rescaling+Codec” scheme with other paradigms.
Figure 7: Visualization of the high-frequency component sampled from the learned distribution. Panels: input HR video, LR video, HF component, and two samples drawn from the learned distribution. We also compare the difference between the videos downscaled by our method and by Bicubic. Since the magnitude of the difference is small, we amplify it by 10 times for better visualization.

4.5 Results of Efficient Video Action Recognition

We show the video action recognition results in Tab. 4. In the first group, when directly testing the action model pretrained on full-resolution videos, we observe that the performance on low-resolution videos downscaled by Bicubic and by our method both drop drastically, because the classification networks are rather sensitive to the absolute scale of the input. However, our downscaler still performs better (32.7% vs. 30.4% Top1 on the Something V1 dataset).

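| Downscaler | Action CNN | FLOPs | Params | V1 Top1 (%) | V1 Top5 (%) | V2 Top1 (%) | V2 Top5 (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HR | TSM | 33G | 24.31M | 45.6 | 74.2 | 58.8 | 85.4 |
| Bicubic | TSM | 8.6G | 24.31M | 30.4 | 57.5 | 40.1 | 68.7 |
| Ours | TSM | 10.8G | 24.35M | 32.7 | 59.5 | 41.8 | 71.5 |
| Bicubic (FT) | TSM | 8.6G | 24.31M | 42.1 | 72.1 | 56.0 | 83.9 |
| Ours (FT) | TSM | 10.8G | 24.35M | 43.5 | 72.8 | 57.4 | 84.8 |
| Ours (E2E) | TSM | 10.8G | 24.35M | 44.6 | 73.3 | 58.3 | 85.5 |

Table 4: Comparison between our method and the Bicubic downscaler on Something V1 and V2. FT denotes only fine-tuning the action recognition CNN. E2E denotes also fine-tuning the downscaler.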

In the second group, we provide the experimental results when the action recognition CNN is fine-tuned on the low-resolution videos produced by Bicubic and by our downscaler in SelfC. The details of the fine-tuning procedure are in the supplementary material. Our method clearly outperforms bicubic downscaling, by about 1.4 percentage points in Top1 accuracy on both datasets (43.5% vs. 42.1% on Something V1 and 57.4% vs. 56.0% on Something V2). Notably, our downscaler is learnable. Therefore, we then fine-tune the action recognition CNN and our downscaler jointly. The results are in the third group. The end-to-end joint training further improves the performance by an obvious margin. On Something V2, the final performance of our method nearly matches that of performing recognition directly on HR videos, while our method improves the efficiency by over 3 times. The downscaler of IRN cannot improve the efficiency of this task because its computation cost is even larger than that of the HR setting. We tried decreasing the number of layers of IRN, but it then no longer converges.
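The efficiency claims can be checked directly against the FLOPs and accuracy figures reported in Tab. 4. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
hr_flops = 33.0      # GFLOPs of TSM on HR input (Tab. 4)
selfc_flops = 10.8   # GFLOPs of our downscaler plus TSM on LR input (Tab. 4)

speedup = hr_flops / selfc_flops            # about 3.06, i.e., "over 3 times"
reduction = 1.0 - selfc_flops / hr_flops    # about 0.67, i.e., more than 60% fewer FLOPs

top1_hr_v2 = 58.8    # Top1 (%) of the HR model on Something V2
top1_e2e_v2 = 58.3   # Top1 (%) of Ours (E2E) on Something V2
gap_v2 = top1_hr_v2 - top1_e2e_v2           # about 0.5 percentage points
```

The roughly 3× speedup therefore comes at a cost of about half a percentage point of Top1 accuracy on Something V2.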

4.6 Ablation Studies on the Framework

In this section, we conduct experiments on the video rescaling task to verify the effectiveness of the components in our framework. We first define the following two baselines: (1) IRN [62], which is the most recent state-of-the-art image rescaling method. For a fair comparison, we retrain it on the Vimeo90K dataset using the code open-sourced by the authors. (2) Auto-Enc, which is a simple auto encoder-decoder architecture obtained by removing the STP-Net from our model. The experimental results are shown in Tab. 5.


Methods       Backbone       Probability model  Param(M)  Vid4(Y) PSNR (dB)  Vid4(Y) SSIM
Auto-Enc      16×Dense2D-T   -                  3.63      27.82              0.8297
IRN           16×Dense2D     Normal             4.36      29.48              0.8859
IRN           16×Dense2D-T   Normal             3.63      29.01              0.8786
SelfC-basic   2×Dense2D      GMM(K=1)           0.73      29.22              0.8842
SelfC-basicT  2×Dense2D-T    GMM(K=1)           0.64      29.67              0.8864
SelfC-small   2×Dense2D-T    GMM(K=5)           0.68      30.28              0.9021
SelfC-large   8×Dense2D-T    GMM(K=5)           2.65      31.11              0.9261
Table 5: Ablation studies on 4× video rescaling. All blocks adopt the INN architecture for fair comparison with IRN.

First, the Auto-Enc baseline performs worse than both IRN and our method. This indicates that explicitly modeling the lost information is important. IRN is inferior to our small model although IRN leverages an 8 times heavier backbone. We also tried to equip IRN with temporal modeling ability by replacing its backbone from Dense2D to Dense2D-T. Surprisingly, the performance of the resulting IRN variant decreases by 0.47dB (from 29.48dB to 29.01dB). The reason is that IRN relies on a complex non-linear transformation to map the real distribution of the lost information to the normal distribution, while the transformation ability of Dense2D-T is weaker (it has 0.73M fewer parameters).

For our method, we start from the simplest model, denoted by SelfC-basic, where the backbone consists of only spatial convolutions and the STP-Net outputs only a single Gaussian distribution. The performance of this model is comparable with IRN (29.22dB vs. 29.48dB) but with about 6× fewer parameters. This supports the efficiency of the proposed self-conditioned distribution modeling scheme. Then, we introduce an improved model denoted by SelfC-basicT, whose temporal modeling ability is strengthened by changing the basic block from Dense2D to Dense2D-T. This leads to a 0.45dB improvement while reducing the number of parameters, showing the effectiveness of the Dense2D-T block for the video task. Further, we increase the number of mixture components of the GMM to 5. The resulting SelfC-small model outperforms all the baselines by a large margin (30.28dB vs. 29.48dB for IRN) with only 0.68M parameters. Our model also scales with a larger backbone network: enlarging the backbone by 4 times (SelfC-large) improves the performance by a further 0.83dB. For more ablation studies on the depth of the backbone network, the comparison of different probabilistic modeling methods, the architecture of the STP-Net, and the loss functions, please refer to the supplementary material.
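The PSNR differences and parameter ratios quoted in this ablation follow directly from Tab. 5. The sketch below reproduces them; the row label "IRN (Dense2D-T)" is only a name chosen here for the IRN variant with the temporal backbone, and the helper names are illustrative.

```python
# Parameter counts (millions) and Vid4(Y) PSNR (dB) copied from Tab. 5.
table5 = {
    "IRN":             {"params_m": 4.36, "psnr": 29.48},
    "IRN (Dense2D-T)": {"params_m": 3.63, "psnr": 29.01},  # label chosen for this sketch
    "SelfC-basic":     {"params_m": 0.73, "psnr": 29.22},
    "SelfC-basicT":    {"params_m": 0.64, "psnr": 29.67},
    "SelfC-small":     {"params_m": 0.68, "psnr": 30.28},
    "SelfC-large":     {"params_m": 2.65, "psnr": 31.11},
}


def psnr_gain(model, baseline):
    """PSNR difference in dB of `model` over `baseline`."""
    return round(table5[model]["psnr"] - table5[baseline]["psnr"], 2)


def param_ratio(model, baseline):
    """How many times more parameters `model` has than `baseline`."""
    return table5[model]["params_m"] / table5[baseline]["params_m"]


print("IRN with Dense2D-T vs. IRN:", psnr_gain("IRN (Dense2D-T)", "IRN"), "dB")
print("SelfC-basicT vs. SelfC-basic:", psnr_gain("SelfC-basicT", "SelfC-basic"), "dB")
print("SelfC-large vs. SelfC-small:", psnr_gain("SelfC-large", "SelfC-small"), "dB")
print(f"IRN / SelfC-basic parameters: {param_ratio('IRN', 'SelfC-basic'):.2f}x")
```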

4.7 Visualization Results

While the previous quantitative results validate the superiority of the proposed self-conditioned modeling scheme on several tasks, it is also interesting to investigate the intermediate components output by our model, especially the distribution of the high-frequency (HF) component predicted by STP-Net. Since the distribution is a mixture of Gaussians and includes multiple channels, we draw two samples of the HF component from the predicted distribution and randomly select one channel of them for visualization. The output of the frequency analyzer is adopted as the ground-truth sample.
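The sampling step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes that the predicted distribution is given as per-position mixture weights, means, and standard deviations of shape (K, C, H, W) for a single frame, and the random parameters below merely stand in for the STP-Net output.

```python
import numpy as np


def sample_gmm(weights, means, stds, rng):
    """Draw one sample per position from a K-component Gaussian mixture.

    weights, means, stds: arrays of shape (K, C, H, W); weights sum to 1 over K.
    Returns an array of shape (C, H, W).
    """
    num_components = weights.shape[0]
    # Pick a component at every position by inverting the cumulative weights.
    cdf = np.cumsum(weights, axis=0)
    u = rng.random(weights.shape[1:])
    idx = (u[None] > cdf).sum(axis=0)
    idx = np.minimum(idx, num_components - 1)  # guard against rounding in cumsum
    # Gather the parameters of the chosen component and sample from it.
    mu = np.take_along_axis(means, idx[None], axis=0)[0]
    sigma = np.take_along_axis(stds, idx[None], axis=0)[0]
    return mu + sigma * rng.standard_normal(mu.shape)


rng = np.random.default_rng(0)
K, C, H, W = 5, 4, 8, 8
# Placeholder mixture parameters (hypothetical; a real run would use the network output).
logits = rng.standard_normal((K, C, H, W))
weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
means = rng.standard_normal((K, C, H, W))
stds = 0.1 + rng.random((K, C, H, W))

# Two independent samples, then one randomly chosen channel for display.
samples = [sample_gmm(weights, means, stds, rng) for _ in range(2)]
channel = rng.integers(C)
views = [s[channel] for s in samples]
print(channel, views[0].shape)
```

Because each draw first picks a component and then adds Gaussian noise, the two views generally differ from each other, which is the randomness visible between the samples in the visualization.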

As shown in Fig. 7, we first see that, compared to Bicubic, the LR video downscaled by our method is modulated to carry information that makes the HF components easier to reconstruct. Also, the sampled HF components accurately restore the key structures of the ground truth, e.g., the windows of the building, while retaining a certain degree of randomness. This is consistent with our learning objectives.
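The difference map in Fig. 7 is amplified by 10 times for display. A possible way to compute such a map is sketched below; the caption does not state whether the absolute or the signed difference is shown, so the absolute value, the [0, 1] intensity range, and the function name are all assumptions of this sketch.

```python
import numpy as np


def amplified_difference(lr_ours, lr_bicubic, gain=10.0):
    """Return |lr_ours - lr_bicubic| scaled by `gain` and clipped to [0, 1].

    Both inputs are assumed to be arrays of the same shape with values in [0, 1].
    """
    diff = np.abs(lr_ours.astype(np.float64) - lr_bicubic.astype(np.float64))
    return np.clip(gain * diff, 0.0, 1.0)


# Toy frames standing in for the two downscaled videos (hypothetical data).
rng = np.random.default_rng(0)
lr_bicubic = rng.random((3, 16, 16))
lr_ours = np.clip(lr_bicubic + 0.02 * rng.standard_normal((3, 16, 16)), 0.0, 1.0)

diff_map = amplified_difference(lr_ours, lr_bicubic)
print(diff_map.shape, float(diff_map.max()))
```

Clipping keeps the amplified map in the displayable range; differences larger than 0.1 saturate at a gain of 10.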

5 Conclusion

We have proposed a video rescaling framework that learns a pair of downscaling and upscaling operations. Extensive experiments demonstrate that our method outperforms previous methods by a large margin while using far fewer parameters and much less computation. Moreover, the learned downscaling operator significantly benefits the tasks of video compression and efficient action recognition.


  • [1] E. Agustsson, D. Minnen, N. Johnston, J. Balle, S. J. Hwang, and G. Toderici (2020) Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8503–8512. Cited by: §2.
  • [2] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv. Cited by: §3.4, §4.4.
  • [3] G. Bjontegaard (2001) Calculation of average psnr differences between rd-curves. VCEG-M33. Cited by: §4.4.
  • [4] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, Cited by: §2.
  • [5] A. Djelouah, J. Campos, S. Schaub-Meyer, and C. Schroers (2019) Neural inter-frame compression for video coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6421–6429. Cited by: §2.
  • [6] Y. Fan, X. Lu, D. Li, and Y. Liu (2016) Video-based emotion recognition using cnn-rnn and c3d hybrid networks. In Proceedings of the 18th ACM international conference on multimodal interaction, pp. 445–450. Cited by: §1.
  • [7] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) Slowfast networks for video recognition. In ICCV, Cited by: §2.
  • [8] C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In CVPR, Cited by: §2.
  • [9] D. Fuoli, S. Gu, and R. Timofte (2019) Efficient video super-resolution through recurrent latent space propagation. In ICCVW, Cited by: Table 1.
  • [10] P. W. Glynn and R. Szechtman (2002) Some new perspectives on the method of control variates. In Monte Carlo and Quasi-Monte Carlo Methods 2000, Cited by: §3.6.
  • [11] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017) The "something something" video database for learning and evaluating visual common sense. In ICCV, Cited by: §4.1.
  • [12] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud (2017) Backpropagation through the void: optimizing control variates for black-box gradient estimation. arXiv. Cited by: §3.6.
  • [13] A. Graves (2016) Stochastic backpropagation through mixture density distributions. arXiv. Cited by: §3.5.
  • [14] A. Habibian, T. v. Rozendaal, J. M. Tomczak, and T. S. Cohen (2019) Video compression with rate-distortion autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7033–7042. Cited by: §2.
  • [15] K. Hara, H. Kataoka, and Y. Satoh (2017) Learning spatio-temporal features with 3d residual networks for action recognition. In ICCVW, Cited by: §2.
  • [16] M. Haris, G. Shakhnarovich, and N. Ukita (2019) Recurrent back-projection network for video super-resolution. In CVPR, Cited by: §2, Table 1, Table 2.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §2.
  • [18] Z. Hu, Z. Chen, D. Xu, G. Lu, W. Ouyang, and S. Gu (2020) Improving deep video compression by resolution-adaptive flow coding. In ECCV, Cited by: §4.1.
  • [19] Z. Hu, G. Lu, and D. Xu (2021) FVC: a new framework towards deep video compression in feature space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1502–1511. Cited by: §2.
  • [20] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer (2014) Densenet: implementing efficient convnet descriptor pyramids. In arXiv, Cited by: §3.1.
  • [21] T. Isobe, X. Jia, S. Gu, S. Li, S. Wang, and Q. Tian (2020) Video super-resolution with recurrent structure-detail network. In ECCV, Cited by: §4.1, Table 1, Table 2.
  • [22] T. Isobe, S. Li, X. Jia, S. Yuan, G. Slabaugh, C. Xu, Y. Li, S. Wang, and Q. Tian (2020) Video super-resolution with temporal group attention. In CVPR, Cited by: §4.1, §4.4, Table 1.
  • [23] Y. Jo, S. Wug Oh, J. Kang, and S. Joo Kim (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In CVPR, Cited by: §2, Table 1, Table 2.
  • [24] H. Kim, M. Choi, B. Lim, and K. Mu Lee (2018) Task-aware image downscaling. In ECCV, Cited by: §1, §2, §3.2.
  • [25] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In arXiv, Cited by: §4.2.
  • [26] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv. Cited by: §3.5.
  • [27] Y. Li, D. Liu, H. Li, L. Li, Z. Li, and F. Wu (2018) Learning a convolutional neural network for image compact-resolution. In TIP, Cited by: §1, §2.
  • [28] J. Lin, C. Gan, and S. Han (2019) Tsm: temporal shift module for efficient video understanding. In ICCV, Cited by: §2, §4.2.
  • [29] J. Lin, D. Liu, H. Li, and F. Wu (2020) M-lvc: multiple frames prediction for learned video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3546–3554. Cited by: §2.
  • [30] C. Liu and D. Sun (2013) On bayesian adaptive video super resolution. In TPAMI, Cited by: §1, §2, §4.1, Table 1.
  • [31] G. Lu, C. Cai, X. Zhang, L. Chen, W. Ouyang, D. Xu, and Z. Gao (2020) Content adaptive and error propagation aware deep video compression. In European Conference on Computer Vision, pp. 456–472. Cited by: §2, §4.1.
  • [32] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao (2019) Dvc: an end-to-end deep video compression framework. In CVPR, Cited by: §2, Figure 5, §4.1, §4.4.
  • [33] G. Lu, W. Ouyang, D. Xu, X. Zhang, Z. Gao, and M. Sun (2018) Deep kalman filtering network for video compression artifact reduction. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 568–584. Cited by: §1.
  • [34] G. Lu, X. Zhang, W. Ouyang, L. Chen, Z. Gao, and D. Xu (2020) An end-to-end learning framework for video compression. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • [35] G. Lu, X. Zhang, W. Ouyang, D. Xu, L. Chen, and Z. Gao (2019) Deep non-local kalman network for video compression artifact reduction. IEEE Transactions on Image Processing 29, pp. 1725–1737. Cited by: §1.
  • [36] F. Mahdisoltani, G. Berger, W. Gharbieh, D. Fleet, and R. Memisevic (2018) Fine-grained video classification and captioning. In arXiv, Cited by: §4.1.
  • [37] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In NeurIPS, Cited by: §4.2.
  • [38] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli (2003) Image denoising using scale mixtures of gaussians in the wavelet domain. In TIP, Cited by: §2.
  • [39] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, Cited by: §2.
  • [40] D. A. Reynolds (2009) Gaussian mixture models. In EOB, Cited by: §3.2.
  • [41] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In ICML, Cited by: §3.5.
  • [42] M. S. Sajjadi, R. Vemulapalli, and M. Brown (2018) Frame-recurrent video super-resolution. In CVPR, Cited by: Table 1, Table 2.
  • [43] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, Cited by: §3.1.
  • [44] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, Cited by: §1, §2.
  • [45] V. Strela, J. Portilla, and E. P. Simoncelli (2000) Image denoising using a local gaussian scale mixture model in the wavelet domain. In Wavelet Applications in Signal and Image Processing VIII, Cited by: §2.
  • [46] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand (2012) Overview of the high efficiency video coding (hevc) standard. In TCSVT, Cited by: §1, §2, §4.1.
  • [47] W. Sun and Z. Chen (2020) Learned image downscaling for upscaling using content adaptive resampler. In TIP, Cited by: §1.
  • [48] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia (2017) Detail-revealing deep video super-resolution. In ICCV, Cited by: §1, §2, §4.1, Table 1.
  • [49] Y. Tian, Z. Che, W. Bao, G. Zhai, and Z. Gao (2020) Self-supervised motion representation via scattering local motion cues. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pp. 71–89. Cited by: §2.
  • [50] Y. Tian, X. Min, G. Zhai, and Z. Gao (2019) Video-based early asd detection via temporal pyramid networks. In ICME, Cited by: §1.
  • [51] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In ICCV, Cited by: §2.
  • [52] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In CVPR, Cited by: §2.
  • [53] Ultra video group test sequences. Accessed: 2019-11-06. Cited by: §4.1.
  • [54] Video trace library. Accessed: 2019-11-06. Cited by: §4.1.
  • [55] M. J. Wainwright, E. P. Simoncelli, and A. S. Willsky (2001) Random cascades on wavelet trees and their use in analyzing and modeling natural images. In ACHA, Cited by: §2.
  • [56] M. J. Wainwright and E. Simoncelli (1999) Scale mixtures of gaussians and the statistics of natural images. In NeurIPS, Cited by: §2.
  • [57] H. Wang, W. Gan, S. Hu, J. Y. Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C. J. Kuo (2016) MCL-jcv: a jnd-based h.264/avc video quality assessment dataset. In ICIP, Cited by: §4.1.
  • [58] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2018) Temporal segment networks for action recognition in videos. In TPAMI, Cited by: §2.
  • [59] X. Wang, K. C. Chan, K. Yu, C. Dong, and C. Change Loy (2019) Edvr: video restoration with enhanced deformable convolutional networks. In CVPRW, Cited by: §1, §2, §4.1, Table 1.
  • [60] Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In ACSSC, Cited by: §4.4.
  • [61] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra (2003) Overview of the h.264/avc video coding standard. TCSVT. Cited by: §1, §2.
  • [62] M. Xiao, S. Zheng, C. Liu, Y. Wang, D. He, G. Ke, J. Bian, Z. Lin, and T. Liu (2020) Invertible image rescaling. In ECCV, Cited by: §1, §2, §4.2, §4.6, Table 1, Table 2.
  • [63] B. Xu, N. Wang, T. Chen, and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. In arXiv, Cited by: Figure 2.
  • [64] Z. Xu, Y. Yang, and A. G. Hauptmann (2015) A discriminative cnn video representation for event detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1798–1807. Cited by: §1.
  • [65] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2019) Video enhancement with task-oriented flow. In IJCV, Cited by: §1, §2, §4.1, Table 1, Table 2.
  • [66] Z. Yan, G. Li, Y. Tian, J. Wu, S. Li, M. Chen, and H. V. Poor (2021) DeHiB: deep hidden backdoor attack on semi-supervised learning via adversarial perturbation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 10585–10593. Cited by: §1.
  • [67] F. Yang, Q. Sun, H. Jin, and Z. Zhou (2020) Superpixel segmentation with fully convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13964–13973. Cited by: §2.
  • [68] R. Yang, F. Mentzer, L. V. Gool, and R. Timofte (2020) Learning for video compression with hierarchical quality and recurrent enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6628–6637. Cited by: §2.
  • [69] P. Yi, Z. Wang, K. Jiang, J. Jiang, and J. Ma (2019) Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In ICCV, Cited by: Table 1, Table 2.