With the increasing popularity of video capturing devices, a tremendous number of high-resolution (HR) videos are shot every day. These HR videos are often downscaled to save storage space and streaming bandwidth, or to fit screens with lower resolutions. It is also common that the downscaled videos need to be upscaled for display on HR monitors [kim2018task, li2018learning, sun2020learned, chen2020hrnet, xiao2020invertible].
In this paper, we address the joint optimization of video downscaling and upscaling as a combined task, which is referred to as video rescaling. This task involves downscaling an HR video into a low-resolution (LR) one, followed by upscaling the resulting LR video back to HR. Our aim is to optimize the HR reconstruction quality while regularizing the LR video to offer visual quality comparable to that of the bicubic-downscaled video for human perception. It is to be noted that the rescaling task differs from the super-resolution task: at inference time, the former has access to the HR video while the latter does not.
One straightforward solution to video rescaling is to downscale an HR video by predefined kernels and upscale the LR video with super-resolution methods [8100101, lim2017enhanced, zhang2018image, wang2018esrgan, dai2019second, guo2020closed, caballero2017real, tao2017detail, sajjadi2018frame, jo2018deep, wang2019edvr, yi2019progressive, isobe2020vide, li2020mucan, isobe2020video]. With this solution, the downscaling is operated independently of the upscaling, although the upscaling can be optimized for the chosen downscaling kernels. The commonly used downscaling kernels (e.g. bicubic) suffer from losing the high-frequency information [shannon1949communication] inherent in the HR video, thus creating a many-to-one mapping between the HR and LR videos. Reconstructing the HR video by upscaling its LR representation therefore becomes an ill-posed problem. The independently operated downscaling misses the opportunity to optimize the downscaled video to mitigate this ill-posedness.
The idea of jointly optimizing downscaling and upscaling was first proposed for image rescaling [kim2018task, li2018learning, sun2020learned, chen2020hrnet]. It adds a new dimension of thinking to the studies of learning specifically to upscale for a given downscaling method [8100101, lim2017enhanced, zhang2018image, wang2018esrgan, dai2019second, guo2020closed]. Recognizing the reciprocality of the downscaling and upscaling operations, IRN [xiao2020invertible] recently introduced a coupling layer-based invertible model, which achieves much better HR reconstruction quality than the non-invertible models.
These jointly optimized image-based solutions (Fig. 1(a)) are not ideal for video rescaling. For example, a large number of prior works [caballero2017real, tao2017detail, sajjadi2018frame, jo2018deep, wang2019edvr, yi2019progressive, isobe2020vide, li2020mucan, isobe2020video] for video upscaling have adopted the Multi-Input Single-Output (MISO) strategy to reconstruct one HR frame from multiple LR frames and/or previously reconstructed HR frames (Fig. 1(b)). They demonstrate the potential for recovering the missing high-frequency component of a video frame from temporal information. However, image-based solutions do not consider temporal information. In addition, two issues remain widely open: (1) how video downscaling and upscaling could be jointly optimized, and (2) how temporal information could be utilized in the joint optimization framework to benefit both operations.
In this paper, we present two joint optimization approaches to video rescaling: the Long Short-Term Memory Video Rescaling Network (LSTM-VRN) and the Multi-Input Multi-Output Video Rescaling Network (MIMO-VRN). The LSTM-VRN downscales an HR video frame-by-frame using a coupling architecture similar to that of [xiao2020invertible], but fuses multiple downscaled LR video frames via LSTM to estimate the missing high-frequency component of an LR video frame for upscaling (Fig. 1(b)). LSTM-VRN thus shares its downscaling and upscaling strategies with the traditional video rescaling framework. In contrast, our MIMO-VRN introduces a completely new paradigm by adopting the MIMO strategy for both video downscaling and upscaling (Fig. 1(c)). We develop a group-of-frames (GoF) based coupling architecture that downscales multiple HR video frames simultaneously, with their high-frequency components also being estimated simultaneously in the upscaling process. Our contributions include the following:
- To the best of our knowledge, this work is the first attempt at jointly optimizing video downscaling and upscaling with invertible coupling architectures.
- Our LSTM-VRN and MIMO-VRN outperform the image-based invertible model [xiao2020invertible], showing significantly improved HR reconstruction quality while offering LR videos comparable to the bicubic-downscaled video in terms of visual quality.
- Our MIMO-VRN is the first scheme to introduce the MIMO strategy for video upscaling and downscaling, achieving state-of-the-art performance.
2 Related Work
This section surveys video rescaling methods, with a particular focus on their downscaling and upscaling strategies. We regard the image-based rescaling methods as possible solutions for video rescaling. Fig. 2 is a taxonomy of these prior works.
2.1 Upscaling with Predefined Downscaling
The traditional image super-resolution [8100101, lim2017enhanced, zhang2018image, wang2018esrgan, dai2019second, guo2020closed] or video super-resolution [caballero2017real, tao2017detail, sajjadi2018frame, jo2018deep, wang2019edvr, yi2019progressive, isobe2020vide, li2020mucan, isobe2020video] methods are candidate solutions to video upscaling. The former is naturally a single-input single-output (SISO) upscaling strategy, which generates one HR image from one LR image. The latter usually involves more than one LR video frame in the upscaling process, i.e. the MISO upscaling strategy, in order to leverage temporal information for better HR reconstruction quality. Most of the approaches in this category adopt a SISO downscaling strategy with a pre-defined kernel (e.g. bicubic) chosen independently of the upscaling process. Therefore, they are unable to adapt the downscaled images/videos to the upscaling.
2.2 Upscaling with Jointly Learned Downscaling
To mitigate the ill-posedness of the image upscaling task, some works learn upscaling and downscaling jointly with encoder-decoder architectures [kim2018task, li2018learning, sun2020learned, chen2020hrnet]. They turn the fixed downscaling method into a learnable model in order to adapt the LR image to the jointly learned upscaling process. The training objective usually requires the LR image to also be suitable for human perception. Recently, IRN [xiao2020invertible] introduced an invertible model [DBLP:journals/corr/DinhKB14, DBLP:conf/iclr/DinhSB17, kingma2018glow] for this joint optimization task. It performs image downscaling and upscaling with the same set of neural networks configured in a reciprocal manner, and it provides a means to model explicitly, as a Gaussian noise, the high-frequency information lost in downscaling.
2.3 Invertible Rescaling Network
IRN [xiao2020invertible] is an invertible model designed specifically for image rescaling. The forward model of IRN comprises a 2-D Haar transform and eight coupling layers [DBLP:journals/corr/DinhKB14, DBLP:conf/iclr/DinhSB17, kingma2018glow], as shown in Fig. 3. By applying the 2-D Haar transform, an input HR image x is first decomposed into one low-frequency band x_L and three high-frequency bands x_H. These two components are subsequently processed by the coupling layers in such a way that one output y becomes a visually pleasing LR image while the other output z encodes the complementary high-frequency information inherent in the input HR image x. In theory, the inverse coupling layers can recover x losslessly from y and z because the model is invertible. In practice, z is unavailable for upscaling at inference time. The training of IRN therefore requires z to follow a Gaussian distribution so that, at inference time, a Gaussian sample can be drawn as a substitute for the missing high-frequency component.
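Concretely, the enhanced coupling layer used in this family of models updates the two branches as sketched below, where φ, ρ, and η are learned sub-networks and the superscript l indexes coupling layers (notation is ours; the key property is that the inverse is computable exactly from the outputs alone):

```latex
% Forward coupling
x_L^{l+1} = x_L^{l} + \phi\big(x_H^{l}\big), \qquad
x_H^{l+1} = x_H^{l} \odot \exp\!\big(\rho(x_L^{l+1})\big) + \eta\big(x_L^{l+1}\big)

% Inverse coupling (used for upscaling)
x_H^{l} = \big(x_H^{l+1} - \eta(x_L^{l+1})\big) \odot \exp\!\big(-\rho(x_L^{l+1})\big), \qquad
x_L^{l} = x_L^{l+1} - \phi\big(x_H^{l}\big)
```

Because φ, ρ, and η are only ever evaluated in the forward direction, they need not be invertible themselves, which is what makes arbitrary convolutional sub-networks admissible.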
Although IRN achieves superior results on the image rescaling task, it is not optimal for video rescaling. Essentially, IRN is an image-based method. This work presents the first attempt at jointly optimizing video downscaling and upscaling with an invertible coupling architecture (Fig. 3).
3 Proposed Method
Given an HR video X composed of T video frames x_t, t = 1, …, T, the video rescaling task involves (1) downscaling every video frame x_t to its LR counterpart y_t, the quantized versions of which collectively form an LR video Y, and (2) upscaling the LR video Y to arrive at the reconstructed HR video X̂. Unlike most video super-resolution tasks, which focus primarily on learning upscaling for a given downscaling method, this work optimizes jointly the downscaling and upscaling as a combined task. It has been shown in many traditional video super-resolution works [caballero2017real, tao2017detail, sajjadi2018frame, haris2019recurrent, tian2020tdan, wang2019edvr, jo2018deep, yi2019progressive, isobe2020video, li2020mucan, isobe2020vide] that the extra temporal information in videos allows the high-frequency component of a video frame lost in downscaling to be recovered to some extent. This work makes the first attempt to explore how such temporal information could assist downscaling in producing an LR video that can be upscaled to offer better super-resolution quality in an end-to-end fashion. In a sense, our focus is on both downscaling and upscaling. The objective is to minimize the distortion between X and X̂ in this combined task, while the LR video Y is regularized to offer visual quality comparable to that of the bicubic-downscaled video for human perception. It is to be noted that the LR video is not meant to be exactly the same as the bicubic-downscaled video, since enforcing this may not lead to the optimal downscaling and upscaling in our task.
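Using our own shorthand, with D, U, and Q denoting the learned downscaling, the learned upscaling, and quantization, the task can be summarized schematically as:

```latex
y_t = Q\big(D(x_t)\big), \qquad
\hat{X} = U\big(Y\big), \qquad
\min_{D,\,U}\; \mathrm{dist}\big(X, \hat{X}\big)
\;\; \text{subject to} \;\; Y \approx Y_{\mathrm{bicubic}}
```

The constraint is enforced softly through a regularization term rather than exactly, for the reason stated above.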
The reciprocality of the downscaling and upscaling operations motivates us to choose an invertible network for our task. With the superior performance of coupling layer architectures in recovering high-frequency details of LR images [xiao2020invertible], we develop our downscaling and upscaling networks, especially for video, using a similar invertible architecture (Sec. 2.3) as the basic building block.
We propose two approaches, LSTM-VRN and MIMO-VRN, to configure or extend these building blocks for joint learning of video downscaling and upscaling. Their overall architectures are depicted in Fig. 4, with detailed operations given in the following sections.
3.1 LSTM-based Video Rescaling Network
Like most video super-resolution techniques, the LSTM-VRN (Fig. 4(a)) adopts the SISO strategy to downscale HR video frames individually into their LR counterparts by the forward model of the invertible network. The operation is followed by MISO-based upscaling, which departs from the idea of drawing an input-agnostic Gaussian noise [xiao2020invertible] for the complementary high-frequency information. Specifically, we fuse the current LR frame y_t and its neighbouring frames by an LSTM-based predictive module to form an estimate ẑ_t of the missing high-frequency component at inference time. The resulting ẑ_t is fed to the inverse model together with y_t for reconstructing the HR video frame x̂_t. The fact that ẑ_t needs to be estimated from multiple LR frames determines what information should remain in the LR video to facilitate the prediction. This connects the upscaling process tightly to the downscaling process, stressing the importance of their joint optimization. In addition, we rely on the inter-branch pathways of the coupling layer in the forward model to correlate y_t and z_t in such a way that z_t can be better predicted from y_t and its neighbours.
The predictive module plays a key role in fusing information from y_t and its neighbouring LR frames. We incorporate the Spatiotemporal-LSTM (ST-LSTM) [wang2017predrnn] for propagating temporal information in both the forward and backward directions, in view of its recent success in video extrapolation tasks. Eq. (1) details the forward mode of the predictive module for time instance t, where σ is the sigmoid function, ∗ is the standard convolution, and ⊙ is the Hadamard product. Note that an attention signal guided by the current LR frame y_t combines the temporally propagated hidden information with the features of y_t to yield the output h_t^f. As Fig. 4(a) shows, the forward-propagated h_t^f is further combined with the backward-propagated h_t^b to predict ẑ_t through a 1×1 convolution.
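For reference, the standard ST-LSTM cell of PredRNN [wang2017predrnn], on which the predictive module builds, updates as follows (the attention signal described above is specific to our module and is not part of this standard form; H, C are the hidden/cell states and M is the spatiotemporal memory):

```latex
% Standard ST-LSTM cell; \sigma = sigmoid, * = convolution, \odot = Hadamard product
\begin{aligned}
g_t &= \tanh(W_{xg} * \mathcal{X}_t + W_{hg} * H_{t-1}), \quad
i_t = \sigma(W_{xi} * \mathcal{X}_t + W_{hi} * H_{t-1}), \quad
f_t = \sigma(W_{xf} * \mathcal{X}_t + W_{hf} * H_{t-1}), \\
C_t &= f_t \odot C_{t-1} + i_t \odot g_t, \\
g_t' &= \tanh(W_{xg}' * \mathcal{X}_t + W_{mg} * M_{t-1}), \quad
i_t' = \sigma(W_{xi}' * \mathcal{X}_t + W_{mi} * M_{t-1}), \quad
f_t' = \sigma(W_{xf}' * \mathcal{X}_t + W_{mf} * M_{t-1}), \\
M_t &= f_t' \odot M_{t-1} + i_t' \odot g_t', \\
o_t &= \sigma(W_{xo} * \mathcal{X}_t + W_{ho} * H_{t-1} + W_{co} * C_t + W_{mo} * M_t), \\
H_t &= o_t \odot \tanh\!\big(W_{1\times 1} * [C_t, M_t]\big)
\end{aligned}
```

The dual memories C_t (temporal) and M_t (spatiotemporal, passed across stacked layers) are what distinguish the ST-LSTM from a conventional ConvLSTM.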
For upscaling every LR video frame y_t, the proposed predictive module works in a sliding-window manner with a fixed window size (set to 7 in our experiments). That is, the forward (respectively, backward) ST-LSTM always starts from a zero state when accepting the earliest (respectively, latest) LR frame of the window. This design choice is motivated by generalization and buffering considerations: we avoid running a long ST-LSTM at inference time because the training videos are rather short, and the backward ST-LSTM introduces delay and buffering requirements.
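The window scheduling can be sketched as follows (a minimal illustration with hypothetical names; windows are clamped at sequence boundaries, and each window restarts the ST-LSTM passes from a zero state):

```python
# Sketch of the sliding-window scheduling used at upscaling time (our own
# simplification): for each target frame t, the forward and backward ST-LSTMs
# are re-run from scratch over at most k frames on either side of t.
def sliding_windows(num_frames, k=3):
    """Return, for each frame t, the forward and backward index orders
    (clamped at sequence boundaries) feeding the two ST-LSTM passes."""
    windows = []
    for t in range(num_frames):
        lo = max(0, t - k)
        hi = min(num_frames - 1, t + k)
        forward = list(range(lo, t + 1))       # past ... current
        backward = list(range(hi, t - 1, -1))  # future ... current
        windows.append((forward, backward))
    return windows
```

For example, with 7 frames and k = 3, frame 3 is predicted from the forward order [0, 1, 2, 3] and the backward order [6, 5, 4, 3], matching the 3-past/3-future window described in Sec. 4.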
Finally, we note in passing that LSTM-VRN exploits temporal information across LR video frames only for upscaling while its downscaling is still a SISO-based scheme, which does not take advantage of temporal information in HR video frames for downscaling.
3.2 MIMO-based Video Rescaling Network
Our MIMO-VRN (Fig. 4(b)) is a new attempt that adopts the MIMO strategy for both upscaling and downscaling, making explicit use of temporal information in both operations. Here, we propose a new basic processing unit, called a Group of Frames (GoF). To begin with, the HR input video is decomposed into non-overlapping groups of frames, with each group including g frames. The downscaling proceeds on a group-by-group basis; each GoF is downscaled independently of the others. Within a GoF, every HR video frame is first transformed individually by the 2-D Haar wavelet transform to arrive at its low-frequency and high-frequency components. We then group the low-frequency components in a GoF as one group-type input to the coupling layers (i.e. replacing the low-frequency branch in Fig. 3) and the remaining high-frequency components as the other group-type input (i.e. replacing the high-frequency branch in Fig. 3). Because each group-type input contains information from multiple video frames, the coupling layers are able to utilize temporal information inherent in one group-type input to update the other. With the two downscaling modules, the results are a group of quantized LR frames and the group high-frequency component. It is worth noting that, due to the nature of group-based coupling, there is no one-to-one correspondence between the signals in the two.
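As an illustration of the group formation (a sketch with our own function names; the actual model operates on learned feature tensors), the single-level 2-D Haar split and the two group-type inputs can be written as:

```python
import numpy as np

def haar2d(frame):
    """Single-level 2-D Haar transform of an HxW frame (H, W even):
    returns the low-frequency band LL and high-frequency bands (LH, HL, HH)."""
    a = frame[0::2, 0::2]
    b = frame[0::2, 1::2]
    c = frame[1::2, 0::2]
    d = frame[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, (lh, hl, hh)

def group_inputs(gof):
    """Form the two group-type coupling inputs from a GoF (list of frames):
    all low-frequency bands in one group, all high-frequency bands in the other."""
    lows, highs = [], []
    for frame in gof:
        ll, (lh, hl, hh) = haar2d(frame)
        lows.append(ll)
        highs.extend([lh, hl, hh])
    return np.stack(lows), np.stack(highs)
```

A GoF of g frames thus yields g low-frequency channels in one branch and 3g high-frequency channels in the other, which is why the coupling layers can mix temporal information within each branch.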
The upscaling also proceeds on a group-by-group basis, with the group size and the group formation fully aligned with those used for downscaling. As depicted in Fig. 4(b), we employ a residual block-based predictive module to form a prediction of the missing high-frequency components from the corresponding group of LR frames. Similar to the notion of the group-type inputs for downscaling, the LR frames and the estimated high-frequency components respectively comprise the two group-type inputs to the invertible network operated in inverse mode. With this MIMO-based upscaling, a group of HR frames is reconstructed simultaneously.
3.3 Training Objectives
The training of LSTM-VRN involves two loss functions to reflect our objectives. First, to ensure that the LR video Y is visually pleasing, we follow common practice and require Y to have similar visual quality to the bicubic-downscaled video; to this end, the LR loss of Eq. (2) penalizes the difference between Y and the bicubic-downscaled video. Second, to maximize the HR reconstruction quality, the HR loss of Eq. (3) minimizes the Charbonnier loss between the original HR video X and its reconstructed version X̂ subject to downscaling and upscaling, where a small constant ε stabilizes the loss. The total loss is the weighted sum of the two, where a hyper-parameter λ trades off the quality of the LR and HR videos.
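Eqs. (2) and (3) are not reproduced above; a plausible reconstruction consistent with the description is given below, where y_t^b denotes the bicubic-downscaled LR frame and the ℓ2 form of the LR loss is our assumption:

```latex
\mathcal{L}_{\mathrm{LR}} = \sum_{t=1}^{T} \big\| y_t - y_t^{\,b} \big\|_2^2
\qquad \text{(Eq. (2))}

\mathcal{L}_{\mathrm{HR}} = \sum_{t=1}^{T} \sqrt{\big\| x_t - \hat{x}_t \big\|_2^2 + \epsilon^2}
\qquad \text{(Eq. (3), Charbonnier loss)}

\mathcal{L} = \mathcal{L}_{\mathrm{HR}} + \lambda \, \mathcal{L}_{\mathrm{LR}}
```

The Charbonnier loss behaves like an ℓ1 penalty for large errors while remaining differentiable at zero, which is why it is widely preferred over plain ℓ2 in restoration tasks.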
MIMO-VRN. The training of MIMO-VRN shares the same LR and HR losses (Eqs. (2) and (3)) as LSTM-VRN because they have common optimization objectives. We however notice that MIMO-VRN tends to have uneven HR reconstruction quality over the video frames in a GoF (Sec. 4.4). To mitigate this quality fluctuation, we additionally introduce a center loss, Eq. (4), for MIMO-VRN.
In Eq. (4), g is the group size, ē denotes the average HR reconstruction error in a GoF, and N is the number of GoFs in a sequence. The center loss encourages the HR reconstruction error of every video frame in a GoF to approach the average level ē.
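A reconstruction of the center loss consistent with this description (the indexing is ours: e_{n,t} is the HR reconstruction error of the t-th frame in the n-th GoF) is:

```latex
\mathcal{L}_{\mathrm{C}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{g}
\big|\, e_{n,t} - \bar{e}_{n} \,\big|,
\qquad
\bar{e}_{n} = \frac{1}{g} \sum_{t=1}^{g} e_{n,t}
```

Penalizing the deviation of each frame's error from the per-GoF mean flattens the periodic quality pattern without prescribing any particular absolute error level.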
4 Experimental Results
Datasets. For a fair comparison, we follow the common test protocol and train our models on the Vimeo-90K dataset [Xue_2019]. It has 91,701 video sequences, each of which is 7 frames long. Among them, 64,612 sequences are for training and 7,824 for testing. Each sequence has a fixed spatial resolution of 448×256. The performance evaluation is done on two standard test datasets, Vimeo-90K-T and Vid4. Vid4 includes 4 video clips, each having around 40 frames.
Table 1. PSNR-Y / SSIM-Y of the reconstructed HR videos on the Vid4 sequences (Calendar, City, Foliage, Walk) and on average.

| Down | Up | Method | Calendar | City | Foliage | Walk | Average |
|---|---|---|---|---|---|---|---|
| SISO | SISO | DRN-L [guo2020closed] | 22.47 / 0.7289 | 26.25 / 0.7011 | 24.88 / 0.6681 | 28.84 / 0.8752 | 25.61 / 0.7433 |
| SISO | SISO | CAR [sun2020learned] | 24.48 / 0.8143 | 30.19 / 0.8444 | 26.98 / 0.7841 | 31.59 / 0.9250 | 28.28 / 0.8421 |
| SISO | SISO | IRN [xiao2020invertible] | 26.62 / 0.8850 | 33.48 / 0.9337 | 29.71 / 0.8871 | 35.36 / 0.9696 | 31.29 / 0.9188 |
| SISO | MISO | DUF [jo2018deep] | 24.04 / 0.8110 | 28.27 / 0.8313 | 26.41 / 0.7709 | 30.60 / 0.9141 | 27.33 / 0.8318 |
| SISO | MISO | EDVR-L [wang2019edvr] | 24.05 / 0.8147 | 28.00 / 0.8122 | 26.34 / 0.7635 | 31.02 / 0.9152 | 27.35 / 0.8264 |
| SISO | MISO | PFNL [yi2019progressive] | 24.37 / 0.8246 | 28.09 / 0.8385 | 26.51 / 0.7768 | 30.65 / 0.9135 | 27.40 / 0.8384 |
| SISO | MISO | TGA [isobe2020video] | 24.47 / 0.8286 | 28.37 / 0.8419 | 26.59 / 0.7793 | 30.96 / 0.9181 | 27.59 / 0.8419 |
| SISO | MISO | RSDN [isobe2020vide] | 24.60 / 0.8355 | 29.20 / 0.8527 | 26.84 / 0.7931 | 31.04 / 0.9210 | 27.92 / 0.8505 |
| SISO | MISO | LSTM-VRN | 27.31 / 0.9039 | 34.36 / 0.9482 | 31.13 / 0.9213 | 36.18 / 0.9742 | 32.24 / 0.9369 |
| MIMO | MIMO | MIMO-VRN | 29.23 / 0.9389 | 35.49 / 0.9573 | 33.25 / 0.9535 | 37.17 / 0.9812 | 33.79 / 0.9577 |
| MIMO | MIMO | MIMO-VRN-C | 28.83 / 0.9322 | 35.13 / 0.9544 | 32.72 / 0.9476 | 36.93 / 0.9808 | 33.40 / 0.9537 |
Implementation and Training Details. Our proposed models adopt the settings of IRN [xiao2020invertible] and consist of two downscaling modules (Fig. 3), each composed of one 2-D Haar transform and eight coupling layers. Both LSTM-VRN and MIMO-VRN have eight predictive modules (Fig. 4) replicated and stacked for a better prediction of the missing high-frequency component. The sliding window size for LSTM-VRN is set to 7, which includes the current LR video frame together with 6 neighbouring LR frames (3 from the past and 3 from the future). The GoF size for MIMO-VRN is set to 5. For data augmentation, we randomly crop training videos to 144×144 as HR inputs and use their bicubic-downscaled versions (of size 36×36) as LR ground-truths. We also apply random horizontal and vertical flipping. LSTM-VRN and MIMO-VRN share the same LR and HR training objectives (Eq. (2) and Eq. (3)), with the weighting hyper-parameter for the LR loss set to 64. The training of MIMO-VRN additionally includes the center loss (Eq. (4)), whose hyper-parameter is chosen to be 16. We use the Adam optimizer [DBLP:journals/corr/KingmaB14] with a batch size of 16 and weight decay, starting from an initial learning rate that is halved periodically during training. Our code is available online at https://ding3820.github.io/MIMO-VRN/.
Table 2. PSNR-Y / SSIM-Y of the reconstructed HR videos on Vimeo-90K-T.

| Down | Up | Method | PSNR-Y / SSIM-Y |
|---|---|---|---|
| SISO | SISO | DRN-L [guo2020closed] | 35.63 / 0.9262 |
| SISO | SISO | CAR [sun2020learned] | 37.69 / 0.9493 |
| SISO | SISO | IRN [xiao2020invertible] | 40.83 / 0.9734 |
| SISO | MISO | DUF [jo2018deep] | 36.37 / 0.9387 |
| SISO | MISO | EDVR-L [wang2019edvr] | 37.63 / 0.9487 |
| SISO | MISO | TGA [isobe2020video] | 37.59 / 0.9516 |
| SISO | MISO | RSDN [isobe2020vide] | 37.23 / 0.9471 |
| SISO | MISO | LSTM-VRN | 41.42 / 0.9764 |
| MIMO | MIMO | MIMO-VRN | 43.26 / 0.9846 |
| MIMO | MIMO | MIMO-VRN-C | 42.53 / 0.9820 |
Baselines. We include three categories of baselines for comparison: (1) SISO-down-SISO-up with predefined downscaling kernels (e.g. DRN-L [guo2020closed]), (2) SISO-down-SISO-up with jointly optimized downscaling and upscaling (e.g. CAR [sun2020learned] and IRN [xiao2020invertible]), and (3) SISO-down-MISO-up with predefined downscaling kernels (e.g. DUF [jo2018deep], EDVR-L [wang2019edvr], PFNL [yi2019progressive], TGA [isobe2020video], and RSDN [isobe2020vide]). The first two categories perform video downscaling and upscaling on a frame-by-frame basis. The third category includes the state-of-the-art video super-resolution methods, where the predefined downscaling is done frame-by-frame and the learned upscaling is MISO-based. The predefined downscaling uses bicubic interpolation. It is to be noted that the methods adopting learned downscaling perform upscaling on their respective LR videos, which are not the same as the bicubic-downscaled videos. The results for the methods in categories (1) and (2) are produced using the pre-trained models released by the authors. Those in category (3) are taken from the papers, since these baselines share exactly the same setting as ours. We report results for a downscaling/upscaling factor of 4 only, following the common setting for video rescaling.
For quantitative comparison, we adopt the standard test protocol of the super-resolution tasks and evaluate the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) [wang2004image] on the Y channel, denoted respectively by PSNR-Y and SSIM-Y.
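For reference, PSNR on the Y channel is typically computed as below (a sketch; the exact RGB-to-Y conversion and any border cropping vary across benchmarks):

```python
import numpy as np

def rgb_to_y(rgb):
    """BT.601 luma commonly used by SR benchmarks; rgb is float in [0, 255]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(ref_rgb, dist_rgb, peak=255.0):
    """PSNR between the Y channels of a reference and a distorted image/video."""
    y_ref, y_dist = rgb_to_y(ref_rgb), rgb_to_y(dist_rgb)
    mse = np.mean((y_ref - y_dist) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

For videos, the per-frame PSNR-Y values are averaged over the sequence, which is the convention assumed in the tables below.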
4.2 Comparison of Quantitative Results
Tables 1 and 2 report the PSNR-Y and SSIM-Y results of the reconstructed HR videos on Vid4 and Vimeo-90K-T. Table 3 summarizes the results for the downscaled videos. The following observations are immediate:
(1) Optimizing jointly video downscaling and upscaling improves the HR reconstruction quality. This is confirmed by the fact that LSTM-VRN achieves considerably higher PSNR-Y (32.24dB on Vid4 and 41.42dB on Vimeo-90K-T) than the baselines with video super-resolution methods for upscaling (27.33-27.92dB on Vid4 and 36.37-37.59dB on Vimeo-90K-T) [wang2019edvr, jo2018deep, yi2019progressive, isobe2020video, isobe2020vide], which adopt the same SISO-down-MISO-up strategy yet with a predefined downscaling kernel. We note that the image-based joint optimization schemes, e.g. IRN [xiao2020invertible] and CAR [sun2020learned], achieve better HR reconstruction quality than the traditional video-based baselines, even without using temporal information for upscaling. The superior performance of joint optimization schemes is attributed to the fact that they can better embed HR information in LR frames for upscaling.
(2) Incorporating temporal information in the LR video improves further on the HR reconstruction quality. The result is evidenced by the 0.95dB and 0.59dB PSNR-Y gains of LSTM-VRN over IRN [xiao2020invertible] on Vid4 and Vimeo-90K-T. Both share a similar invertible network for downscaling, but our LSTM-VRN additionally leverages information from multiple LR video frames to predict the high-frequency component of a video frame during upscaling.
(3) MIMO-VRN achieves the best PSNR-Y/SSIM-Y results. It outperforms LSTM-VRN by 1.55dB and 1.84dB in PSNR-Y on Vid4 and Vimeo-90K-T, respectively, while LSTM-VRN already shows a significant improvement over the other baselines. The inclusion of the center loss (see MIMO-VRN-C) causes a modest decrease in PSNR-Y/SSIM-Y but helps to alleviate the quality fluctuation in both the resulting LR and HR videos (Sec. 4.4). These results highlight the benefits of incorporating temporal information into both downscaling and upscaling in an end-to-end optimized manner.
(4) Both LSTM-VRN and MIMO-VRN produce visually-pleasing LR videos. Table 3 shows that the LR videos produced by our models have a PSNR-Y of more than 40dB when compared against the bicubic-downscaled videos. This together with the SSIM-Y results suggests that they are visually comparable to the bicubic-downscaled videos, as is also confirmed by the subjective quality comparison in Fig. 5 and the supplementary document.
4.3 Comparison of Qualitative Results
Fig. 6 presents a qualitative comparison on Vid4. As shown, our models produce higher-quality HR video frames with much sharper edges and finer details, whereas the other methods show blurry image quality and fail to recover image details. From Fig. 5, our downscaling models produce results visually comparable to the bicubic downscaling method, which indicates the visually-pleasing property of our LR videos. The reader is referred to our project page for more results.
Table 3. PSNR-Y / SSIM-Y of the downscaled LR videos, measured against the bicubic-downscaled videos, on Vid4 and Vimeo-90K-T.

| Method | Vid4 | Vimeo-90K-T |
|---|---|---|
| IRN [xiao2020invertible] | 40.77 / 0.9908 | 46.24 / 0.9956 |
| LSTM-VRN | 42.36 / 0.9940 | 47.14 / 0.9968 |
| MIMO-VRN | 45.05 / 0.9965 | 49.11 / 0.9975 |
| MIMO-VRN-C | 45.51 / 0.9969 | 49.34 / 0.9976 |
4.4 Ablation Experiments
Temporal Propagation Methods in LSTM-VRN. Table 4 presents results for three temporal propagation schemes in LSTM-VRN. The first runs the LSTM in the forward direction without reset. The second and third implement the proposed method with uni- and bi-directional propagation, respectively. We see that the sliding-window-based reset is advantageous to the HR reconstruction quality. This may be attributed to the fact that the training videos in Vimeo-90K are rather short; when trained on Vimeo-90K, the first variant may not generalize well to the unseen long videos in Vid4. As expected, with access to both the past and future LR frames, the bi-directional propagation performs better than the uni-directional one (i.e. Fig. 4(a) without the backward path).
GoF Size. Table 5 studies the effect of the GoF size on MIMO-VRN's performance. The setting with a GoF size of 1 reduces to a SISO-down-SISO-up method, which is similar to IRN [xiao2020invertible] except that it introduces a prediction of the high-frequency component from the LR video frame. For a fair comparison, we re-train IRN [xiao2020invertible] on Vimeo-90K and denote the re-trained model by IRN_Ret. Note that the pre-trained IRN [xiao2020invertible] performs better than IRN_Ret since it is trained on a different (image-based) dataset. We see that the GoF-1 setting and IRN_Ret show comparable performance, especially on the HR videos. This suggests that, without additional temporal information, the prediction of the high-frequency component from the LR video is ineffective. However, increasing the GoF size, which involves more temporal information in downscaling and upscaling, improves the quality of the HR video significantly. A GoF size of 5 is seen to be the best setting.
Center Loss. Fig. 7 visualizes the PSNR-Y of the HR and LR videos produced by MIMO-VRN as functions of time. Without the center loss (see MIMO-VRN), the PSNR-Y of both the HR and LR videos fluctuates periodically by as much as 2dB. Observe that the crest points of the HR video occur roughly at the GoF centers while the trough points are at the GoF boundaries. Table 6 performs an ablation study of how this center loss would affect the HR and/or LR videos when it is imposed on these videos. We observe that introducing the center loss largely mitigates the quality fluctuation in the corresponding HR and/or LR video (see the MAD results in Table 6 and Fig. 7). It however degrades the HR and/or LR quality in terms of PSNR-Y, as compared to the case without the loss. We make the choice of imposing the center loss on the HR video only for two reasons. First, this leads to a minimal impact on the HR reconstruction quality. The second is that the quality fluctuation in the LR video is less problematic in terms of subjective quality because the PSNR-Y measured against the bicubic-downscaled video is way above 40dB. On closer visual inspection, these LR videos hardly show any artifacts in the temporal dimension.
4.5 Complexity-performance Trade-offs
LSTM-VRN and MIMO-VRN present different complexity-performance trade-offs. (1) LSTM-VRN is relatively lightweight, having 9M network parameters as compared to MIMO-VRN's 19M. (2) LSTM-VRN does not require the additional buffering/delay and storage for downscaling that MIMO-VRN does. (3) MIMO-VRN has better LR/HR quality, while LSTM-VRN has temporally more consistent LR/HR quality. Which to use depends on the complexity constraints and performance requirements of the application.
5 Conclusion
This work presents two joint optimization approaches to video rescaling. Both incorporate an invertible network with coupling layer architectures to model explicitly the high-frequency component inherent in the HR video. While our LSTM-VRN shows that the temporal information in the LR video can be utilized to good advantage for better upscaling, our MIMO-VRN demonstrates that GoF-based rescaling is able to make full use of temporal information to benefit both upscaling and downscaling. Our models demonstrate superior quantitative and qualitative performance to the image-based invertible model, and they outperform, by a significant margin, the video rescaling framework without joint optimization.
Acknowledgments. This work is supported by Qualcomm Technologies, Inc. (NAT-439543), the Ministry of Science and Technology, Taiwan (109-2634-F-009-020), and the National Center for High-performance Computing, Taiwan.
Appendix A Quantitative results based on VMAF
This section presents quantitative results for the upscaled and downscaled videos in Vid4 based on the Video Multi-Method Assessment Fusion (VMAF) metric. VMAF is an objective video quality metric shown to correlate highly with human perception. It accepts as inputs a reference video and a distorted video, and outputs a score between 0 and 100; the higher the VMAF score, the more closely the two input videos match each other. Compared in Table 7 are the VMAF scores of the reconstructed high-resolution (HR) videos produced by a few baselines (whose code is available) and our schemes, with the original HR videos as references. Likewise, Table 8 shows the results for the downscaled videos, where the reference videos are the bicubic-downscaled videos.
From Table 7, we observe that our LSTM-VRN, MIMO-VRN and MIMO-VRN-C outperform IRN [xiao2020invertible] (image-based joint optimization scheme) consistently. As compared with EDVR-L [wang2019edvr], a traditional video super-resolution approach, the proposed methods show significantly improved VMAF scores. These observations are in line with the PSNR/SSIM results reported in Table 1 of the main paper. A similar observation can be made regarding the results of the downscaled LR videos in Table 8, which agrees with the PSNR/SSIM results in Table 3 of the main paper. Remarkably, all three proposed methods have VMAF scores very close to 100, suggesting that their LR videos are visually similar to bicubic-downscaled videos.
| Method | Predictive Module | HR (PSNR-Y / SSIM-Y / VMAF) |
|---|---|---|
| IRN_Ret | ✗ | 30.72 / 0.9087 / 86.37 |
| LSTM-VRN | ✓ | 32.24 / 0.9369 / 88.04 |
| MIMO-VRN-C-Zero | ✗ | 33.04 / 0.9575 / 89.00 |
| MIMO-VRN-C | ✓ | 33.40 / 0.9609 / 89.64 |
Appendix B Temporal consistency
In this section, the temporal consistency of the downscaled and upscaled videos is examined. We follow RSDN [isobe2020vide] to extract a row or a column of pixels at the co-located positions in consecutive video frames. We then stitch vertically (or horizontally) these extracted rows (or columns) of pixels to form an image, in order to visualize their variations in the temporal dimension. From Fig. 8, we are not aware of any noticeable inconsistency across reconstructed HR video frames. Moreover, the resulting images of our methods resemble closely the ground-truths.
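The row/column extraction described above can be sketched as follows (function name and array layout are our own):

```python
import numpy as np

def temporal_profile(frames, row=None, col=None):
    """Stack the same row (or column) of pixels from consecutive frames into a
    2-D profile image, as in RSDN's temporal-consistency visualization.
    frames: array of shape (T, H, W); specify exactly one of row/col."""
    if (row is None) == (col is None):
        raise ValueError("specify exactly one of row or col")
    if row is not None:
        # Rows stitched vertically: time runs top to bottom -> shape (T, W)
        return np.stack([f[row, :] for f in frames])
    # Columns stitched horizontally: time runs left to right -> shape (H, T)
    return np.stack([f[:, col] for f in frames]).T
```

Any temporal flicker or aliasing then appears as jagged or discontinuous streaks along the time axis of the resulting profile image.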
In Fig. 9, IRN [xiao2020invertible] and LSTM-VRN show slight temporal inconsistency in some areas of the low-resolution (LR) videos, particularly in the Calendar and City sequences. This is evidenced by aliasing artifacts, especially on the lettering of the Calendar sequence and the building of the City sequence. Nevertheless, such inconsistency does not appear in the LR videos produced by MIMO-VRN and MIMO-VRN-C.
Appendix C The effectiveness of the predictive module
Table 9 provides quantitative results to justify the effectiveness of the proposed predictive module. Recall that the predictive module forms a prediction ẑ of the missing high-frequency component from the LR video frames. This feature distinguishes our schemes from IRN [xiao2020invertible], the image-based joint optimization scheme, which draws a Gaussian noise as a substitute for the high-frequency component. The upper section of Table 9 compares LSTM-VRN with IRN_Ret (the re-trained IRN that uses the same training dataset as LSTM-VRN), validating that the predictive module is effective in reconstructing better HR videos. The lower section of Table 9 conducts the same analysis for MIMO-VRN-C, where we replace the ẑ produced by the predictive module with a fixed zero tensor, a scheme termed MIMO-VRN-C-Zero. MIMO-VRN-C-Zero is trained in the same way as MIMO-VRN-C. We see that the predictive module is still effective, even though the gain is less significant than in the case of LSTM-VRN.
Appendix D More qualitative results
Fig. 12 presents a frame-by-frame qualitative comparison between MIMO-VRN and MIMO-VRN-C, with a GoF size of 5. MIMO-VRN-C comes with an additional center loss to ensure temporal consistency. It is seen that both MIMO-VRN and MIMO-VRN-C can successfully reconstruct image details. Comparing MIMO-VRN with MIMO-VRN-C, we are not aware of any significant quality variation in the temporal dimension, even though Fig. 7 of the main paper suggests that the HR quality of MIMO-VRN may fluctuate more significantly than MIMO-VRN-C. Fig. 13 displays more LR video frames from Vid4, showing that our models offer comparable visual quality to the bicubic-downscaled videos.