In video super-resolution, the spatio-temporal coherence between and among frames must be exploited appropriately for accurate prediction of the high resolution frames. Although 2D convolutional neural networks (CNNs) are powerful in modelling images, 3D-CNNs are more suitable for spatio-temporal feature extraction as they can preserve temporal information. To this end, we propose an effective 3D-CNN for video super-resolution, called 3DSRnet, that does not require motion alignment as preprocessing. Our 3DSRnet maintains the temporal depth of spatio-temporal feature maps to maximally capture the temporally nonlinear characteristics between low and high resolution frames, and adopts residual learning in conjunction with the sub-pixel outputs. It outperforms the best state-of-the-art method by an average of 0.45 dB and 0.36 dB in PSNR for scales 3 and 4, respectively, on the Vidset4 benchmark. Our 3DSRnet is also the first to address the performance drop due to scene change, which is important in practice but has not been previously considered.
Official repository of 3DSRnet (ICIP 2019)
Vision is one of the most primitive yet sophisticated sensory systems, continuously stimulated not only by natural scenes but also by electronic displays. With display hardware now commercially reaching the resolution of 8K Ultra High Definition (UHD), and people’s rising expectations of such visuals, the demand for higher quality videos is at its highest. However, the advancement of display technologies alone is not sufficient to offer high quality visual content to users - the content itself has to be of higher resolution. Although such content can be obtained with high-end filming equipment, this is costly and problematic due to the large storage and transmission bandwidth required.
Super-resolution (SR) is an imaging technique that transforms low resolution (LR) images to higher resolution ones. When an LR image is given as input, an SR algorithm exploits its internal information to generate an output image, hopefully similar to its high resolution (HR) counterpart. This is regarded as an ill-posed problem since multiple HR images correspond to a single LR image. Non-existent, but reasonable, information should be created within the image when going from LR to HR, and finding a high quality image among the possible solutions is the key to the SR problem.
Although SR is a popular problem in image processing and computer vision, most studies have focused on single image SR rather than multi-frame SR, also referred to as video SR. However, many SR applications involve videos, where the reconstruction of HR frames may benefit from additional information contained in the previous and future LR frames. While video frames exhibit high temporal coherence, camera or object motion can also provide a different angle or scale of parts of the current frame in the surrounding consecutive frames, which can be effectively utilized as crucial clues in constructing high quality HR frames.
A video SR algorithm should fully exploit the temporal relations between consecutive frames and aggregate them with the spatial information. To this end, we propose an effective 3D convolutional neural network (CNN) for video SR, called 3DSRnet, that requires neither motion estimation nor compensation to interpret the spatio-temporal information in consecutive frames. Instead, it learns an end-to-end nonlinear spatio-temporal mapping through residual learning, and lowers complexity by using the multi-channel output structure introduced in. Our 3DSRnet outperforms the previous video SR methods [2, 3] by an average of at least 0.36 dB in PSNR on the Vidset4 benchmark test dataset. To the best of our knowledge, it is also the first video SR method that can effectively deal with scene change in the input frames.
Single image SR attempts to develop an HR image from a single LR image. Past attempts to tackle this problem include internal and external example-based methods [4, 5, 6, 7, 8]. The former include a method devised by Glasner et al., which identifies internal redundancies in an image to obtain essential information for upscaling its patches. The external example-based methods try to find a dictionary mapping [4, 5, 8]. Another type of approach is through sparse representation, applied successfully by Yang et al.
With the recent rise of deep learning and the excellent performance of CNNs in image classification, the first CNN structure for SR was proposed by Dong et al. [11, 10] as a simple 3-layer network. Their model, called SRCNN, demonstrated the great potential of CNNs for SR applications. Since then, CNN-based structures have boasted superior performance. One highly successful CNN-based SR method is the very deep super-resolution method (VDSR) proposed by Kim et al. VDSR has as many as twenty convolution layers and was the first to adopt residual learning to train a deep SR network. However, both SRCNN and VDSR start from an LR image enlarged with a bicubic filter as input to the first convolution layer. Consequently, the convolution operations take place on the enlarged input, which leads to high computational complexity. An inspirational work by Shi et al. suggested a sub-pixel CNN that finds a direct transform from the LR image, using the fact that convolution layers can produce multiple channels at the output. With this multi-channel output structure, the HR image can be obtained through a simple reordering of the output pixels. Our 3DSRnet employs this multi-channel output structure with residual learning as in.
Video SR, or multi-frame SR, assumes that the input is a series of consecutive frames at each time instance of a video sequence. Undoubtedly, single image SR algorithms may be applied to the individual frames of a video, and this may even be more efficient in some cases if they achieve real-time performance as in. However, more spatial information is available in the case of videos, since not only the current LR frame but also its surrounding consecutive LR frames may be utilized. This means that, to fully profit from what is given, the temporal relation of the spatial information has to be carefully taken into account in reconstructing the corresponding HR frame.
Compared to image SR, relatively few studies have been conducted on video SR. Focusing on neural-network-based methods, Kappeler et al. extended SRCNN to 2D-CNN architectures that combine information from neighboring frames. Caballero et al. proposed three video SR architectures, where the early and slow fusion architectures deal with the multi-frame input in a similar way to [3, 13]. The structures in  all adopt the same multi-channel output structure as in. The third model is a 3D-CNN architecture that first incorporated 3D convolution filters into video SR to capture the temporal information of multiple frames. This model was a conceptual suggestion without specific configuration information, in which the temporal depth of the feature maps shrinks to one in the early convolution layers, after which 2D convolutions are performed. No performance comparison of the 3D-CNN architecture  against other previous methods was provided, due to its lower SR performance compared to the early and slow fusion architectures.
In comparison, our 3DSRnet maintains the temporal depth of the spatio-temporal feature maps towards deeper layers to maximally capture the temporally nonlinear characteristics between LR and HR frames, and we provide intensive experiments and analysis on 3DSRnet in the later sections of this paper. All three video SR architectures in  and the video SR method in  need motion alignment among the multiple input frames, whereas our 3DSRnet directly takes the input frames without any motion alignment. Moreover, none of the previous CNN-based video SR methods has considered scene change, while our 3DSRnet is the first to incorporate a scene change detection network that locates a scene change boundary in the multiple input frames and replaces the frames of a different scene with the temporally closest frame belonging to the same scene as the current frame.
2D-CNNs, commonly used for images, are powerful structures for modelling images thanks to their spatial feature extraction capability. However, when another axis, time, is introduced as in videos, we argue that 3D-CNNs are a more suitable option for spatio-temporal feature extraction. This is in line with Tran et al., who argued that the 3D-CNN is an effective video descriptor, and with , which demonstrated the spatial and temporal feature extraction capability of 3D-CNNs. 3D-CNNs have been successfully implemented in high level vision tasks for videos such as action/object recognition and scene/event classification [14, 15, 16]. We believe they are also effectively applicable to low level vision tasks for videos such as video SR. In this paper, we adopt the 3D-CNN and design an elaborate video SR network, 3DSRnet, which makes 3D convolution effective for video SR without requiring motion alignment, thanks to its spatio-temporal feature representation ability.
We propose a 3D-CNN architecture for video SR, named 3DSRnet, with an additional module that handles inputs in which scene changes occur. The 3DSRnet consists of two subnets:
(i) Video SR subnet, and
(ii) Scene change detection and frame replacement (SF) subnet.
The video SR subnet takes a series of consecutive LR input frames in a sliding time window, and produces an HR output frame corresponding to the middle frame in the sliding time window. The SF subnet of the 3DSRnet is responsible for the detection of scene change in the sliding time window, and replaces the frames of a different scene with the temporally closest frame that belongs to the same scene as the middle frame.
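The sliding time window described above can be sketched in a few lines of numpy; note that clamping window indices at the sequence borders is an assumption for illustration, as the paper does not state its border policy:

```python
import numpy as np

def sliding_windows(frames, size=5):
    """Yield (window, center_index) pairs: each window stacks `size`
    consecutive frames centered on the frame to be super-resolved.
    Border frames are handled by clamping indices (an assumption)."""
    half = size // 2
    n = len(frames)
    for t in range(n):
        idx = [min(max(i, 0), n - 1) for i in range(t - half, t + half + 1)]
        yield np.stack([frames[i] for i in idx]), t

# Eight dummy 4x4 frames whose pixel values equal their frame index.
frames = [np.full((4, 4), v, dtype=np.float32) for v in range(8)]
windows = list(sliding_windows(frames))
```

Each window here is the 3D input from which the video SR subnet predicts the HR version of the middle frame.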
The video SR subnet is composed of 3D convolution layers in which 3D filters of size 3×3×3 are applied to an input composed of multiple consecutive frames or feature maps. Unlike 2D filters of size 3×3, which are applied over the full depth of the input and slid horizontally and vertically, 3D filters have a third size parameter, depth, so that they are swept horizontally, vertically and depth-wise. The first 3D convolution layer takes a series of five consecutive frames in a sliding input window, where each 3D filter generates a temporal feature map (TFM) at the corresponding frame position, each filter yielding a group of temporal feature maps (GTFMs) from all the input frame positions. This is illustrated in Fig. 1, where N 3D filters of depth 3 are applied to the input composed of five frames to produce N GTFMs. The temporal depth of each GTFM is 3. Formally, the n-th GTFM before activation in the first 3D convolution layer is given by
G_n^{(1)} = W_n^{(1)} * v + b_n^{(1)},

where W_n^{(1)} is the n-th 3D filter of size 3×3×3 in the first layer, * denotes 3D convolution, b_n^{(1)} is the bias, and v is the input of five consecutive frames.
From the second to the last 3D convolution layer, the input window corresponds to the whole set of multiple GTFMs, where each GTFM is generated by one 3D filter of the previous convolution layer. The temporal information contained within the input window is preserved through each 3D convolution layer as separate TFMs in each GTFM, unlike 2D convolution layers, where the input would be collapsed into one single feature map per filter. From the second to the last 3D convolution layer, the input v is composed of m GTFMs, and the n-th GTFM before activation is given by

G_n^{(l)} = Σ_{m} W_{n,m}^{(l)} * G_m^{(l-1)} + b_n^{(l)},

where W_{n,m}^{(l)} is the part of the n-th 3D filter of layer l applied to the m-th input GTFM G_m^{(l-1)}.
The temporal depth of the GTFMs becomes shallower as the network gets deeper, as the 3D filters integrate the temporal information. For example, with an input of five frames and 3D filters of depth 3, the output GTFMs would have depth 1 after only two 3D convolution layers, making the network no different from a 2D-CNN from that point on. Since the purpose of using 3D-CNNs is to operate on the temporal information, thereby introducing temporal nonlinearities, extrapolating (or padding) the input GTFMs at their front and back ends preserves the temporal depth throughout the network. However, for the last layers, no extrapolation is performed and the temporal information is aggregated to produce the final 2D HR frame as intended. For an input of five frames and 3D filters of depth 3, no extrapolation is carried out from layer L−1, where L is the number of convolution layers. This naturally aggregates the temporal information when going deeper in the network. Please refer to Fig. 2 for a detailed illustration of the 3D convolution layers with extrapolation.
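The effect of temporal extrapolation on the GTFM depth can be illustrated with a naive numpy sketch (single channel, no bias or activation; `conv3d_valid` and `extrapolate_t` are illustrative helpers, not the paper's implementation):

```python
import numpy as np

def conv3d_valid(x, w):
    """Naive 'valid' 3D convolution (cross-correlation, as in CNNs)
    of a single-channel volume x (T, H, W) with a filter w (t, h, w)."""
    T, H, W = x.shape
    t, h, wd = w.shape
    out = np.zeros((T - t + 1, H - h + 1, W - wd + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(x[i:i + t, j:j + h, k:k + wd] * w)
    return out

def extrapolate_t(x, pad=1):
    """Zero-fill extrapolation along the temporal axis only, so that
    a depth-3 filter preserves the temporal depth of the GTFM."""
    return np.pad(x, ((pad, pad), (0, 0), (0, 0)))

x = np.random.rand(5, 8, 8).astype(np.float32)  # five input frames
w = np.ones((3, 3, 3), dtype=np.float32)        # one 3x3x3 filter

shrunk = conv3d_valid(x, w)               # temporal depth 5 -> 3
kept = conv3d_valid(extrapolate_t(x), w)  # temporal depth stays 5
```

Without extrapolation, two such layers would already collapse the temporal depth to 1; with it, the depth is held constant until the final aggregation layers.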
The multi-channel output structure, first introduced in , allows a direct mapping from the LR to the HR frame by producing an output with multiple channels that can simply be reordered and reshaped into the final HR output. This reduces the amount of computation, which can otherwise be expensive for 3D-CNNs. Furthermore, it can enhance SR performance because the receptive field over LR input pixels without bicubic up-scaling is larger than that over up-scaled LR input pixels, provided that the filter size and network depth are the same. Large receptive fields are essential for high SR performance [12, 18, 19, 20].
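The reordering of the multi-channel output into an HR frame (the sub-pixel, or depth-to-space, operation) can be sketched as follows; this single-frame numpy sketch assumes the usual pixel-shuffle channel layout:

```python
import numpy as np

def depth_to_space(x, scale):
    """Reorder a (scale**2, H, W) multi-channel output into one
    (H*scale, W*scale) frame, as in the sub-pixel output structure."""
    c, h, w = x.shape
    assert c == scale * scale
    x = x.reshape(scale, scale, h, w)  # (s, s, H, W)
    x = x.transpose(2, 0, 3, 1)        # (H, s, W, s)
    return x.reshape(h * scale, w * scale)

# Four 2x2 channels -> one 4x4 HR frame for scale 2.
out = np.arange(4 * 2 * 2, dtype=np.float32).reshape(4, 2, 2)
hr = depth_to_space(out, 2)
```

Each output pixel block of size scale×scale is filled from the scale² channels at the corresponding LR position, so no convolution ever runs at HR resolution.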
HR frames consist of low and high frequency components. However, the low frequency components are mostly already present in the LR frames, meaning that the essential goal of an SR algorithm is to predict the missing high frequency components. Therefore, the network can save the trouble of predicting what is already there by directly predicting the difference between the HR frame and the corresponding bicubic-up-scaled LR frame - the residual frame. Our 3DSRnet employs this technique and predicts the residual frame, producing a multi-channel residual output. Residual learning was first proposed in  and applied to SR in . It also eases training by mitigating the vanishing and exploding gradient problems, which can be critical in training neural networks. Fig. 3 shows the input and output structures of the 3DSRnet.
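The residual reconstruction amounts to adding the predicted residual to an up-scaled copy of the LR middle frame. In this minimal sketch, nearest-neighbour up-scaling via `np.kron` stands in for the bicubic filter used in the paper, to keep the example dependency-free:

```python
import numpy as np

def upscale_nearest(lr, scale):
    """Cheap stand-in for bicubic up-scaling: each LR pixel is
    repeated into a scale x scale block via a Kronecker product."""
    return np.kron(lr, np.ones((scale, scale), dtype=lr.dtype))

def reconstruct(lr_mid, residual, scale=2):
    """HR frame = up-scaled LR middle frame + predicted residual,
    so the network only has to learn the missing high frequencies."""
    return upscale_nearest(lr_mid, scale) + residual

lr = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
res = np.zeros((4, 4), dtype=np.float32)  # a zero residual for illustration
hr = reconstruct(lr, res)
```

With a zero residual the output is simply the up-scaled LR frame; in training, the network's multi-channel output supplies the residual instead.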
The scene change detection and frame replacement (SF) subnet is the other component of the 3DSRnet. When multiple frames are used as input to a video SR network, a scene change may occur within them. In this case, the performance of the video SR algorithm drops because frames of different scenes are included in the convolution, resulting in reconstructed HR frames of poor quality. Previous video SR methods avoided this problem by explicitly collecting data without scene changes, which is impractical in real world applications. Our 3DSRnet handles the scene change problem by introducing the SF subnet, which classifies the exact location of the scene boundary and modifies the sliding input window by replacing the frames of a different scene with the temporally closest frame belonging to the same scene as the current frame. Although the duplicated (replacement) frames do not contain any new information, this method significantly helps the 3DSRnet alleviate the performance degradation caused by an input contaminated with frames from a different scene.
If we assume that at most one scene change may occur within the five input frames, there are four possible scene change locations (labels), and a fifth label is designated for no scene change. This makes it a simple five-class classification problem. Fig. 4 illustrates the detailed mechanism of the SF subnet for a sliding time window of five consecutive input frames. The SF subnet should be lightweight, as it is used optionally alongside video SR, yet accurate enough to correctly modify the input. Therefore, we use a shallow 2D-CNN structure. It is trained separately from the video SR subnet.
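The frame replacement step driven by the classifier's label can be sketched as below. The exact label convention (which boundary each of the four scene-change labels denotes) is an assumption for illustration, since the text only specifies that there are five classes:

```python
import numpy as np

def replace_frames(window, label, mid=2):
    """Given a 5-frame window and the SF subnet's predicted label
    (0-3: scene change between frame `label` and frame `label`+1;
    4: no scene change -- label convention assumed), replace frames
    that belong to a different scene than the middle frame with the
    temporally closest frame of the middle frame's scene."""
    window = list(window)
    if label == 4:
        return window
    if label < mid:                       # change before the middle frame
        for i in range(label + 1):
            window[i] = window[label + 1]
    else:                                 # change at/after the middle frame
        for i in range(label + 1, len(window)):
            window[i] = window[label]
    return window

# Dummy frames whose pixel values equal their frame index.
frames = [np.full((2, 2), v) for v in range(5)]
fixed = replace_frames(frames, 1)  # change between frames 1 and 2
```

Here frames 0 and 1 (the old scene) are replaced by frame 2, the closest frame of the middle frame's scene, while frames 2-4 pass through untouched.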
Table 1. Training sets for the video SR subnet, summarizing the numbers of subimages and frames for the Type 1 and Type 2 video sets.
Table 2. Layer configurations of the 2D-CNN baseline, the 3DSRnet variants v1-v3, and the final 3DSRnet. Each entry gives the filter type and the number of filter channels (input, output).

| Layers | 2D-CNN | 3DSRnet v1 | 3DSRnet v2 | 3DSRnet v3 | 3DSRnet |
|---|---|---|---|---|---|
| 1 | 2D (5, 32) | 3D (1, 32) | 3D (1, 32) | 3D (1, 32) | 3D (1, 32) |
| 2 | 2D (32, 64) | 3D (32, 32) | 3D (32, 32) | 3D (32, 32) | 3D (32, 32) |
| 3 | 2D (64, 64) | 3D (32, 16) | 3D (32, 32) | 3D (32, 32) | 3D (32, 32) |
| 4 | 2D (64, 64) | 2D (80, 64) | 3D (32, 16) | 3D (32, 32) | 3D (32, 32) |
| 5 | 2D (64, 35) | 2D (64, 32) | 2D (80, 64) | 3D (32, 16) | 3D (32, 32) |
| 6 | 2D (35, 4) | 2D (32, 4) | 2D (64, 4) | 2D (80, 4) | 2D (32, 4) |
| 2D filter size | 3×3 | 3×3 | 3×3 | 3×3 | 3×3 |
| 3D filter size | – | 3×3×3 | 3×3×3 | 3×3×3 | 3×3×3 |
Table 3. PSNR comparison on Vidset4 (scale 4) among Bayesian, Deep-DE, VSRnet, Liu et al., VESPCN, and our 3DSRnet.
A training or testing data sample of the 3DSRnet is composed of five bicubic-down-scaled LR frames and a single HR middle frame. We collected two sets of 3840×2160 UHD videos of 30 fps that were encoded at bit rates of at least 100 Mb/s using an H.264/AVC encoder. The first video set (Type 1) shows spatially complex scenes, meaning that they contain sophisticated objects such as the bird's-eye view of a city, and the second video set (Type 2) is temporally complex, meaning that there is a lot of motion. We collected three Type 1 videos with a total of 8,504 frames and one Type 2 video of 8,655 frames. They were converted into 4:2:0 YUV format and only the Y channel was used as the training and test data. When reconstructing color frames, the U and V channels were simply up-scaled using a bicubic filter.
For training the video SR subnet of the 3DSRnet, we prepared two datasets, smallSet and largeSet, in which a predefined number of non-overlapping subimages were randomly selected from frames sampled with a frame stride from the Type 1 and Type 2 sets. For a fair comparison with other video SR methods, the video SR subnet was trained on a training dataset without scene changes. Table 1 summarizes the training sets for the video SR subnet of the 3DSRnet. The sizes of the LR subimages for the scale factors 2, 3 and 4 were 80×80, 60×60 and 40×40 for the smallSet and 80×80, 60×60 and 45×45 for the largeSet, respectively. Training took around three days with the smallSet and eight days with the largeSet on an Nvidia TITAN X GPU for a scale factor of 2. For the comparison between the video SR subnet of the 3DSRnet and its variants, the test set contains data samples from scenes that are not included in the training set. To compare the video SR subnet of the 3DSRnet with the state-of-the-art SR methods, we used the Vidset4 dataset, a commonly used test set for videos.
For training the SF subnet of the 3DSRnet, a separate dataset was created to contain scene changes. The LR frames of different scenes from the smallSet were reduced by a factor of 40 to a size of 48×27, and randomly concatenated to create inputs containing scene changes. For each of the five classes, 2,000 data samples, each consisting of the frames and a label, were randomly selected to make the final training data of 10,000 samples.
The video SR subnet is trained to minimize the mean squared loss between the predicted frame ŷ and the ground truth frame y, given by

L(θ) = (1/n) Σ_{i=1}^{n} ‖ŷ_i − y_i‖²,

where the prediction ŷ_i is computed from the input frames x_i with the set of model parameters θ, and n is the number of data samples. The gradient with respect to the prediction is then proportional to the difference between ŷ and y. All weights were initialized with Xavier initialization using both the number of input and output neurons of the layer. The parameters were updated using Adam.
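The loss and its gradient can be sketched as below; the 1/n scaling of the gradient corresponds to a ½-scaled mean squared loss, an assumption made so that the per-sample gradient is exactly ŷ − y:

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared loss between predicted and ground-truth frames."""
    return np.mean((pred - target) ** 2)

def mse_grad(pred, target):
    """Gradient of a 1/2-scaled MSE w.r.t. the prediction: the
    (scaled) difference between prediction and ground truth."""
    return (pred - target) / pred.size

pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[1.0, 2.0], [3.0, 5.0]])
loss = mse_loss(pred, target)
grad = mse_grad(pred, target)
```

In practice the frames are the multi-channel residual outputs, and the parameter updates are then taken by Adam from these gradients via backpropagation.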
All 3D filter sizes were empirically set to 3×3×3, all 2D filter sizes to 3×3, and the number of filters to 64 unless otherwise mentioned. The network is composed of six convolution layers, considering the tradeoff between performance and complexity. The learning rate was set to 5 for the smallSet and for the largeSet, and the learning rates of the biases were 10 times smaller. For all network models, the weight decay was set to 5 for the filters and to zero for the biases. The mini-batch size was 32 for the smallSet and 64 for the largeSet. All models were implemented using the MatConvNet package, and the 3D convolution layers were added using a Matlab mex implementation available on GitHub (https://github.com/pengsun/MexConv3D).
The video SR subnet of the 3DSRnet takes a 3D input (multiple LR input frames) in a sliding time window at each time instance and produces one single 2D HR output frame, so its architecture must be devised to go from 3D to 2D. As illustrated in Fig. 2, the temporal depth of the GTFMs is kept constant until the (L−2)-th convolution layer, and from the (L−1)-th convolution layer no more temporal extrapolation is done, gradually reducing the temporal depth of the GTFMs to 1 in our 3DSRnet. As variants of the 3DSRnet, we also experimented with combinations of 3D and 2D convolution layers, simply concatenating the GTFMs created by the last 3D convolution layer and performing 2D filtering thenceforth, with the number of filters adjusted so that all architectures have a similar number of parameters. The concatenation layer is illustrated in Fig. 5. Table 2 summarizes the specifications and results of our 3DSRnet and its variants, 3DSRnet v1, 3DSRnet v2 and 3DSRnet v3, with a comparison to a 2D-CNN structure, also with a five-frame input, included to demonstrate the superior feature extraction capability of 3D-CNNs.
Fig. 6 shows the feature maps produced by the first convolution layer of the 2D-CNN and the 3DSRnet from the experiment in Table 2. The feature maps of the 3DSRnet appear much sharper, due to the shorter time window length of three. The 2D-CNN convolves all five frames at the first convolution layer, producing blurrier feature maps. Furthermore, a 3D filter in the 3DSRnet produces a GTFM containing five TFMs, one for each time instant, preserving the temporal information.
The video SR subnet of our 3DSRnet extrapolates the GTFMs with a TFM at both ends to preserve the temporal depth towards the network's deeper layers. There are different ways of extrapolation, such as simply padding with TFMs filled with zeros or duplicating the outermost TFMs. However, empirical results showed an insignificant difference in performance: 32.88 dB for duplicate extrapolation and 32.92 dB for zero-filled extrapolation. For simplicity, we chose the zero-filled extrapolation.
Table 5. PSNR/SSIM comparison on Vidset4 against image super-resolution methods (Bicubic, SRCNN, VDSR) and video super-resolution methods (VSRnet, VESPCN), and our 3DSRnet trained on the smallSet and the largeSet (*trained with the largeSet).
We test the 3DSRnet for quantitative evaluation in comparison with the state-of-the-art image and video SR methods on the Vidset4 dataset, a popular benchmark test set that contains four video sequences: Calendar, City, Foliage and Walk. The PSNR comparison against other video SR methods at scale 4 is given in Table 3, and the PSNR and SSIM comparisons against image and video SR methods at scale factors 2, 3 and 4 are given in Table 5. The results of the video SR methods [2, 3] are the performance figures reported on the same test set. The results of ,  and  are those reported in . The image SR methods [11, 12] were tested on the set using the respective codes provided by their authors. As shown in Tables 3 and 5, our 3DSRnet outperforms all the state-of-the-art image and video SR methods. Note that in Table 5, the 3DSRnet shows higher performance by an average of 0.45 dB and 0.36 dB for scales 3 and 4, respectively, compared to the best performing version (9L-E3-MC) of VESPCN , which itself outperformed its 3D-CNN based video SR version. Furthermore, as shown in Table 4, our 3DSRnet performs well on all four sequences without bias toward certain types of videos. Subjective comparisons of the image and video SR methods in Table 5 are shown in Fig. 7.
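The PSNR figures compared above follow the standard definition; a minimal sketch, assuming 8-bit Y-channel frames with a peak value of 255:

```python
import numpy as np

def psnr(pred, target, peak=255.0):
    """PSNR in dB between two frames; `peak` is the maximum pixel
    value (255 for 8-bit Y-channel frames as used in this paper)."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((4, 4))
b = np.full((4, 4), 255.0)   # maximally different frame -> 0 dB
c = np.full((4, 4), 25.5)    # one-tenth of peak error -> 20 dB
```

Identical frames give infinite PSNR, and each reduction of the error by a factor of 10 adds 20 dB, which puts the reported 0.36-0.45 dB gains into perspective.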
Although efficient, a disadvantage of the multi-channel output model  is that separate networks have to be trained for different scale factors, since the number of output channels should be the scale factor to the power of 2. Nevertheless, the 3DSRnet with a four-channel output for scale 2 can be trained as a single model for different scales. Specifically, for scale 3, the input frames are first up-scaled by 1.5 times using a bicubic filter and then fed to the scale-2 3DSRnet. Similarly, for scale 4, input frames up-scaled by 2 times are used. For training, we use a dataset that contains a mixture of subimages of all scales 2, 3 and 4, where the scale-3 and scale-4 subimages are up-scaled by 1.5 and 2 times, respectively, to match the size of the scale-2 subimages. Table 6 shows the PSNR performance of the single model trained for all scales 2, 3 and 4 on the Vidset4 dataset. The single model showed the same PSNR performance as the separately trained models for scales 2 and 3, and exhibited slightly higher performance, by an average of 0.2 dB in PSNR, for scale 4. The single model benefits from data of various characteristics covering diverse frequency ranges, even though it is not devoted to learning the training set of a certain scale.
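The single-model trick amounts to a bicubic pre-up-scaling so that the fixed ×2 network reaches the target scale; the size bookkeeping can be sketched trivially (the helper name is ours, for illustration):

```python
def single_model_io(h, w, target_scale, base_scale=2):
    """Input frames of size (h, w) are first up-scaled by
    target_scale / base_scale, then the scale-2 network multiplies
    the size by base_scale, reaching the target scale overall."""
    f = target_scale / base_scale
    in_h, in_w = round(h * f), round(w * f)
    return (in_h, in_w), (in_h * base_scale, in_w * base_scale)
```

For example, a 60×60 LR subimage at scale 3 is pre-up-scaled to 90×90 and the scale-2 network outputs 180×180, i.e. an overall ×3 mapping.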
Table 7. Scene change detection accuracy (%) of the SF subnet: 99.886 with two layers and 99.905 with three layers.
Scene changes often occur in video sequences, but they have received little attention in video SR. Without proper treatment of scene changes in the input frames, performance degradation is inevitable due to the presence of irrelevant frames. Therefore, in the case of a scene change, we swap the unrelated frames with the temporally closest frames of the same scene using the SF subnet introduced in Section 3.2, which improves the quality of the output frames significantly. Fig. 8 shows the qualitative and quantitative results of our 3DSRnet with and without frame replacement for an input with a scene change. As seen in Fig. 8 (c), if the disparate frames are instead replaced with zeros, the performance drops severely.
Let F_t denote the series of frames in the sliding time window centered at time t, where the n-th frame is the first frame just after the scene change. As illustrated in Fig. 8 (c), the windows whose middle frame lies just before the scene boundary are marked in red, and those whose middle frame lies just after it in yellow. In each such window, the frames belonging to the other scene are replaced with the temporally closest frame of the same scene as the middle frame, while the windows lying entirely before or after the boundary contain no scene change. As shown in Fig. 8 (c), the PSNR drops for the windows straddling the scene boundary. When the SF subnet is incorporated, the PSNR values of the four affected windows are enhanced by an average of 0.39, 0.46, 0.32 and 0.25 dB, respectively. Table 7 shows the scene change detection accuracy of the SF subnet architecture with two and three layers. Even with the two-layer SF subnet, a detection accuracy of 99.89% was obtained.
The inference time on an Nvidia TITAN X GPU is 166 ms and 788 ms for scale factors 2 and 4, respectively, to upscale a frame of 960×540 resolution from an input of five frames.
We propose the 3DSRnet, a video SR method that effectively captures the spatio-temporal information of the LR input frames when reconstructing HR frames, using deep 3D convolution layers with the temporal depth constantly maintained, all without prior motion alignment. The proposed 3DSRnet employs residual learning with the sub-pixel output structure, and prevents the severe performance drop due to scene changes in the multiple input frames by adopting a simple classification network. The experimental results show that our proposed 3DSRnet outperforms the state-of-the-art image and video SR methods by up to 0.45 dB in PSNR.