This repo contains source code and materials for the TEmporally COherent GAN SIGGRAPH project.
Adversarial training has been highly successful in the context of image super-resolution. It was demonstrated to yield realistic and highly detailed results. Despite this success, many state-of-the-art methods for video super-resolution still favor simpler norms such as L_2 over adversarial loss functions. This is caused by the fact that the averaging nature of direct vector norms as loss functions leads to temporal smoothness. The lack of spatial detail means temporal coherence is easily established. In our work, we instead propose an adversarial training for video super-resolution that leads to temporally coherent solutions without sacrificing spatial detail. In our generator, we use a recurrent, residual framework that naturally encourages temporal consistency. For adversarial training, we propose a novel spatio-temporal discriminator in combination with motion compensation to guarantee photo-realistic and temporally coherent details in the results. We additionally identify a class of temporal artifacts in these recurrent networks, and propose a novel Ping-Pong loss to remove them. Quantifying the temporal coherence for image super-resolution tasks has also not been addressed previously. We propose a first set of metrics to evaluate the accuracy as well as the perceptual quality of the temporal evolution, and we demonstrate that our method outperforms previous work by yielding realistic and detailed images with natural temporal changes.
Super-resolution for natural images is a classic and difficult problem in the field of image and video processing. For single image super-resolution (SISR), deep learning based methods achieve state-of-the-art peak signal-to-noise ratios (PSNR), while architectures based on Generative Adversarial Networks (GANs) achieve major improvements in terms of perceptual quality. Several studies have demonstrated [2, 38] that evaluating super-resolution tasks with traditional as well as perceptual metrics is crucial, since there is an inherent trade-off between accuracy in terms of vector norms, i.e., PSNR, and perceptual quality. In practice, the combination of both is required to achieve high quality results.
In video super-resolution (VSR), existing methods still predominantly use standard losses such as the mean squared error instead of adversarial ones. Consequently, results have so far only been evaluated according to metrics based on vector norms, e.g., PSNR and Structural Similarity (SSIM) metrics. Compared to SISR, the major challenge in VSR is to obtain sharp results that do not exhibit unnatural changes in the form of flickering artifacts. Based on mean squared losses, recent VSR methods improve temporal coherence either by using multiple frames from the low-res input, or by re-using previously generated results.
Although adversarial training can improve perceptual quality of single images, it is not commonly used for videos. In the case of video sequences, we are not only interested in arbitrary natural details, but rather those that can be generated in a stable manner over the course of potentially long image sequences. In our work, we propose a first method for adversarial recurrent training that supervises both spatial high-frequency details, as well as temporal relationships. With no ground truth motion available, the spatio-temporal adversarial loss and the recurrent structure enable our model to generate photo-realistic details while keeping the generated structures coherent from frame to frame. We also identify a new form of mode collapse that recurrent architectures with adversarial losses are prone to, and propose a bi-directional loss to remove the corresponding artifacts.
Our central contributions can be summarized as:
A first spatio-temporal discriminator for realistic and coherent video super-resolution,
A novel “Ping-Pong” loss to tackle recurrent artifacts,
A detailed evaluation in terms of spatial detail as well as temporal coherence.
We also introduce new metrics for quantifying temporal coherence based on motion estimation and perceptual distance.
Additionally, we make the following modifications to established procedures in order to improve the training process: a residual learning instead of direct inference of the image content; using feature maps of pre-trained models, as well as those of the spatio-temporal discriminator during training, as a supplemental perceptual loss; in addition to a data augmentation with single-frame motions. In combination, these contributions lead to videos that outperform previous work in terms of temporally-coherent detail, which we can quantify thanks to the proposed temporal coherence metrics.
Neural networks achieve state-of-the-art performance for both PSNR and SSIM metrics. Specifically, Kim et al. found that starting with bi-cubic interpolation and learning the residual content reduces the workload for the neural network. Although pixel-wise errors are reduced in these works, the results are still not perceptually satisfying compared to real high-resolution images. In particular, they exhibit an undesirable amount of smoothness.
Since the advent of Generative Adversarial Networks, researchers have found that an adversarial loss significantly helps in obtaining realistic high-frequency details [18, 26]. In these works, pretrained VGG networks are also used as perceptual losses to improve the similarity between generated results and references.
VSR tasks not only require realistic details, but also require details that change naturally over time in coherence with the low-resolution content. Recent works improve the temporal coherence by either using multiple low-resolution frames as inputs to generate one high-resolution frame [12, 31, 21, 19], or by recurrently generating from previously estimated high-resolution frames (FRVSR). Using a recurrent structure has the advantage of enabling the re-use of high-frequency details over time, which can improve temporal coherence. However, in conjunction with adversarial training this recurrent structure gives rise to a special form of temporal mode collapse, as we will explain below.
When using multiple low-resolution frames as input, it becomes important to align these frames, hence motion compensation becomes crucial in VSR. This motion compensation can take various forms, e.g., using variants of optical flow networks [28, 3, 27], and it can be used in conjunction with sub-pixel alignment. Jo et al. instead used learned up-sampling filters to compute detailed structures without explicit motion compensation.
While VSR methods predominantly use L2 or other standard losses, a concurrent work also proposed to use an adversarial loss. However, the proposed method focuses on a purely spatial discriminator and employs an L2 loss in time. In contrast, we will demonstrate the importance of a spatio-temporal discriminator architecture and its advantages over direct losses in more detail below.
Similar to VSR, the temporal coherence problem is a very important issue in video style transfer, since small differences in adjacent input frames could cause large differences in the generated outputs. To deal with this issue, several works [10, 25, 11, 5] propose to enforce temporal coherence by minimizing a temporal L2 loss between the current frame and the warped previous frame. However, the averaging nature of an L2 loss effectively prevents the synthesis of detailed structures, and quickly leads to networks that favor smoothness as means to establish temporal consistency. Thus, the L2 metric represents a sub-optimal way to quantify temporal coherence, and better methods have been unavailable so far. We will address this open issue by proposing two improved metrics for temporal coherence.
On the other hand, the tempoGAN architecture, and subsequently also the video-to-video synthesis approach, proposed adversarial temporal losses to achieve consistency over time. While the tempoGAN network employs a second temporal discriminator that receives multiple aligned frames to learn the realism of temporal changes, this approach is not directly applicable to videos: the network relies on a ground truth motion estimate, and generates isolated single frames of output, which leads to sub-optimal results for natural images. Concurrent to our work, the video-to-video method proposed a video discriminator in addition to a standard spatial discriminator, both of which supervise a video sequence generator. While this work also targets temporal coherence, its direction is largely orthogonal to ours. Their architecture focuses on coherent video translation, and could still benefit from our contributions in order to enhance coherence of perceptually synthesized detail.
We propose a recurrent adversarial training for VSR with three components: a recurrent generator, a flow estimation network, and a spatio-temporal discriminator. The generator, G, is used to recurrently generate high-resolution video frames from low-resolution inputs. The flow estimation network, F, learns the motion compensation between frames to aid both the generator and the spatio-temporal discriminator. During training, the generator and the flow estimator are trained together to fool the spatio-temporal discriminator. This discriminator is the central component of our method, as it can take into account spatial as well as temporal aspects, and penalize unrealistic temporal discontinuities in the results without excessively smoothing the image content. In this way, G is required to generate high-frequency details that are coherent with previous frames. Once trained, the additional complexity of the discriminator does not play a role, as only the trained models of G and F are required to infer new super-resolution video outputs.
Our generator network is based on a recurrent convolutional stack in conjunction with a network for motion estimation, similar to previous work.
The generator produces a high-resolution (HR) output for each low-resolution (LR) input frame, and recursively uses the previously generated HR output. In our case, the outputs have four times the resolution of the inputs. The generator can easily re-use details from the previous frame in order to produce coherent images. F is trained to estimate the motion between the previous and the current LR frame. Although estimated from low-resolution data, this motion can be re-sized and used as a motion compensation for the previous high-resolution output. The correspondingly warped frame, together with the current low-resolution frame, represent the inputs of G.
Unlike previous VSR methods, we propose to train our generator to learn the residual content only, which we then add to the bi-cubic interpolated low-resolution input. In line with methods for single image processing, learning the residual makes the training more stable, as the network does not overfit to the content of the image but rather focuses on only refining the estimate obtained from bi-cubic interpolation. The high-level structure of our generator is also shown in Fig. 1.
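The recurrent, residual structure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: `G` and `F` are passed in as plain callables, `upsample_nearest` is a nearest-neighbour stand-in for bi-cubic resizing, and `warp_integer` is a toy warp that shifts the whole frame by a single rounded offset instead of a dense flow field. All function names are hypothetical.

```python
import numpy as np

def upsample_nearest(img, scale=4):
    """Stand-in for bi-cubic resizing: repeat each pixel `scale` times per axis."""
    return np.repeat(np.repeat(img, scale, axis=0), scale, axis=1)

def warp_integer(img, flow):
    """Toy warp: shift the whole image by one rounded (dy, dx) offset."""
    dy, dx = int(round(flow[0])), int(round(flow[1]))
    return np.roll(img, shift=(dy, dx), axis=(0, 1))

def generate_sequence(lr_frames, G, F, scale=4):
    """Frame-recurrent inference: each output re-uses the warped previous output."""
    h, w = lr_frames[0].shape
    prev_hr = np.zeros((h * scale, w * scale))   # black image as first "previous" frame
    prev_lr = lr_frames[0]
    outputs = []
    for lr in lr_frames:
        flow_lr = F(prev_lr, lr)                 # motion estimated on the LR frames
        flow_hr = np.asarray(flow_lr) * scale    # re-sized to HR pixel units
        warped = warp_integer(prev_hr, flow_hr)  # motion-compensated previous output
        residual = G(lr, warped)                 # the network predicts the residual only
        hr = upsample_nearest(lr, scale) + residual
        outputs.append(hr)
        prev_hr, prev_lr = hr, lr
    return outputs

# Hypothetical stand-ins for the trained networks.
zero_flow = lambda prev, cur: (0.0, 0.0)
zero_residual = lambda lr, warped: np.zeros_like(warped)

frames = [np.full((2, 3), float(i)) for i in range(3)]
hr_frames = generate_sequence(frames, zero_residual, zero_flow)
```

With a zero residual, the output reduces to the up-sampled input, which is the baseline that the generator is meant to refine.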
The core novelty of our approach lies in the architecture of the discriminator network. In contrast to previous methods for VSR we propose a discriminator that receives triplets of aligned low- and high-resolution inputs in order to learn a loss function. It is important that this trained loss function can provide the generator with gradient information regarding the realism of spatial detail as well as temporal changes.
The structure of our discriminator is illustrated in Fig. 2 and Eq. (2). It receives two sets of inputs: the ground truth ones and the generated ones. Both sets have the same structure: they contain three adjacent HR frames, three corresponding LR frames with bi-cubic up-sampling, and three warped HR frames. For the ground truth inputs, the regular and warped HR frames contain natural images, while the generated input sets contain images generated by G.
In this way, the discriminator will penalize the generator if the generated frames contain less spatial detail or more unrealistic artifacts than the ground truth frames. Here, the up-sampled LR frames play the role of a conditional input. At the same time, temporal relationships between the generated frames should match those of the ground truth frames. By applying motion estimation on nearby frames, the warped inputs are typically better aligned, which simplifies the discriminator’s task to classify realistic and unnatural changes of the input data over time. The discriminator also receives the original HR images, such that it can fall back to the original ones for classifying situations where the motion estimation turns out to be unreliable.
By taking both spatial and temporal inputs into consideration, our discriminator balances the spatial and temporal aspects automatically, avoiding inconsistent sharpness as well as overly smooth results. We will demonstrate below that it is crucial that the discriminator receives information over time. Also, compared to GANs using multiple discriminators, this single spatio-temporal discriminator leads to smaller network size, and removes the need for manual weight factors of the spatial and temporal terms.
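As a rough illustration of the input structure described above, the following sketch stacks the three triplets (HR frames, warped HR frames, up-sampled LR frames) into a single array. Concatenation along the channel axis and the function name `discriminator_input` are assumptions made for this example; the text only states which frames each input set contains.

```python
import numpy as np

def discriminator_input(hr_triplet, lr_up_triplet, warped_triplet):
    """Concatenate the three frame triplets along the channel axis (assumed layout)."""
    parts = list(hr_triplet) + list(warped_triplet) + list(lr_up_triplet)
    return np.concatenate(parts, axis=-1)

h, w = 32, 32
rgb = lambda: np.zeros((h, w, 3))   # placeholder RGB frame
fake = discriminator_input([rgb()] * 3, [rgb()] * 3, [rgb()] * 3)
```

The ground truth set and the generated set would be assembled in the same way, so that the discriminator sees identically structured inputs for both.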
In the following we explain the different components of the loss functions for the generator, flow estimation, and discriminator networks.
The recurrent structure of the generative networks in conjunction with the learned discriminator loss functions is susceptible to a special form of temporal mode collapse: the networks easily converge towards strongly reinforcing spatial details over longer periods of time, especially along directions of motion. This typically severely degrades the quality of the generated images; an example is shown in Fig. 4 a). We have noticed these artifacts in a variety of recurrent architectures. They are especially pronounced in conjunction with adversarial training, and are typically smoothed out by L2 losses in conjunction with high-frequency content.
To remove this undesirable long-term drifting of details, we propose a novel loss function which we will refer to as “Ping-Pong” (PP) loss in the following. For natural videos, a sequence in forward order as well as its reversed counterpart represent meaningful video sequences. For a generated frame we can thus impose the constraint that it should be identical irrespective of the ordering of the inputs, i.e., the forward result and the one generated from the reversed sequence should be identical. Based on this observation we train our networks with extended sequences that have a ping-pong ordering, as shown in Fig. 3, i.e., with a reversed version appended at the end, and constrain the generated outputs from both “legs” to be the same. The PP loss term penalizes the distance between each forward output and its counterpart from the reversed leg, summed over the frames of the sequence.
Note that in contrast to the generator loss, the L2 norm is the correct choice here. We are not faced with multi-modal data where an L2 norm would lead to undesirable averaging, but rather aim to constrain the generator to its own, unique version over time. The PP terms for frames close to the turning point of the sequence provide constraints for short-term consistency, while the terms for frames at the start of the sequence prevent long-term drifts of the results.
This PP loss successfully removes the drifting artifacts, as shown in Fig. 4 b), and leads to an improved temporal coherence. In addition, this loss construction effectively increases the size of the training data set, and as such represents a form of data augmentation.
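The ping-pong construction can be illustrated with a small NumPy sketch. It assumes a sequence of n frames extended to 2n-1 frames by appending the reversed order without repeating the turning point, and an L2 distance between the two outputs of each frame; the function names are hypothetical and the generator outputs are replaced by placeholder arrays.

```python
import numpy as np

def ping_pong_order(n):
    """Indices 0..n-1 followed by n-2..0, i.e. 2n-1 frames in total."""
    return list(range(n)) + list(range(n - 2, -1, -1))

def ping_pong_loss(outputs, n):
    """Sum of L2 distances between forward outputs and their backward counterparts."""
    total = 0.0
    for t in range(n - 1):
        forward = outputs[t]
        backward = outputs[2 * (n - 1) - t]   # same frame index on the return leg
        total += np.sqrt(np.sum((forward - backward) ** 2))
    return total

n = 4
order = ping_pong_order(n)
# Placeholder outputs that are identical on both legs, so the loss vanishes.
outputs = [np.full((2, 2), float(i)) for i in order]
consistent = ping_pong_loss(outputs, n)

# A result that has drifted by the end of the return leg is penalized.
drifted = list(outputs)
drifted[-1] = drifted[-1] + 1.0
penalty = ping_pong_loss(drifted, n)
```

For a sequence length of 10, this ordering yields 19 frames per sequence.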
Both pre-trained NNs [7, 13, 34] as well as discriminators during training were successfully used as perceptual metrics in previous work. In our work, we use feature maps from a pre-trained VGG-19 network, as well as from the discriminator itself. By reducing the distance between feature maps of generated results and ground truth data, our generator is encouraged to produce features that are similar to the ground truth videos. In this way, better perceptual quality can be achieved.
The generator and motion estimator are trained together with a mean squared loss w.r.t. the ground truth data, the adversarial and feature space losses from the discriminator, perceptual losses of VGG-19, the PP loss, and a warping loss. The feature space terms compare feature maps of generated samples and ground truth images, taken from VGG-19 or from the discriminator. We use a standard discriminator loss to train the discriminator.
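The text refers to a standard discriminator loss without the formula surviving here. A common reading is the binary cross-entropy formulation of the original GAN objective, sketched below with NumPy. Treat this as an assumption about what "standard" means; the function names and the use of raw logits are choices made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(real_logits, fake_logits, eps=1e-12):
    """Cross-entropy loss: real inputs should score 1, generated inputs 0."""
    real_term = -np.mean(np.log(sigmoid(real_logits) + eps))
    fake_term = -np.mean(np.log(1.0 - sigmoid(fake_logits) + eps))
    return real_term + fake_term

def generator_adversarial_loss(fake_logits, eps=1e-12):
    """Adversarial term for the generator: generated inputs should score 1."""
    return -np.mean(np.log(sigmoid(fake_logits) + eps))

undecided = discriminator_loss(np.zeros(8), np.zeros(8))
confident = discriminator_loss(np.full(8, 10.0), np.full(8, -10.0))
```

An undecided discriminator (all logits zero) yields a loss of 2 ln 2, while a confident and correct one approaches zero.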
Our training data-set consists of 250 short HR videos, each with 120 frames and varying resolutions. In line with other VSR projects, our ground truth data is generated by down-sampling original videos by a factor of 2, and our LR inputs are generated by applying a Gaussian blur and then sampling every 4th pixel. We use sequences with a length of 10 and a batch size of 4 during training, i.e., one batch contains 40 frames. With the PP loss formulation, each sequence is extended to 19 frames, so the NN receives gradients from 4 × 19 = 76 frames in total for every training iteration.
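The LR generation step (Gaussian blur followed by taking every 4th pixel) can be sketched as follows. The blur width is not recoverable from the text above, so `sigma` is left as a required parameter and the value used in the usage line is arbitrary. The separable NumPy convolution with edge padding and the function names are choices made for this example.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1D Gaussian kernel, truncated at roughly 3 sigma."""
    radius = int(3 * sigma + 0.5) if radius is None else radius
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur_and_subsample(hr, sigma, step=4):
    """Blur a 2D image with a separable Gaussian, then keep every `step`-th pixel."""
    k = gaussian_kernel_1d(sigma)
    pad = len(k) // 2
    padded = np.pad(hr, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)
    return blurred[::step, ::step]

# sigma=1.0 is an arbitrary example value, not the one used for training.
lr = blur_and_subsample(np.ones((16, 16)), sigma=1.0)
```

Blurring before sub-sampling suppresses some of the aliasing that plain sub-sampling would introduce.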
Besides flipping and cropping, we also augment our data by translating a single frame over time. The temporal relationship in such augmented sequences is simpler due to the static content. We found that this form of augmentation helps the NNs to improve temporal coherence of the outputs. During training, the HR video frames are cropped into patches, and a black image is used as the first previous frame of each video sequence.
To improve the stability of the adversarial training, we pre-train G and F with a simple loss for 500k batches, before enabling the additional loss terms. During adversarial training we strengthen the generator by training it with two iterations for every training iteration of the discriminator. We use 900k batches for the adversarial training stage, in which the discriminator is correspondingly trained with 450k batches. All training parameters and details of our NN structures can be found in the appendix. Source code will be published upon acceptance.
In the following, we illustrate the effects of individual loss terms in an ablation study. While we have included temporal profiles to indicate temporal coherence of results, we refer the reader to the supplemental video, which more clearly shows the differences between the methods.
Below we compare different variants of our TecoGAN model to EnhanceNet (ENet), FRVSR, and DUF. ENet is a state-of-the-art representative of photo-realistic SISR methods, while FRVSR represents VSR methods without adversarial or perceptual losses. DUF, on the other hand, represents specialized techniques for temporally coherent detail generation.
We train several variants of our TecoGAN model: first, we train a DsOnly model, which trains G and F with a regular spatial discriminator only, in addition to the VGG-19 perceptual losses. Compared to ENet, which exhibits strong incoherence due to its lack of temporal constraints, DsOnly shows improvements in terms of temporal coherence thanks to its recurrent architecture, but there are noticeable high-frequency changes between frames. The temporal profiles of DsOnly in Fig. 5 correspondingly contain sharp and broken lines.
We then add a temporal discriminator in addition to the spatial one (DsDt). With two cooperating discriminators, this DsDt version generates more coherent results, and the resulting temporal profiles are sharp and coherent. However, this version often produces the drifting artifacts discussed in Sec. 3.2.1. Our intuition here is that these artifacts can help the generator to fool the spatial discriminator with their sharpness, while trivially fulfilling the temporal coherence requirements of the temporal discriminator.
By adding our PP loss, we arrive at the DsDt+PP model, which effectively suppresses these drifting artifacts, and also demonstrates an improved temporal coherence. In Fig. 6, DsDt+PP results in smooth yet detailed temporal profiles without streaks from temporal drifting.
Although this DsDt+PP version generates good results, it is difficult in practice to balance the generator and the two discriminators. The results shown here were achieved only after numerous runs manually tuning the discriminator weights. By using the proposed single spatio-temporal discriminator instead, we get a first complete model for our method. The single discriminator leads to a significant reduction in resource usage: using two discriminators requires ca. 70% more GPU memory, and reduces training performance by ca. 20%. The single-discriminator model yields similar perceptual and temporal quality to DsDt+PP with a significantly faster and more stable training.
Since the single-discriminator model requires fewer training resources, we also trained a larger generator with 50% more weights. In the following we will focus on this larger single-discriminator architecture with PP loss as our full TecoGAN model. Compared to the smaller model, it is able to generate more spatial details, and its training process is more stable, indicating that the larger generator and the single discriminator are more evenly balanced. Result images and temporal profiles are shown in Fig. 5 and Fig. 6.
Trained with pixel-wise vector norms, FRVSR and DUF show coherent but blurry temporal profiles. Their results also contain fewer high-frequency details. It is worth noting that the DUF model requires a relatively large number of weights (6.2 million). In contrast, our TecoGAN model generates coherent detail with a model size of 3.0 million weights.
While the visual results discussed above provide a first indicator of the quality our model achieves, quantitative evaluations are crucial for automated evaluations across larger numbers of samples. Below we present evaluations of the different models w.r.t. established spatial metrics, and we propose two novel temporal metrics to quantify temporal coherence.
As inherent disagreements between pixel-wise distances and perceptual quality for super-resolution tasks have been established [2, 1], we evaluate all methods with PSNR, a widely used pixel-wise accuracy metric, as well as the commonly used NIQE metric, which works on single images without reference. Additionally, we found the human-calibrated LPIPS metric to be very useful in our setting, as we have a ground truth reference for all low-resolution inputs. While higher PSNR values indicate a better pixel-wise accuracy, lower NIQE and LPIPS values represent better perceptual quality.
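For reference, PSNR follows directly from the mean squared error between a result and its ground truth. The sketch below assumes images scaled to the range [0, 1] (hence `max_value=1.0`); color space handling and border cropping used in practice are left out.

```python
import numpy as np

def psnr(reference, result, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    reference = np.asarray(reference, dtype=float)
    result = np.asarray(result, dtype=float)
    mse = np.mean((reference - result) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01.
value = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))
```

Because the score depends only on squared pixel differences, a blurred but well-aligned result can score higher than a sharp one with slightly displaced detail, which is the disagreement with perceptual quality mentioned above.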
Mean values for these metrics on the Vid4 scenes are shown on the left of Table 1. Trained with direct vector norms as loss functions, FRVSR and DUF achieve high PSNR scores. However, the undesirable smoothing induced by these losses manifests itself in the larger NIQE and LPIPS distances. The lack of small scale detail is identified by both metrics.
ENet, on the other hand, with no information from neighboring frames, yields the lowest PSNR, and achieves the third best NIQE score. The unnatural amount of detail, however, is reflected in a high LPIPS distance that is similar to DUF and FRVSR. With its adversarial training, the TecoGAN model achieves a very good NIQE score, with a PSNR decrease of less than 2 dB over DUF. We believe that this slight “distortion” is very reasonable, since PSNR and NIQE were shown to be anti-correlated, especially in regions where the PSNR is very high. Based on this good perceptual quality and a reasonable pixel-wise accuracy, TecoGAN achieves an excellent LPIPS score. Here, our model outperforms all other methods by more than 40%. Additional spatial examples can be found in Fig. 8.
With no ground truth velocity available between frames, evaluating temporal coherence is a very challenging problem. The simple T-diff metric was used by previous work as a rough assessment of temporal differences. We give corresponding measurements in Table 1 for reference, but due to its local nature, T-diff does not correlate well with visual assessments of temporal coherence.
Instead, we propose a tandem of two metrics to measure the consistency of the generated images over time. First, we consider the similarity of the screen-space motion between a result and the ground truth images. To this end, we compare the optical flow between consecutive ground truth frames with the optical flow between the corresponding result frames, where the flow is estimated with Lucas–Kanade. This metric can identify motions that do not correspond with the underlying ground truth, and is more robust than a direct pixel-wise metric as it compares motions instead of image content. We refer to it as tOF in the following. Second, we propose a perceptual distance, tLP. This metric employs the perceptual LPIPS metric to measure the visual similarity of two consecutive frames in comparison to the reference. The behavior of the reference needs to be considered, as the input videos also exhibit a certain natural degree of changes over time. In conjunction, both metrics provide an estimate of the similarity with the ground truth motion (tOF) as well as a perceptual estimate of the changes in the images (tLP). Both aspects are crucial for quantifying realistic temporal coherence. While they could be combined into a single score, we list both measurements separately, as their relative importance could vary in different application settings.
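Both temporal metrics share one pattern: a function of two consecutive frames is evaluated on the reference and on the result, and the two outcomes are compared. The sketch below captures only this pattern. The pair function is a pluggable callable (an optical flow estimator for tOF, an LPIPS distance for tLP), the L1-style mean absolute difference is an assumption, and `frame_difference` is a trivial stand-in used to keep the example self-contained.

```python
import numpy as np

def temporal_metric(reference, result, pair_fn):
    """Mean absolute difference of pair_fn(prev, cur) between reference and result."""
    values = []
    for t in range(1, len(reference)):
        ref_change = np.asarray(pair_fn(reference[t - 1], reference[t]), dtype=float)
        res_change = np.asarray(pair_fn(result[t - 1], result[t]), dtype=float)
        values.append(np.mean(np.abs(ref_change - res_change)))
    return float(np.mean(values))

def frame_difference(prev, cur):
    """Hypothetical stand-in for an optical flow or perceptual distance function."""
    return cur - prev

reference = [np.full((4, 4), float(t)) for t in range(5)]
steady = [f + 0.5 for f in reference]   # constant offset: same change over time
flicker = [f + (0.5 if t % 2 else -0.5) for t, f in enumerate(reference)]

steady_score = temporal_metric(reference, steady, frame_difference)
flicker_score = temporal_metric(reference, flicker, frame_difference)
```

A result with a constant error but the same evolution over time scores zero, whereas a flickering result is penalized, which is the behavior a temporal metric should have.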
In this way we can quantify the visual appearance of the changes over time, as shown on the right of Table 1. Not surprisingly, the results of ENet show larger errors for all metrics due to their strongly flickering content. Bi-cubic up-sampling, DUF, and FRVSR achieve very low T-diff errors due to their smooth results, representing an easy, but undesirable avenue for achieving coherency. However, the overly smooth changes of the former two are identified by the tLP scores.
While our DsOnly model generates sharper results at the expense of temporal coherence, it still outperforms ENet in terms of temporal coherence. By adding temporal information to discriminators, our DsDt, DsDt+PP, and TecoGAN variants improve in terms of temporal metrics. Especially the full TecoGAN model stands out here. In Fig. 7, we compare all results in terms of temporal metrics (tOF and tLP) and spatial details (LPIPS). The full TecoGAN model performs well in terms of temporal metrics, being on par with DUF and FRVSR, while at the same time outperforming them in terms of spatial detail.
We also conducted a first user study for the Vid4 scenes to check whether our metrics match the human perception of temporal changes. Despite its relatively small scale, the study confirms the ranking that is indicated by our metrics, i.e., our participants favored the TecoGAN videos over the other methods. Details are given in the appendix.
We have presented a novel adversarial approach for video super-resolution that self-supervises in terms of spatial detail as well as temporal coherence. Thanks to the perceptual loss of the adversarial training, our method can generate realistic results with sharp features and fine details. Based on a discriminator architecture that takes into account temporal aspects of the data, in conjunction with a novel loss formulation, the generated detail does not come at the expense of a reduced temporal coherence.
Since temporal metrics can trivially be reduced for smooth and blurred image content, we found it important to evaluate results with a combination of spatial and temporal metrics. We believe that our proposed temporal coherence metrics will make it possible to objectively evaluate and compare future variants and algorithmic improvements.
While our method generates very realistic results for a wide range of natural images, our method can generate sub-optimal details in certain cases, such as under-resolved faces and text. In addition, the interplay of the different loss terms in the non-linear training procedure does not provide a guarantee that all goals are fully reached every time. However, we found our method to be stable over a large number of training runs, and we anticipate that it will provide a very useful basis for a large class of video related tasks for which natural temporal coherence is crucial.
This work was supported by the ERC Starting Grant realFlow (StG-2015-637014), and we would like to thank Kiwon Um for helping with the user studies.
Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint, 2018.
Differences in the outputs can be better seen in the supplemental video: https://ge.in.tum.de/download/2019-TecoGAN/TecoGAN.mp4.
In the TecoGAN architecture, the discriminator detects the temporal relationships between its input frames with the help of the flow estimation network F. However, at the boundary of images, the output of F is usually less accurate due to the lack of reliable neighborhood information. There is a higher chance that objects move into the field of view, or leave suddenly, which significantly affects the images warped with the inferred motion. An example is shown in Fig. 9.
This increases the difficulty for the discriminator, as it cannot fully rely on the images being aligned via warping. To alleviate this problem, we only use the center region of the frames as the input of the discriminator, and we reset a boundary of 16 pixels. Thus, the inner part of each frame is left untouched, while the border regions are overwritten with zeros.
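Resetting the boundary amounts to a simple masking operation. A minimal NumPy sketch, with the hypothetical helper name `zero_border` and an arbitrary 64×64 example frame:

```python
import numpy as np

def zero_border(frame, border=16):
    """Keep the inner region of a frame and overwrite the border with zeros."""
    out = np.zeros_like(frame)
    out[border:-border, border:-border] = frame[border:-border, border:-border]
    return out

masked = zero_border(np.ones((64, 64)))
```

For the 64×64 example, the inner 32×32 region keeps its values, while the 16-pixel border no longer carries any information from the less reliable warping near the image boundary.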
The flow estimation network F should only be trained to support G in reaching the output quality as determined by the discriminator, but not the other way around. The latter could lead to F networks that confuse the discriminator with strong distortions of the warped frames. In order to avoid this undesirable case, we stop the gradient back propagation from the warped discriminator inputs to F. In this way, gradients from the discriminator to F are only back propagated through the generated samples and into the generator network.
In this way the discriminator can guide G to improve the image content, and F learns to warp the previous frame in accordance with the detail that G can synthesize. However, F does not adjust the motion estimation only to reduce the adversarial loss.
Because temporal relationships between frames in natural videos can be very complex, we employ a data augmentation with single frames translating over time. In these translating sequences, we use random offsets between consecutive HR frames, varying from [-4.0,-4.0] to [4.0,4.0]. For discriminators, this data contains simpler temporal relationships to detect. On the other hand, the data usually results in LR sequences with aliasing and jitter due to down-sampling, which is an important case generators need to learn to overcome. In all our training runs we use 70% video data, and 30% translating sequences.
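A translating sequence can be produced by moving a crop window across one still image. The sketch below makes several simplifying assumptions that are not stated in the text: offsets are drawn as integers, one offset is kept for the whole sequence, and the window starts at the image center. The helper names and the `pick_source` sampler for the 70/30 mix are hypothetical.

```python
import numpy as np

def translating_sequence(image, length, crop, max_offset=4, rng=None):
    """Crop `length` frames from one image with a constant per-frame offset."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    margin = (length - 1) * max_offset
    if h < crop + 2 * margin or w < crop + 2 * margin:
        raise ValueError("image too small for the requested sequence")
    dy, dx = (int(v) for v in rng.integers(-max_offset, max_offset + 1, size=2))
    y0, x0 = (h - crop) // 2, (w - crop) // 2
    frames = []
    for t in range(length):
        y, x = y0 + t * dy, x0 + t * dx
        frames.append(image[y:y + crop, x:x + crop])
    return frames, (dy, dx)

def pick_source(rng, video_fraction=0.7):
    """Choose between real video data and a translating sequence."""
    return "video" if rng.random() < video_fraction else "translation"

image = np.arange(100 * 100).reshape(100, 100)   # pixel value encodes its position
frames, offset = translating_sequence(image, length=10, crop=16,
                                      rng=np.random.default_rng(0))
```

Sub-pixel offsets, as the floating point range in the text suggests, would additionally require interpolation when cropping.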
To verify that our metrics capture the visual assessment of typical users we have conducted a user study.
We conducted the studies on five different methods, namely bi-cubic interpolation, ENet, FRVSR, DUF and our TecoGAN. We use the established 2AFC design [8, 33], i.e., participants have a pair-wise choice, with the ground-truth video shown as reference. One example can be seen in Fig. 10. The videos are synchronized and looped until the user has made the final decision. Participants cannot stop or influence the playback, and hence can focus more on the whole video, instead of specific spatial details. Video positions (left/A or right/B) are randomized.
After collecting 1000 votes from 50 users for every scene, i.e., twice for all possible pairs (10 pairs for five methods), we compute scores for all models with the Bradley-Terry model. The outcomes for the Vid4 scenes can be seen in Fig. 11 and Table 2.
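Bradley-Terry scores can be estimated from a matrix of pairwise preference counts with a simple iterative scheme. The sketch below uses the classic minorization-maximization update and normalizes the scores to sum to one. It is a generic illustration, not the evaluation script behind Table 2, and the reported scores may use a different scale (e.g., logarithmic).

```python
import numpy as np

def bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths; wins[i, j] counts preferences of i over j."""
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T                 # number of comparisons per pair
    p = np.ones(n) / n
    for _ in range(iterations):
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            p[i] = wins[i].sum() / denom
        p /= p.sum()
    return p

# Two methods, the first preferred in 3 of 4 comparisons.
scores = bradley_terry(np.array([[0, 3], [1, 0]]))

# Three methods with a clear ranking.
ranking = bradley_terry(np.array([[0, 8, 9], [2, 0, 7], [1, 3, 0]]))
```

Under the model, the probability that method i is preferred over method j is p_i / (p_i + p_j), so the two-method example converges to scores of 0.75 and 0.25.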
From the Bradley-Terry scores for the Vid4 scenes we can see that the TecoGAN model performs very well, achieving first place in three cases and second place in the walk scene. The latter is most likely caused by the overall slightly smoother images of the walk scene, in conjunction with the presence of several human faces, where our model can lead to the generation of unexpected details. However, overall the user study shows that users preferred the TecoGAN output over the other two deep-learning methods with a 63.5% probability.
This result also matches our metric evaluations. Table 4 shows a break-down with all metrics for individual sequences in the Vid4 test set. While TecoGAN achieves spatial (LPIPS) improvements in all scenes, DUF and FRVSR are not far behind in the walk scene. In terms of the temporal metrics tOF and tLP, TecoGAN achieves similar or lower scores compared to FRVSR and DUF for the calendar, foliage and city scenes. The lower performance of our model for the walk scene is likewise captured by higher tOF and tLP scores. Overall, the metrics confirm the performance of our TecoGAN approach. Additionally, the metrics match the results of the user studies, and indicate that our proposed temporal metrics successfully capture important temporal aspects of human perception.
The Bradley-Terry scores (standard error)
Details of spatial and temporal metric evaluations can be found in Table 4 and Table 3. Corresponding graphs are shown in Fig. 12, Fig. 13 and Fig. 14. In addition, we compare all TecoGAN variants in terms of spatial and temporal aspects in a single graph in Fig. 15. All of these metrics are evaluated on the standard Vid4 scenes (calendar, foliage, city, and walk), as well as on three scenes from the Tears of Steel movie (room, bridge, and face). In our metric calculations, we follow the procedures of previous work [12, 27]. The following operations aim to make the outputs of all methods comparable, since some of the published image sequences from other works contain fewer frames or have reduced resolutions. For all result images we first exclude spatial borders with a distance of 8 pixels to the image sides, then further shrink borders such that the LR input image is divisible by 8. For spatial metrics, we ignore the first two and the last two frames; for temporal metrics, we ignore the first three and the last two frames, as an additional previous frame is required for inference.
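The cropping and frame-selection steps can be sketched as below. The helper names `crop_for_metrics` and `frames_for_metrics` are hypothetical. Two interpretations are assumptions of this sketch: that "LR input divisible by 8" at the 4x up-sampling factor means the HR size must be a multiple of 32, and that the extra shrinking is applied symmetrically around the center.

```python
import numpy as np


def crop_for_metrics(hr_frames, border=8, scale=4, multiple=8):
    """Crop HR results of shape [T, H, W, C] before computing metrics."""
    # Step 1: exclude an 8-pixel border on every side.
    cropped = hr_frames[:, border:-border, border:-border]

    # Step 2: shrink further so that the corresponding LR size (H / scale,
    # W / scale) is divisible by `multiple`. Assumption: centered crop.
    step = scale * multiple
    height, width = cropped.shape[1:3]
    new_h = (height // step) * step
    new_w = (width // step) * step
    top = (height - new_h) // 2
    left = (width - new_w) // 2
    return cropped[:, top:top + new_h, left:left + new_w]


def frames_for_metrics(frames, temporal=False):
    """Drop leading and trailing frames: 2 + 2 for spatial, 3 + 2 for temporal."""
    first = 3 if temporal else 2
    return frames[first:-2]


# Usage on a dummy 12-frame clip of size 144x176.
clip = np.zeros((12, 144, 176, 3), dtype=np.uint8)
spatial_input = frames_for_metrics(crop_for_metrics(clip))
temporal_input = frames_for_metrics(crop_for_metrics(clip), temporal=True)
```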
In this section, we use the following notation to specify the network architecture: conc represents the concatenation of two tensors along the channel dimension; stands for the convolution and transposed convolution operation, respectively; “+” denotes element-wise addition; BilinearUp2 up-samples input tensors by a factor of 2 using bi-linear interpolation; BicubicResize4(input) increases the resolution of the input tensor by a factor of 4 via bi-cubic up-sampling; is a densely-connected layer, which uses Xavier initialization for the kernel weights. Each contains the following operations:
The architecture of our generator G is:
In , there are 10 sequential residual blocks in the generator ( ), while the TecoGAN generator has 16 residual blocks ( ). The spatio-temporal discriminator’s architecture () is:
Discriminators used in our variant models, DsDt, DsDt+PP and DsOnly, have a similar architecture as . They only differ in terms of their inputs. The flow estimation network F has the following architecture:
Here, MaxVel is a constant vector, which scales the network output into the normal velocity range.
|Ds: 1e-3||Ds: 1e-3, Dt: 3e-4||Dst: 1e-3||Dst: 1e-3|
|:1.67e-3, :1.43e-3, :8.33e-4, :2e-5|
|relu22: 3e-5, relu34: 1.4e-6, relu44: 6e-6, relu54: 2e-3|
In the pre-training stage, we train F and a generator with 10 residual blocks. An Adam optimizer with is used throughout. The learning rate starts from and decays by 50% every 50k batches until it reaches . This pre-trained model is then used as the initial state for all TecoGAN variants.
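The pre-training schedule described above amounts to a step decay with a lower bound. The sketch below is one possible reading, assuming a staircase decay (the rate halves exactly at multiples of 50k batches). The function name is hypothetical, and the initial and minimum learning rates are left as required arguments; the values in the usage lines are placeholders, not the paper's settings.

```python
def pretrain_learning_rate(step, initial_lr, min_lr, decay_every=50_000, decay_rate=0.5):
    """Learning rate after `step` batches: halve every `decay_every`, clamp at `min_lr`."""
    halvings = step // decay_every  # number of completed decay intervals
    return max(initial_lr * decay_rate ** halvings, min_lr)


# Usage with placeholder values (NOT taken from the paper).
example_initial, example_floor = 1e-3, 1e-5
rates = [pretrain_learning_rate(s, example_initial, example_floor)
         for s in (0, 49_999, 50_000, 100_000, 10_000_000)]
```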
In the adversarial training stage, all TecoGAN variants are trained with a fixed learning rate of . We found that learning rate decay is not necessary due to the non-saturated GAN loss. The generators in DsOnly, DsDt, DsDt+PP and have 10 residual blocks, whereas the TecoGAN model has 6 additional residual blocks in its generator. Therefore, after loading 10 residual blocks from the pre-trained model, these additional residual blocks are faded in smoothly with a factor of . We found this growing training methodology, first introduced by Growing GAN, to be stable and efficient in our tests.
In DsDt and DsDt+PP, extra parameters are used to balance the two cooperating discriminators properly. Through experiments, we found to be stronger. Therefore, we reduce the learning rate of to in order to keep both discriminators balanced. At the same time, a factor of 0.0003 is used on the temporal adversarial loss to the generator, while the spatial adversarial loss has a factor of 0.001.
During training, input LR video frames are cropped to a size of , and a recurrent length of 10 is used. In all models, the Leaky ReLU operation uses a slope of 0.2 for the negative half space. Additional training parameters are listed in Table 5.
TecoGAN is implemented in TensorFlow. Generator and discriminator are trained together, but we only need the trained generator for generating new outputs after training, i.e., we can discard the discriminator network. We run all models on the same Nvidia GeForce GTX 1080Ti GPU with 11 GB of memory. The resulting performance data is given in Table 6.
| Methods | Model weights | Time (ms/frame) |
|---|---|---|
| FRVSR | 0.8M (SRNet) + 1.7M (F) | 94.56 |
| | 0.8M (G) + 1.7M (F) | 95.00 |
| TecoGAN | 1.3M (G) + 1.7M (F) | 101.47 |
The model and FRVSR have the same number of weights (843587 in the SRNet, i.e. generator network, and 1.7M in F), and thus show very similar performance characteristics. The larger TecoGAN model with 1286723 weights in the generator is slightly slower than . However, compared with the DUF model, which has more than 6 million weights in total, TecoGAN performs significantly better thanks to its reduced size.