Video prediction refers to the problem of generating pixels of future frames given context information in the form of past frames. The problem has attracted a lot of attention in the context of generative video models. The ability to predict the future accurately has applications in various domains including robotics for path planning, self-driving cars, anomaly detection [liu2018future] and video compression. It has also been shown that solving this problem offers a fundamental approach to learning internal representations of videos [srivastava2015unsupervised, liang2017dual, byeon2018contextvp]. Further, the problem also helps in understanding interactions of physical objects in the real world [finn2016unsupervised, janner2018reasoning]. While researchers have largely focused on the problem of predicting all pixels in future frames [srivastava2015unsupervised, mathieu2016deep], in task-specific goals such as predicting object motion due to actions, we might only be interested in predicting relevant features in future frames [agrawal2016learning]. Nevertheless, we believe that the problem of predicting all pixels in future frames allows for rich self-supervision, a visual interpretation of the predicted frames and a more generic approach to learning across different applications. The video prediction problem leads to an important question of how to generically evaluate the realism or naturalness of the predicted videos in a task-free viewing condition.
While there exists a rich body of work on video prediction using generative models, the design of methods for evaluating the naturalness or realism of the videos has received much less attention. Simple signal fidelity measures such as mean squared error (MSE) or the structural similarity (SSIM) index [wang2004image] can be computed in scenarios where a reference future video sequence is available. However, for a given context, there might exist a multitude of possible future video trajectories that are natural looking. It would be unfair to compare such predicted videos against a given future realization. This leads to the question of what we really mean by a natural video and how it can be quantified.
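The limitation of reference-based fidelity measures noted above can be illustrated with a toy example. The sketch below is not taken from the paper: the 8x8 frames, the 2x2 square and the helper names `mse` and `frame_with_square` are all illustrative assumptions. It compares a sharp but different plausible future, and a blurry average of two futures, against one realized future.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two equally sized frames."""
    return float(np.mean((a - b) ** 2))

def frame_with_square(col, size=8):
    """Blank frame with a 2x2 bright square whose left edge is at column `col`."""
    frame = np.zeros((size, size))
    frame[3:5, col:col + 2] = 1.0
    return frame

realized = frame_with_square(col=5)        # the future that was actually recorded
alternative = frame_with_square(col=1)     # a sharp, equally plausible future
blurred = 0.5 * (realized + alternative)   # a blurry hedge between both futures

print(mse(alternative, realized))  # 0.125
print(mse(blurred, realized))      # 0.03125
```

In this toy setting the blurred frame obtains the lower MSE even though the sharp alternative would arguably look more natural, which is the kind of mismatch that motivates a reference-free notion of naturalness.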
The definition of naturalness of predicted videos needs to capture multiple notions. The visual quality of the predicted frames is an important aspect of assessing video naturalness. Indeed, video prediction researchers have identified the sharpness of predicted frames as an important evaluation tool [mathieu2016deep]. Video quality is more complicated than merely evaluating the spatial quality of frames. Object motion and temporal consistency are important elements of video quality and popular no-reference video quality indices seek to model such aspects. The spatial naturalness of video frames is also influenced by the realism of object shapes, texture and consistency of relative positions of different objects. The semantic consistency of predicted videos with logic and physics is also an important aspect of naturalness. In other words, the events unfolding in a video need to make logical sense and also obey the physical laws of motion. In summary, the notion of naturalness is much more complicated and nuanced when compared to perceptual quality. It appears to involve elements of both early and later stages of the human visual system. The broader question of evaluating the naturalness of any video instead of a predicted video is also important. In this work, we particularly focus on predicted videos given the rich literature on both datasets and generative prediction models.
The main focus of our work is in the subjective and objective study of the naturalness of predicted videos. Recently, small scale subjective studies through two-alternative forced choice (2AFC) experiments on predicted videos and camera captured videos have been carried out to prove the effectiveness of specific video prediction models [lee2018stochastic]. While human opinion might be the best subjective measure of naturalness, collecting such human data is cumbersome and it is desirable to have an objective automatic measure of naturalness that can be evaluated on any video. Instead of a binary certificate of naturalness on predicted videos, we believe a continuous valued measure will be more useful in comparing various prediction methods.
Very recently, the Fréchet video distance (FVD) was introduced to evaluate generative models and validated using a subjective study [unterthiner2019fvd]. The distance is meant to be applied on a collection of generated videos instead of individual videos and is thus different from our goal to measure naturalness. Further, the study is designed primarily to prove the effectiveness of FVD while we seek to design a study that can help benchmark and advance research in measuring naturalness of any predicted video. To the best of our knowledge, there exists no human study on predicted videos that measures naturalness.
I-A Overview of Contributions
Our main contributions in this work are in the creation of a database of predicted videos, design of a subjective study, benchmarking of existing objective methods used to evaluate naturalness and introduction of mechanisms leading to improved prediction of video naturalness. We create the Indian Institute of Science VIdeo Naturalness Evaluation (IISc VINE) Database consisting of 300 videos, each consisting of 20 frames, obtained from a variety of different prediction models [lee2018stochastic, lotter2017deep, villegas2017mcnet, babaeizadeh2018stochastic, denton2018stochastic, liu2018dyan, aigner2019futuregan]. The videos are generated by applying both deterministic and stochastic prediction models on video databases typically used to evaluate them [finn2016unsupervised, schuldt2004recognizing, dollar2011pedestrian, soomro2012ucf101, geiger2013vision, zhang2013actemes, msr2016action, ebert2017self, yu2018bdd100k]. Our database contains a variety of sources of unnaturalness or distortions such as blurred frames, frames with distorted object shapes, temporal color variations and sudden appearance or disappearance of objects as shown in Figure 1. Thus our database is very diverse in terms of content and distortions.
We conduct a subjective study involving 50 human subjects resulting in a total of 6000 video ratings under calibrated conditions. Since the videos from different databases are available at different resolutions and might bias the naturalness scores, we adopt a double stimulus continuous naturalness evaluation method. In our study, a pair of videos is shown, one being the test video and the other, a different natural video from the same or a similar dataset.
We benchmark several popular video quality measures such as MSE, SSIM and deep network based loss functions against the subjective scores of naturalness. We show that these measures do not correlate well with the subjective scores since they are evaluated by assuming a fixed trajectory of the reference. We also show that popular no-reference video QA algorithms do not match well with subjective judgements of naturalness implying that quality and naturalness can be qualitatively different.
Finally, we introduce two novel sets of features to effectively predict the naturalness of predicted videos. The first set of features is based on computing cosine similarities of deep features of past frames with corresponding motion compensated features from the predicted frames. This helps capture object blur, shape and color distortions in a robust fashion by comparing with the past frames. Secondly, we rescale frame differences of adjacent frames of the predicted video to appear like an image and extract corresponding deep features to capture object shape variations in regions containing motion. We show that these features can effectively predict naturalness by achieving state of the art performance in terms of correlation with the subjective scores.
We summarize the main contributions of our work as follows:
We introduce the IISc VINE database of 300 videos predicted using a variety of models and based on multiple datasets.
We conduct a behavioural study with 50 subjects to measure the naturalness of the predicted videos through a double stimulus scoring mechanism.
We benchmark several metrics popularly used in video prediction evaluation and show that existing metrics correlate poorly with human perception of naturalness.
We propose novel features based on motion compensated cosine similarities and rescaled frame differences and show that they are useful in predicting naturalness in a manner that agrees very well with human perception.
The rest of the paper is organized as follows. In Section II, we survey related work. We describe the video naturalness evaluation database and the subjective study in Section III. We introduce our naturalness evaluation features in Section IV. We present detailed experiments and ablation studies in Section V and finally conclude the paper in Section VI.
II Related Work
II-A Evaluation methods for video prediction and generation models
The most popular method of evaluating predicted video frames is using MSE or the SSIM index [wang2004image]. In a variant of MSE, areas with higher motion are weighted preferentially using optical flow based weights [mathieu2016deep]. Other measures that involve comparison with a reference include squared error [janner2018reasoning] and cosine similarity [lee2018stochastic, kumar2019videoflow] in the pre-trained VGG net [simonyan2015very] feature space. The inception score for images [salimans2016improved] has also been applied to evaluate generated video frames [he2018probabilistic, xu2018video]. The Fréchet inception distance for images has been extended to videos through FVD [unterthiner2019fvd]. In particular, features based on the Inflated 3D ConvNet are used to compute a distance measure between a set of generated videos and a database of pristine videos. FVD was validated using a human study through pairwise tests on the BAIR dataset [ebert2017self]. Further, 2AFC experiments were conducted to evaluate a few video prediction models [lee2018stochastic].
II-B Video quality assessment
Video quality assessment (VQA) has been studied quite extensively over the last decade or so with the conduct of several studies of subjective quality and the design of successful objective algorithms. Publicly available VQA databases include those containing synthetic distortions such as the LIVE VQA database [seshadrinathan2010study] and the EPFL-PoliMI dataset [de2010h], and those containing authentic camera captured distortions such as the LIVE Video Quality Challenge (LIVE VQC) Database [sinno2019large] and the KoNViD-1k database [hosu2017konstanz]. VQA algorithms are broadly divided into two categories, full reference (FR) and no reference (NR) algorithms. FR VQA algorithms utilize a reference video to predict the quality of a distorted video by exploiting both spatial and temporal similarity. Some examples of successful FR algorithms that exploit spatio-temporal information include MOVIE [seshadrinathan2009motion], ST-MAD [vu2011spatiotemporal] and VMAF [li2018vmaf]. These algorithms either compute spatio-temporal transformations or obtain quality features separately in the spatial and temporal domains and combine them.
The lack of availability of a true reference in several scenarios motivates the design of NR algorithms. The NR VQA problem has been found to be much more challenging than the FR problem and current NR algorithms are not yet as successful as the FR algorithms. Video BLIINDS [saad2014blind], VIIDEO [mittal2016completely] and SACONVA [li2016no] are a few examples that have been able to approach the performance of FR algorithms. Recently, deep neural networks have been used to obtain good performance on authentic distortions [li2019quality]. Nevertheless, the use of convolutional neural networks to design successful NR VQA algorithms is still a nascent and active area of research.
II-C Naturalness in other contexts
The notion of naturalness in other contexts has been studied through visual realism and naturalness of videos of human motion. In [fan2017image], the authors define visual realism of images as a combined measure of familiarity of objects, naturalness of color and illumination. The goal of this work is to distinguish between camera captured photos and computer generated graphics content. The authors in [ren2005data] attempt to quantify naturalness in human motion for applications of synthetic motion. This work is restricted to human motion alone, and to synthetic videos in particular.
III Video Naturalness Evaluation Database
We now describe in detail the IISc VIdeo Naturalness Evaluation (IISc VINE) database, our subjective study and important observations from the study.
The videos in our database are generated by various video prediction algorithms. These video prediction algorithms are trained on a variety of datasets containing human actions, sports videos, vehicle driving and robot pushing videos. In our database, we use a combination of publicly available pre-trained models of different prediction algorithms and also models that we train on other datasets.
Video Prediction Models: We use a total of seven video prediction models. The models can be broadly classified as deterministic and stochastic. The deterministic models are trained to predict the future frames, exactly as in the ground truth video. The deterministic models we use are PredNet [lotter2017deep], MCnet [villegas2017mcnet], Future GAN [aigner2019futuregan] and DYAN [liu2018dyan]. On the other hand, the stochastic models are based on the premise that the future is uncertain and hence for any given context, there are multiple plausible future trajectories. These models are trained to predict a distribution of possible futures using noise as input. For our database, we select one of the futures predicted by these models. We use videos generated by SAVP [lee2018stochastic], SV2P [babaeizadeh2018stochastic], SVG-LP [denton2018stochastic] and some of their ablation models in our database. Along with the videos predicted by these models, we also include ground truth or natural videos from these datasets in our database. This forms 10% of our database and is helpful to validate various aspects of the study, such as biases due to different resolutions and whether the subjects are able to comprehend the notion of naturalness.
We apply the video prediction models on nine different datasets typically used in their evaluation. These include BAIR [ebert2017self], PUSH [finn2016unsupervised], KTH [schuldt2004recognizing], MSR [msr2016action], UCF-101 [soomro2012ucf101], PENN [zhang2013actemes], KITTI [geiger2013vision], Caltech Pedestrian [dollar2011pedestrian] and BDD100K [yu2018bdd100k]. Among the above datasets, the BAIR robot push dataset is highly stochastic i.e. the movement of the robotic arm given the current frame is random. The other datasets have relatively lower stochasticity as argued in [lee2018stochastic]. For the sake of simplicity, we refer to these datasets as deterministic datasets. The videos in our database include those generated by applying stochastic models on stochastic datasets, stochastic models on deterministic datasets and deterministic models on deterministic datasets. Using the above combinations, we generate a large number of videos. Among them, we select 300 videos to cover different kinds of unnaturalness at varying levels. Table I shows the number of videos taken from each dataset.
Distortions: We observe a variety of sources of unnaturalness due to different video prediction algorithms. The loss of naturalness is primarily seen in the form of blurred frames or distorted object shapes. The use of pixel level loss measures such as mean squared error in training video prediction algorithms can lead to blurred frames [lotter2017deep] as shown in Figure 1a. We observe that algorithms trained using adversarial loss functions [villegas2017mcnet, aigner2019futuregan] result in distortions of object shapes in frames further into the future as shown in Figure 1b. This primarily occurs in objects with reasonable motion. We also notice the sudden appearance or disappearance of objects, defying logic, as shown in Figure 1c. Occasionally, we observe inexplicable color variations during the video trajectory that look unnatural as shown in Figure 1d.
Further, we see different kinds of shape distortions such as deformations (Figure 2a), splitting (Figure 2b) and elongations of objects (Figure 2d). In some videos, we witness a combination of shape distortions with object disappearance (Figure 2c). We note that shape distortions are highly localized, while the rest of the video frame looks completely natural. This renders the problem of predicting naturalness in such scenarios very challenging.
Video Resolution and Duration: Since different video prediction models available in literature are trained to generate videos at different resolutions, the videos in our database are of varying resolutions. The resolutions include 64x64, 128x128, 160x128, and 320x240. We discuss the implications of this aspect of the database and the normalization required while conducting the subjective study in Section III-B. All videos generated by the prediction algorithms have 4 context frames and 16 predicted frames. Following [lee2018stochastic], where a small scale subjective evaluation (2AFC experiment) was conducted, we use a frame rate of 4fps for all the videos. Thus, each video is of duration 5 seconds during playback.
III-B Subjective Study
We conduct a subjective study to assess the naturalness of the predicted videos. Since the subjective evaluation of naturalness of predicted videos has not been studied before and it is not clear a priori how humans would respond to the task of assessing naturalness, we conduct the study in a controlled lab environment. Our study provides a platform to evaluate existing metrics and help design newer measures with better perceptual correlation. In our study, 50 subjects participated under calibrated viewing conditions and all the subjects viewed the videos on a 24 inch LED monitor. Each subject rated a total of 120 videos, 60 each in two sessions, each session lasting around half an hour and separated by a minimum of 24 hours. For each subject, the videos were presented in a random sequence. Each video is rated by an equal number of subjects. Since there are 300 videos in our database, we obtain a total of 20 human scores for each video.
Since it is difficult to perceptually understand the lower resolution videos in our database, such videos are upsampled using bicubic interpolation and shown during the subjective study. In order to remove any biases in the scoring of such upsampled videos, we employ a double stimulus continuous naturalness evaluation scoring mechanism. Here, a reference video with similar content at the same resolution as the evaluation video is also upsampled and shown on the left while the evaluation video is shown on the right. The subjects are asked to rate the naturalness of the evaluation video on a scale between 0 and 100 assuming that the reference video shown would correspond to a score of 100. We show in Section III-C that such upsampling does not bias the naturalness scores of the upsampled videos.
Since most of the videos in the database show a degradation of naturalness with time, we asked the subjects to take into account the entire 5s duration video and provide a single holistic score of the naturalness. The videos are looped continuously and the subjects can view them as long as desired before providing a rating on a continuous scale that appears at the bottom of the screen. Every subject is shown 6 videos prior to the start of the study in each session. This allows the subject to get a sense of the range of naturalness levels and different kinds of loss of naturalness in the database.
Processing of Subjective Scores: We process the collected subjective scores to obtain a mean opinion score (MOS) of naturalness for every video following well established procedures in VQA [seshadrinathan2010study]. In particular, we subtract the mean from the scores of each subject in each viewing session and divide by the corresponding standard deviation to obtain 'Z-scores'. We then apply the subject rejection procedure outlined in the ITU-R BT.500-11 recommendation [itu2002methodology] to remove the outlier subjects. In our study, we found 7 out of 50 subjects to be outliers. The scores from the inlier subjects are then rescaled linearly to lie between 0 and 100 and the MOS for every video is computed as the average rescaled Z-score across all inlier subjects who rated that video. Figure 3 shows the distribution of MOS where we see that more than 90% of the scores lie in the range [30,80]. Such a distribution of scores presents a challenging test condition for naturalness evaluation methods. In Figure 3, we observe a small peak around a MOS value of 75. This peak is due to the presence of natural videos in our database.
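The score processing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `z_scores` and `mean_opinion_scores` are hypothetical, the linear rescaling assumes Z-scores lie in [-3, 3] (the text only says the rescaling is linear), and the ITU-R subject rejection step is omitted.

```python
import numpy as np

def z_scores(ratings, sessions):
    """Normalize raw ratings per subject and per viewing session.

    ratings:  (num_subjects, num_videos) array, NaN where a subject did not rate a video.
    sessions: (num_subjects, num_videos) integer array with the session of each rating.
    """
    ratings = np.asarray(ratings, dtype=float)
    z = np.full(ratings.shape, np.nan)
    for s in range(ratings.shape[0]):
        rated = ~np.isnan(ratings[s])
        for sess in np.unique(sessions[s, rated]):
            mask = rated & (sessions[s] == sess)
            vals = ratings[s, mask]
            # Subtract the session mean and divide by the session standard deviation.
            z[s, mask] = (vals - vals.mean()) / vals.std(ddof=1)
    return z

def mean_opinion_scores(z, z_range=3.0):
    """Rescale Z-scores to [0, 100] and average over the subjects who rated each video."""
    # Assumption: Z-scores in [-z_range, z_range] are mapped linearly onto [0, 100].
    rescaled = 100.0 * (z + z_range) / (2.0 * z_range)
    return np.nanmean(rescaled, axis=0)

# Toy data: 4 subjects rate 6 videos, 3 videos in each of two sessions.
rng = np.random.default_rng(0)
ratings = rng.uniform(0, 100, size=(4, 6))
sessions = np.tile(np.array([0, 0, 0, 1, 1, 1]), (4, 1))
z = z_scores(ratings, sessions)
mos = mean_opinion_scores(z)
print(mos.shape)
```

Normalizing per session rather than per subject compensates for a subject using the rating scale differently on different days.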
III-C Observations from the Subjective Study
III-C1 Consistency of subjects
We check the consistency of the subjective scores of the inlier subjects through the following experiment. We randomly split the inlier subjects into two halves and compute MOS for each video in each half of the population. We then compute the Pearson’s linear correlation coefficient (PLCC) between the MOS coming from each half. Figure 4 shows a scatter plot of MOS obtained from each half for one such split, where we observe high correlation between MOS from the two halves. Further, we compute the median PLCC across 100 random splits of the population, which works out to 0.94. This shows that the subjects are fairly consistent in assessing the naturalness of the videos. This also provides a reasonable upper bound on the correlation with the subjective scores, which we can expect from objective measures of naturalness.
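The split-half consistency check can be sketched as below. The function name `split_half_plcc`, the NaN convention for unrated videos and the synthetic ratings are assumptions made for illustration; only the procedure (random halves, per-half MOS, PLCC, median over 100 splits) comes from the text.

```python
import numpy as np

def split_half_plcc(scores, num_splits=100, seed=0):
    """Median PLCC between the MOS of two random halves of the subjects.

    scores: (num_subjects, num_videos) array of processed scores, NaN where unrated.
    """
    rng = np.random.default_rng(seed)
    num_subjects = scores.shape[0]
    plccs = []
    for _ in range(num_splits):
        order = rng.permutation(num_subjects)
        half_a = order[: num_subjects // 2]
        half_b = order[num_subjects // 2:]
        counts_a = np.sum(~np.isnan(scores[half_a]), axis=0)
        counts_b = np.sum(~np.isnan(scores[half_b]), axis=0)
        # Keep only videos rated by at least one subject in each half.
        valid = (counts_a > 0) & (counts_b > 0)
        mos_a = np.nanmean(scores[half_a][:, valid], axis=0)
        mos_b = np.nanmean(scores[half_b][:, valid], axis=0)
        plccs.append(np.corrcoef(mos_a, mos_b)[0, 1])
    return float(np.median(plccs))

# Synthetic ratings: a latent naturalness per video plus per-subject noise.
rng = np.random.default_rng(1)
latent = rng.uniform(30, 80, size=50)
scores = latent[None, :] + rng.normal(0.0, 5.0, size=(20, 50))
median_plcc = split_half_plcc(scores)
print(round(median_plcc, 3))
```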
III-C2 Validation of our subjective study
We now study the average MOS of the natural videos and predicted videos in Table II. We clearly see that average MOS for natural videos is higher than that of predicted videos. This shows that the subjects are able to comprehend the notion of naturalness.
In order to study the impact of upsampling low resolution videos on the subjective scores, we compare the average MOS of upsampled videos (those at lower native resolutions) and non-upsampled videos (those at higher native resolutions) in Table II. We conduct this test on natural videos to avoid any bias due to the distortions present in the predicted videos. We observe that the average MOS for the upsampled videos is comparable to that of the videos at their original higher resolutions. In order to verify the statistical indistinguishability of the MOS in each case, we also conduct a t-test [casella2002statistical] at the 99% confidence level. The null hypothesis is that the means of the MOS values for both groups are equal and the alternate hypothesis is that the means are different. The $p$-value of the t-test exceeds the significance threshold and hence the null hypothesis cannot be rejected. Thus we conclude that the upsampled videos do not suffer from any biases in their subjective ratings.
Table II: | Experiment Type | No. of Videos | Average MOS |
III-C3 How does MOS vary for different distortions?
We investigate the effect of different distortions on human perception. We observe that shape distortions and blur are the two predominant classes of distortions in the predicted videos. We roughly classify the videos into those that contain shape distortions and those that contain blur. Some videos have both distortions in which case they are marked under both categories. The resulting MOS for the two classes of videos is shown in Table II. We find that the average MOS for videos with blur is roughly equal to the average MOS for videos with shape distortion.
The use of adversarial loss functions in training video prediction models gained popularity since the use of MSE as a loss function leads to blurred predictions. However, we observe that use of adversarial loss functions leads to shape distortions which can also reduce the MOS. Adversarial loss functions tend to measure global consistency with a database of natural videos and localized shape distortions may not be captured even though they appear to be perceptually annoying. Since the MOS for both kinds of distortions is roughly equal, we believe that adversarial loss functions may not be helping improve the overall naturalness.
III-C4 Do stochastic models perform better than deterministic models?
We seek to understand whether modeling the stochasticity of future trajectories in video prediction affects the naturalness of the predicted video. As we pointed out earlier, deterministic methods [mathieu2016deep, villegas2017mcnet] pick only one of the multiple plausible trajectories. On the other hand, stochastic approaches train the model to predict multiple future trajectories [lee2018stochastic, babaeizadeh2018stochastic, denton2018stochastic]. Table II shows the average MOS and standard deviation with respect to the two methods described above. We see that the average MOS is lower for deterministic methods when compared to stochastic models. We also verify the statistical significance of this observation using a t-test [casella2002statistical] at the 99% confidence level. The null hypothesis is that the mean MOS of the two groups are equal and the alternate hypothesis is that the mean MOS of stochastically predicted videos is higher than that of deterministically predicted videos. The $p$-value of the t-test falls below the significance threshold and hence the null hypothesis can be rejected. Thus, we can conclude that the ability of stochastic models to better capture the uncertainty in the future trajectories allows them to generate more natural looking videos.
IV Deep Feature Processing for Video Naturalness Evaluation
We now present two sets of features that are particularly relevant in reliably predicting naturalness of predicted videos. The first set of features is motivated by the observation that objects in a scene are well represented in the past frames and can be used to measure how representations evolve in future predicted frames. Thus we exploit the rich information available in the deep features of objects in the past frames and make motion compensated comparisons of deep features in predicted frames. We capture this idea through motion compensated cosine similarity based features. This feature also helps identify the disappearance or vanishing of objects suddenly from the middle of a scene. Secondly, we observe that most of the abnormalities in predicted videos occur in regions of motion. In order to capture variations in representations in moving regions and also more carefully measure distortions in object shapes, we introduce the notion of rescaled frame differences and compute deep features from such images. We provide further details of both features in the following subsections.
IV-A Motion-compensated Cosine Similarity (MCS) features
We now describe the computation of the motion compensated cosine similarity between the deep features of the last context frame and motion compensated features of predicted frames as illustrated in Figure 5b.
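We experiment with different networks to obtain deep features such as VGG-19, ResNet-50 and Inception-v3 and refer to one such network in the following. Let $N$ be the total number of frames, $N_c$ be the number of context frames and $N_p$ be the number of predicted frames. Thus $N = N_c + N_p$. Let $C$ be the number of channels in the pretrained model, at the layer where we tap the features. Let $H$ and $W$ be the height and width of the corresponding feature map.
Let $f_t(i, j, k)$ denote the deep feature at location $(i, j)$ in Channel $k$ in Frame $t$, where $i \in \{1, \ldots, H\}$, $j \in \{1, \ldots, W\}$, $k \in \{1, \ldots, C\}$ and $t \in \{1, \ldots, N\}$. Let the cosine similarity between two vectors $\mathbf{x}$ and $\mathbf{y}$ be defined as
$$\mathrm{CS}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{\|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2},$$
where $\|\cdot\|_2$ denotes the two-norm of the vector. Let $\mathbf{f}_t(i, j) \in \mathbb{R}^C$ denote a vector of deep features across channels at location $(i, j)$ in Frame $t$. For a given feature $\mathbf{f}_{N_c}(i, j)$ in Frame $N_c$, the corresponding motion compensated feature in Frame $t$ with $t > N_c$ is obtained as
$$\hat{\mathbf{f}}_t(i, j) = \mathbf{f}_t(i^*, j^*), \quad (i^*, j^*) = \arg\max_{(l, m)} \mathrm{CS}\big(\mathbf{f}_{N_c}(i, j), \mathbf{f}_t(l, m)\big).$$
In other words, for every location in the context frame, we determine the location in the predicted frame with the best cosine similarity in the feature space. Thus we obtain the motion compensated features in each predicted frame and compute the MCS feature in Frame $t$ and Channel $k$ as
$$\mathrm{MCS}_t(k) = \mathrm{CS}\big(\mathbf{f}_{N_c}^{k}, \hat{\mathbf{f}}_t^{k}\big),$$
where $\mathbf{f}_{N_c}^{k}$ denotes the vectorized deep features across spatial locations in Frame $N_c$ and Channel $k$ and $\hat{\mathbf{f}}_t^{k}$ is also defined similarly. This gives us a $C$ dimensional MCS feature vector per frame. We concatenate the MCS features from all predicted frames to get an $N_p C$ dimensional feature vector.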
The MCS features are important in capturing several aspects such as object blur, distortion of shapes, abnormal disappearance of objects from the middle of a scene and change in object color. We believe that the natural disappearance of objects from scenes (such as objects moving out of the field of view) can be distinguished from unnatural ones by observing the trajectory of MCS features across frames. However, we observe that the occurrence of such events is relatively less likely owing to the limited future duration over which video prediction occurs.
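The MCS computation can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function name `mcs_features`, the `(num_frames, H, W, C)` array layout, the exhaustive search over all spatial locations and the small `eps` stabilizer are assumptions, and the deep feature maps are taken as given instead of being extracted from a pretrained network.

```python
import numpy as np

def mcs_features(feats, num_context, eps=1e-8):
    """Motion compensated cosine similarity features.

    feats: (num_frames, H, W, C) deep feature maps, one per frame.
    Returns a vector of length num_predicted * C.
    """
    num_frames, H, W, C = feats.shape
    flat = feats.reshape(num_frames, H * W, C)
    ref = flat[num_context - 1]  # last context frame, shape (H*W, C)
    ref_unit = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    per_frame = []
    for t in range(num_context, num_frames):
        cur = flat[t]
        cur_unit = cur / (np.linalg.norm(cur, axis=1, keepdims=True) + eps)
        # Cosine similarity across channels between every pair of locations.
        sim = ref_unit @ cur_unit.T
        # For each context location, the best matching location in frame t.
        best = sim.argmax(axis=1)
        compensated = cur[best]  # motion compensated features, shape (H*W, C)
        # Per channel cosine similarity across spatial locations.
        num = np.sum(ref * compensated, axis=0)
        den = np.linalg.norm(ref, axis=0) * np.linalg.norm(compensated, axis=0) + eps
        per_frame.append(num / den)
    return np.concatenate(per_frame)

# Toy check: predicted frames are spatially shifted copies of the context frame.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4, 8))
frames = [base, base] + [np.roll(base, shift=s, axis=1) for s in (1, 2, 3)]
feats = np.stack(frames)
mcs = mcs_features(feats, num_context=2)
print(mcs.shape)
```

In the toy check, pure translation is fully explained by the motion compensation, so every MCS value is close to 1; blur, shape distortion or a vanished object would lower the similarity in the affected channels.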
| MSE | 0.4044 ± 0.11 | 0.6578 ± 0.08 | 10.2556 ± 0.86 |
| SSIM [wang2004image] | 0.5274 ± 0.09 | 0.6828 ± 0.07 | 9.9311 ± 0.89 |
| MS-SSIM [wang2003multiscale] | 0.5207 ± 0.09 | 0.6575 ± 0.08 | 10.2248 ± 0.88 |
| Gradient Difference [mathieu2016deep] | 0.4908 ± 0.10 | 0.6838 ± 0.07 | 10.8074 ± 1.04 |
| VGG-19 MSE | 0.5364 ± 0.08 | 0.6403 ± 0.07 | 11.4350 ± 0.97 |
| VGG-19 cosine similarity | 0.6404 ± 0.08 | 0.7506 ± 0.06 | 8.9538 ± 0.72 |
| ST-MAD [vu2011spatiotemporal] | 0.3730 ± 0.12 | 0.6516 ± 0.08 | 10.3446 ± 0.88 |
| VMAF [netflix2020vmaf] | 0.6003 ± 0.09 | 0.7462 ± 0.06 | 9.3609 ± 0.73 |
| BRISQUE [mittal2012no] | 0.0905 ± 0.11 | 0.0942 ± 0.11 | 13.8893 ± 1.27 |
| NIQE [mittal2013making] | 0.0819 ± 0.12 | 0.0698 ± 0.12 | 15.6844 ± 1.09 |
| Inception Score (Entropy of Conditional only) | 0.0828 ± 0.11 | 0.0458 ± 0.10 | 15.4043 ± 1.22 |
| Video BLIINDS [saad2014blind] | 0.4072 ± 0.10 | 0.6200 ± 0.10 | 12.4202 ± 1.14 |
| Li et al. [li2019quality] | 0.6371 ± 0.09 | 0.6504 ± 0.08 | 10.7497 ± 1.12 |
| Baseline - SSA features - 3D ConvNet | 0.4592 ± 0.09 | 0.5042 ± 0.11 | 12.5282 ± 1.73 |
| Baseline - SSA features - ResNet-50 | 0.7188 ± 0.06 | 0.7246 ± 0.06 | 9.4145 ± 0.86 |
| Our Model - VGG-19 | 0.7418 ± 0.06 | 0.8132 ± 0.05 | 7.8710 ± 0.90 |
| Our Model - Inception-v3 | 0.7922 ± 0.06 | 0.8398 ± 0.04 | 7.4590 ± 0.87 |
| Our Model - ResNet-50 | 0.8304 ± 0.04 | 0.8613 ± 0.03 | 6.7791 ± 0.78 |
IV-B Rescaled Frame Difference (RFD) features
The second set of features we design is based on our observation that shape distortions are highly localized in regions containing motion. While optical flow may be used to determine motion masked frames as in [mathieu2016deep], the flow estimates tend to be noisy in predicted videos which contain a variety of artifacts. In order to overcome this challenge, we resort to measuring frame differences between adjacent frames to capture moving regions. However, instead of using such information to mask frames, we rescale the frame differences to the intensity range [0,255] for each color channel and extract deep features from such images. The deep features (from VGG-19, ResNet-50 or Inception-v3) of rescaled frame differences enable robust measurement of shape distortions as argued below.
In Figure 6, we show examples of rescaled frame differences of two predicted videos from our database. We observe that the rescaled frame differences simultaneously capture both the moving regions of frames as well as the changing contours of moving objects. We believe that the visualization of changing contours of moving objects in RFD adds robustness in the design of features along with MCS. We note that RFD resemble sketch images [eitz2012humans] in the manner in which object outlines are visible. Motivated by the success of deep ResNet features in sketch recognition applications [zou2018sketchy], we extract similar features from RFD. We spatially average the deep features from each RFD to get a single feature per channel and then we concatenate the features across all frame differences and channels to get a feature vector whose length equals the number of frame differences times the number of channels.
In order to further understand the relevance of deep features of RFD, we compare them with deep features of frames. Note that deep features of frames typically capture aspects such as object texture, shape, color and so on [zeiler2014visualizing]. However, we observe that in RFD in Figure 6, color and other local properties tend to get suppressed. Thus, the corresponding deep features are primarily sensitive to the shape of the moving objects. In order to study this more carefully, for the videos in Figure 6, we compare the dissimilarity of spatially averaged deep features of frames and RFD between the first context frame and the last predicted frame. For Video 1, we observe that the dissimilarity score (1 - cosine similarity) for RFD features is 0.34, while that of frame features is 0.16. For Video 2, the corresponding scores are 0.43 and 0.27 respectively. This illustrates that the deep features of RFD are more sensitive to variations in object shapes when compared with the features of the frames themselves.
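The dissimilarity score used in this comparison is one minus the cosine similarity of two feature vectors. A minimal sketch is given below; the helper name `cosine_dissimilarity` is hypothetical, and the inputs are assumed to be nonzero, spatially averaged deep features.

```python
import numpy as np

def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two nonzero feature vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

A score of 0 means the two features point in the same direction, and larger scores indicate a larger change between the first context frame and the last predicted frame.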
IV-C Learning naturalness from features
We process the MCS and RFD features separately using different intermediate fully connected (FC) layers of dimension . We then concatenate the outputs of these layers and use a final FC layer to predict the naturalness score. The high-level architecture of our framework is illustrated in Figure 5a. All the videos in our database consist of 4 context frames and 16 predicted frames, leading to a total of 20 frames. Thus, we get . Further, we choose . We train the network with a mean squared error loss and the Adam optimizer with a learning rate of 0.001 for 200 epochs.
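The forward pass of this two-branch design can be sketched in NumPy as below. This is only an illustration under stated assumptions: the ReLU activation, the weight initialization, and all layer sizes in the example are hypothetical choices, and the training loop with Adam is omitted. The names `init_params`, `predict_naturalness` and `mse_loss` are ours for this sketch.

```python
import numpy as np

def init_params(mcs_dim, rfd_dim, hidden_dim, rng):
    """Create weights for two branch FC layers and one output FC layer."""
    def dense(n_in, n_out):
        return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)
    return {
        "mcs": dense(mcs_dim, hidden_dim),   # branch for MCS features
        "rfd": dense(rfd_dim, hidden_dim),   # branch for RFD features
        "out": dense(2 * hidden_dim, 1),     # final layer on concatenated outputs
    }

def predict_naturalness(params, mcs_feat, rfd_feat):
    """Predict one naturalness score per video in the batch."""
    def relu(x):
        return np.maximum(x, 0.0)            # assumed activation
    w, b = params["mcs"]
    h_mcs = relu(mcs_feat @ w + b)
    w, b = params["rfd"]
    h_rfd = relu(rfd_feat @ w + b)
    h = np.concatenate([h_mcs, h_rfd], axis=-1)
    w, b = params["out"]
    return (h @ w + b)[..., 0]

def mse_loss(pred, mos):
    """Mean squared error between predictions and subjective scores."""
    return float(np.mean((np.asarray(pred) - np.asarray(mos)) ** 2))

# Example with hypothetical feature sizes and a batch of 4 videos.
rng = np.random.default_rng(0)
params = init_params(mcs_dim=12, rfd_dim=30, hidden_dim=8, rng=rng)
pred = predict_naturalness(params, rng.normal(size=(4, 12)), rng.normal(size=(4, 30)))
```

Keeping the two branches separate until the final layer lets each feature type be compressed independently before the two are combined.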
V-A Evaluation of Objective Naturalness Measures
We present the evaluation of various measures of naturalness, spanning FR and NR image and video QA indices, existing measures of naturalness, deep features of spatial and spatio-temporal networks and finally our feature design contributions.
V-A1 Existing measures of naturalness
Several QA indices are popularly used to measure video naturalness. Among FR image QA metrics, we evaluate MSE, SSIM [wang2004image], MS-SSIM [wang2003multiscale] and gradient difference [mathieu2016deep]. We also evaluate MSE and cosine similarity in the VGG feature space [lee2018stochastic, kumar2019videoflow] by tapping the features from the fourth convolutional layer of the fifth block (20th layer in the Keras model) of the VGG-19 network [ledig2017photo].
Among NR image QA indices, we evaluate BRISQUE [mittal2012no] and NIQE [mittal2013making] by computing them on each frame and taking their average. We also evaluate a modified version of the Inception Score [salimans2016improved] that can be applied to individual frames. The Inception Score evaluates both the quality of the generated images and whether the generated images match the distribution of a given dataset. Here, we compute only the entropy of the conditional distribution as a measure of the naturalness of individual frames, and average it over the frames.
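The modified Inception Score described above can be sketched as follows. The sketch assumes that the per-frame class probabilities have already been obtained from a pre-trained classifier; the function name `mean_conditional_entropy` and the small constant `eps` are hypothetical details.

```python
import numpy as np

def mean_conditional_entropy(class_probs, eps=1e-12):
    """Average entropy of p(y | frame) over the frames of one video.

    class_probs has shape (T, K): one row of class probabilities per frame.
    """
    p = np.asarray(class_probs, dtype=np.float64)
    per_frame = -np.sum(p * np.log(p + eps), axis=1)   # entropy of each frame
    return float(per_frame.mean())                     # average over frames
```

A low value indicates that the classifier is confident about the content of the frames.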
Among video QA measures, we evaluate FR measures such as ST-MAD [vu2011spatiotemporal] and VMAF v1.5.1 [netflix2020vmaf] and NR indices such as Video BLIINDS [saad2014blind] and the measure by Li et al. [li2019quality]. We train VMAF and both the NR measures on our naturalness database for a fair comparison.
V-A2 Naturalness evaluation using deep features
We present a simple baseline by processing the features extracted from the ResNet-50 [he2016deep] model, pre-trained on the ImageNet-1k [russakovsky2015imagenet] image classification database. We tap the features before the global pooling operation and apply simple spatial averaging (SSA) to get a feature vector of dimension per frame. We then concatenate the features from each frame and feed them to a learning network (consisting of FC layers), similar to our model in Section IV-C.
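The SSA baseline feature can be sketched as below, assuming the convolutional feature maps of all frames are already available as one array. The function name `ssa_video_feature` and the array sizes in the example are hypothetical.

```python
import numpy as np

def ssa_video_feature(feature_maps):
    """Spatially average per-frame feature maps and concatenate across frames.

    feature_maps has shape (T, H, W, C); the result has length T * C.
    """
    fm = np.asarray(feature_maps, dtype=np.float64)
    per_frame = fm.mean(axis=(1, 2))     # (T, C): one value per channel per frame
    return per_frame.reshape(-1)         # frame features placed end to end

# Example: 20 frames with hypothetical 7x7 maps of 16 channels.
video_feature = ssa_video_feature(np.ones((20, 7, 7, 16)))
```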
Additionally we present another baseline, using features from the pre-trained 3D ConvNet (C3D) model [tran2015learning], successfully used in action recognition on videos. We resize the input frames to a resolution of 112x112, tap spatio-temporal features before the last pooling layer, and process them through FC layers as described above. While ResNet-50 is trained on images, C3D is directly trained on videos.
V-A3 Our model
We evaluate our model for naturalness evaluation based on MCS and RFD features using different networks such as VGG-19 [simonyan2015very], ResNet-50 [he2016deep] and Inception-v3 [szegedy2016rethinking], all of which are pre-trained on the ImageNet-1k [russakovsky2015imagenet] image classification database. We tap features from the last convolutional layer before the FC layers. This results in a choice of for VGG-19, ResNet-50 and Inception-v3 networks respectively.
We use the pre-trained models provided by the Keras Python package, which is now part of the TensorFlow library. We note that the weights of the pre-trained models are updated in newer versions of the library, and hence the values quoted in this paper may differ across versions of the TensorFlow package. For our experiments, we use version 2.0 of TensorFlow.
V-A4 Performance Evaluation
We evaluate the different naturalness indices using the Spearman rank order correlation coefficient (SROCC), the Pearson linear correlation coefficient (PLCC) and the root mean squared error (RMSE), which are popularly used in the QA literature [seshadrinathan2010study]. In order to evaluate PLCC and RMSE, a non-linear function is fitted to predict the MOS from the objective scores for objective measures that are not trained on our database. All the results are obtained by splitting the dataset into training and testing sets in the ratio 80:20 over 100 iterations and computing the median performance. For measures that require no training on our database, for a fair comparison, we evaluate the performance measures on the corresponding test sets of each iteration.
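The protocol above can be sketched for a measure that requires no training as follows. This is a simplified illustration: the non-linear fit used before computing PLCC and RMSE is omitted, so those two values are computed on raw scores here, and the random splitting scheme, seed and function names are assumptions.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    x = np.asarray(x, dtype=np.float64)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=np.float64)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def plcc(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def srocc(a, b):
    # Spearman correlation is the Pearson correlation of the ranks.
    return plcc(average_ranks(a), average_ranks(b))

def rmse(a, b):
    a, b = np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def median_performance(scores, mos, n_splits=100, test_frac=0.2, seed=0):
    """Median |SROCC|, |PLCC| and RMSE over the test sets of random splits."""
    scores = np.asarray(scores, dtype=np.float64)
    mos = np.asarray(mos, dtype=np.float64)
    rng = np.random.default_rng(seed)
    n_test = max(2, int(round(test_frac * len(mos))))
    results = []
    for _ in range(n_splits):
        test_idx = rng.permutation(len(mos))[:n_test]   # held-out 20%
        s, m = scores[test_idx], mos[test_idx]
        results.append((abs(srocc(s, m)), abs(plcc(s, m)), rmse(s, m)))
    return tuple(np.median(np.array(results), axis=0))
```

For trained measures, the model would additionally be fitted on the remaining 80% of the videos inside each iteration before scoring the test set.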
The results of our experiments are presented in Table III. We only show the magnitude of PLCC and SROCC in the table. We see that among the FR measures, VGG-19 cosine similarity achieves the best performance in terms of correlation with the subjective scores. We believe that the normalization implicit in the computation of the cosine similarity makes it perform better than VGG-19 MSE. We notice similar performance of SSIM and MS-SSIM measures, perhaps due to the lower resolution of videos in our database. NR image QA indices and Inception Score seem to correlate poorly with human perception while Video BLIINDS performs better than these indices.
On the other hand, deep features of pre-trained networks extracted from video frames tend to achieve better performance. In particular, they outperform Video BLIINDS and the model in [li2019quality], which are also trained on our database. We believe that the superior performance of deep features over QA methods is due to their ability to extract high-level features, in contrast to QA methods, which typically employ low-level features. We note that the poor performance of the 3D ConvNet (C3D) model may be attributed to the training of this model on action recognition. Thus, the resulting features may not capture the spatial distortions in video frames. Finally, we observe that our model based on MCS and RFD features performs significantly better than all other measures of naturalness, with improved performance in terms of all evaluation measures. The lower standard deviation of its performance numbers across splits, when compared to other methods, also suggests that our model achieves consistently high performance across splits.
V-B Ablations and Extended Experiments
V-B1 Contribution of individual components
Since our model involves two components, the MCS and RFD features, we study the impact of each of the components in Figure 7. We perform this experiment on our model trained on ResNet-50 features, which achieved the best performance. We note that RFD features perform better than frame features. Further, we see that the combination of the MCS and RFD features leads to a significant improvement in performance. Finally, we note that the MCS features are more useful than the spatially averaged deep features when combined with the RFD features, and in Section V-B2 we show that MCS features perform better than SSA features with limited training data.
V-B2 Robustness with less training data
We also evaluate the robustness of our model with respect to the amount of training data. For a given split of the dataset into training and testing sets in the ratio 80:20, we build a series of training sets starting with 10% of the videos and adding 10% more videos in each step. We then evaluate the performance of our model when trained with these subsets, as shown in Figure 8. We note that the test data is kept constant across all steps, and in each step the scores are computed as the median performance across 100 splits. For comparison, we also show the performance of other benchmarks and baselines. We observe that our model, trained with just 10% of the videos in our database, outperforms all existing measures of naturalness. Note that the VGG-19 cosine similarity achieves a constant performance as it is not a training-based algorithm. Further, we note that our model consistently performs better than other models as the amount of training data increases.
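The construction of the growing training sets can be sketched as below. The sketch assumes that each step adds 10% of the total number of videos in the database, drawn from the training portion of the split, and that the subsets are nested. The function name `nested_training_subsets` and its parameters are hypothetical.

```python
import random

def nested_training_subsets(train_ids, total_videos, step=0.10, seed=0):
    """Return nested training subsets that grow by `step` of the database."""
    ids = list(train_ids)
    random.Random(seed).shuffle(ids)               # fixed order for one split
    chunk = max(1, round(step * total_videos))     # videos added in each step
    subsets = []
    size = chunk
    while size <= len(ids):
        subsets.append(ids[:size])                 # each subset extends the previous
        size += chunk
    return subsets

# Example: 100 videos in total, 80 of them in the training portion.
subsets = nested_training_subsets(range(80), total_videos=100)
```

Because every subset is a prefix of the next one, the model at each step sees all of the earlier training videos plus the newly added ones, while the test set stays fixed.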
V-B3 Performance on stochastic videos
We now present a couple of examples to support our argument in Section I that the inherent stochasticity of the future may reduce the effectiveness of full reference measures. In Figure 9, we show two examples of ground truth and predicted videos, along with the scores of various full reference measures and our model. In Predicted Video 1, we see the disappearance of the robotic arm, which is highly unnatural. The movement of the robotic arm in Predicted Video 2 is completely natural, but it differs from that in Ground Truth 2. From the scores shown, we see that all full reference measures fail to capture the naturalness of the videos, indicating that Predicted Video 1 is more natural than Predicted Video 2, whereas our model is consistent with human opinion. Further, we evaluate various naturalness measures on the stochastically predicted videos of our database in Table IV. We observe that the performance of the full reference measures is much poorer than that of the no reference measures. Thus, we conclude that no reference measures are better equipped to measure naturalness than full reference measures.
|VGG-19 cosine similarity||0.4549|
|Li et al. [li2019quality]||0.7165|
|Baseline (SSA) - ResNet-50||0.7077|
|Our Model - ResNet-50||0.7912|
We build a naturalness evaluation database for video prediction models. Our subjective study and benchmarking experiments reveal that current measures of naturalness do not correlate very well with human perception. We show that the MCS and RFD features we introduce can capture naturalness of predicted videos very well and outperform all the existing measures of naturalness. We believe that our database will be particularly useful in further research in this area and help design improved models for video prediction.
Our work in establishing that naturalness can be assessed reliably by human subjects sets the stage for much larger human studies on more videos, potentially using crowdsourcing. We largely focused on predicted videos based on generative models. It will be of interest to study the naturalness of other synthetically generated videos, such as those in gaming scenarios. Moreover, it will be interesting to understand the role of physics engines in video prediction and naturalness evaluation [janner2018reasoning]. Finally, we primarily looked at a supervised setting by learning naturalness from human scores. It will also be interesting to explore unsupervised measures of video naturalness that can be designed by merely having access to a large corpus of natural videos.
The authors would like to thank all the volunteers who took part in the subjective study.