With the advances of imaging, communications, and internet technologies, public online video-sharing services (e.g., YouTube, Vimeo) have become popular. In such services, a wide range of video content from user-generated amateur videos to professionally produced videos, such as movie trailers and music videos, is uploaded and shared. Today, online video sharing has become the most considerable medium for producing and consuming multimedia content for various purposes, such as fun, information exchange, and promotion.
For the success of video-sharing services, it is important to consider users’ quality of experience (QoE) regarding shared content, as in many other multimedia services. As the first step of maximizing QoE, it is necessary to measure perceptual quality of the online videos. The quality information of videos can be used for valuable service components such as automatic quality adjustment, streaming quality enhancement, and quality-based video recommendation. The most accurate way to measure perceptual video quality is to conduct subjective quality assessment by employing multiple human subjects. However, subjective quality assessment is not feasible for online videos because of a tremendous amount of videos in online video-sharing services. An alternative is objective quality assessment, which uses a model that mimics the human perceptual mechanism.
The traditional objective quality metrics for estimating video quality have two limitations. First, they depend heavily on the existence of the reference video, i.e., the pristine version of the given video. In general, objective quality assessment frameworks are classified into three groups: full-reference (FR), reduced-reference (RR), and no-reference (NR). In the cases of FR and RR frameworks, full or partial information about the reference is provided. On the other hand, NR quality assessment does not use any prior information about the reference video, which makes the problem more complicated. In fact, the accuracy of NR objective metrics is usually lower than that of FR and RR metrics. Second, the types of degradation that are dealt with are rather limited. Video quality is affected by a large number of factors, for which the human perceptual mechanism varies significantly. Because of this variability, it is too complicated to consider all different video quality factors in a single objective quality metric. Hence, existing objective quality metrics have considered only a single or a small number of major quality factors involved in production and distribution, such as compression artifacts, packet loss artifacts, and random noise, assuming that the original video has perfect quality. This approach has been successful for professionally produced videos.
However, it is doubtful whether the current state-of-the-art approaches for estimating video quality are also suitable for online videos. First, for a given online video, the corresponding reference video is not available in most cases, where NR assessment is the only option for objective quality evaluation. The performance of existing NR metrics is still unsatisfactory, which makes the quality assessment of online videos very challenging. Second, online video-sharing services cover an extremely wide range of videos. There are two types of videos in online video-sharing services: professional and amateur. Professional videos, which are typically created by professional video makers, and amateur videos, which are created by general users, are significantly different in various aspects such as content and production and editing styles. In particular, user-generated videos have large variations in these characteristics, so they have wide ranges of popularity, user preference, and quality. Moreover, diverse quality factors are involved in online user-generated videos (see Section II for further details). However, existing NR metrics have been developed to work only for certain types of distortion due to compression, transmission error, random noise, etc. Therefore, it is not guaranteed that the existing NR metrics will perform well on those videos.
Online videos are usually accompanied by additional information, called metadata, including the title, description, viewcount, rating (e.g., like and dislike), and comments. Some of the metadata of an online video (e.g., the spatial resolution and title) reflect the characteristics of the video signal itself, while other metadata, including the viewcount or comments, provide information about the popularity of and users’ preference for the video. These types of information have the potential to be used as hints about the quality of the video because quality is one of the factors that affect the perceptual preference of viewers. Therefore, they can be useful for the quality assessment of online user-generated videos, either replacing objective quality metrics or being used in combination with them.
This paper deals with the issue of evaluating the perceptual quality of online user-generated videos. The research questions considered are:
Are there any noteworthy patterns regarding viewers’ judgment of the relative quality of user-generated videos?
How well do existing state-of-the-art NR objective quality metrics perform for user-generated videos?
To what extent are metadata-based metrics useful for the perceptual quality estimation of user-generated videos?
What makes the signal-based or metadata-based quality estimation of user-generated videos difficult?
To the best of our knowledge, our work is the first attempt to investigate the issue of the perceptual quality assessment of online user-generated videos comprehensively in various aspects. Our contributions can be summarized as follows. First, by examining subjective ratings gathered by crowdsourcing for online user-generated videos, we investigate the viewers’ patterns of perceptual quality evaluation. Second, we analyze the performance of state-of-the-art NR quality assessment algorithms, metadata-driven features, and their combination in perceptual quality estimation. This analysis aims to assess the efficacy and limitations of the signal-based and metadata-based methods. Finally, based on the experimental results, various issues in the quality assessment of online user-generated videos are discussed in detail. We comment on the difficulties and limitations of the quality assessment of user-generated videos in general and provide particular examples demonstrating such difficulties and limitations, which helps us better understand the nature of the quality assessment of online videos.
The rest of the paper is organized as follows. Section II describes the background of this study, i.e., visual quality assessment, characteristics of online videos, and previous approaches to the quality assessment of online videos. Section III introduces the dataset used in our study, including video data and subjective data. In Section IV, patterns of user perception of online videos are examined via graph analysis. Section V presents the results of quality estimation using NR quality assessment algorithms and metadata. Section VI discusses issues of the quality assessment of online user-generated videos. Finally, Section VII concludes the paper.
II-A Visual Quality Assessment
The overall QoE of a video service highly depends on the perceptual visual quality of the videos provided by the service. One way to score the quality of videos is to have the videos evaluated by human subjects, which is called subjective quality assessment. For many practical multimedia applications, quality assessment with human subjects is not applicable due to cost and real-time operation constraints. To deal with this, research has been conducted to develop automatic algorithms that mimic the human perceptual mechanism, which is called objective quality assessment.
Objective quality assessment metrics are classified into three categories: FR, RR, and NR metrics. FR quality assessment uses the entire reference video, which is the original signal without any distortion or quality degradation. Structural similarity (SSIM), multi-scale SSIM (MS-SSIM), most apparent distortion (MAD), and visual information fidelity (VIF) are well-known FR quality metrics for images, and motion-based video integrity evaluation (MOVIE) and spatiotemporal MAD (ST-MAD) are FR metrics for videos. RR metrics do not need the whole reference signal, but use its partial information. Reduced-reference entropic differencing (RRED) for images and video quality metric (VQM) for videos are examples of RR quality metrics.
A challenging situation of objective quality assessment is when there is no reference for the given signal being assessed. Estimating quality from only the given image or video itself is hard, since no prior knowledge of the reference can be utilized. Currently available NR metrics include the blind image integrity notator using discrete cosine transform statistics (BLIINDS-II), the blind/referenceless image spatial quality evaluator (BRISQUE), and the Video BLIINDS (V-BLIINDS). These metrics typically use natural scene statistics (NSS) as prior knowledge of images and videos, and the main difference among them lies in how to obtain information about NSS. BLIINDS-II constructs NSS models from the probability distribution of discrete cosine transform (DCT) coefficients extracted from macroblocks. BRISQUE uses mean-subtracted contrast-normalized (MSCN) coefficients rather than transform domain coefficients to speed up the quality assessment process. V-BLIINDS extracts NSS features based on a DCT-based NSS model as in BLIINDS-II, but uses the frame difference to obtain the spatiotemporal information of the video. Additionally, V-BLIINDS uses motion estimation techniques to examine motion consistency in the video.
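As a rough illustration of the MSCN coefficients mentioned above, the following sketch normalizes each pixel by a local mean and standard deviation. It is a simplified stand-in, not the BRISQUE implementation: the uniform square window, the `radius` parameter, and the stabilizing constant `c` are assumptions made for brevity (a Gaussian-weighted window is the usual choice).

```python
def mscn(image, radius=1, c=1.0):
    """Mean-subtracted contrast-normalized coefficients of a grayscale image.

    `image` is a list of rows of numbers. The local statistics use a uniform
    (2*radius+1)^2 window clipped at the borders, which is an assumption of
    this sketch.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[yy][xx]
                      for yy in range(max(0, y - radius), min(h, y + radius + 1))
                      for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            mu = sum(window) / len(window)
            sigma = (sum((v - mu) ** 2 for v in window) / len(window)) ** 0.5
            # Subtract the local mean and divide by the local contrast.
            out[y][x] = (image[y][x] - mu) / (sigma + c)
    return out


# Hypothetical 3x3 image with a single bright pixel in the center.
spot = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
coeffs = mscn(spot)
print(coeffs[1][1], coeffs[0][0])
```

NSS-based metrics then fit a statistical model to the distribution of such coefficients; that modeling step is omitted here.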
II-B Characteristics of Online Videos
|Channel||Step||Types of degradation factors|
|Video||Acquisition||Limited spatial/temporal resolution, misfocusing, blur, jerkiness, camera shaking, noise, occlusion, insufficient/excessive lighting, poor composition, poor color reproduction|
|Video||Processing/editing||Bad transition effect (e.g., fade in/out, overlap), harming caption (e.g., title screen, subtitle), frame-in-frame, inappropriate image processing|
|Video||Uploading||Video compression artifacts, temporal resolution reduction, spatial resolution loss|
|Audio||Acquisition||Device noise, environmental noise, too-loud or too-low volume, incomprehensible language|
|Audio||Processing/editing||Inappropriate background music, audio-video desynchronization, unsuitable sound effects, audio track loss|
|Audio||Uploading||Audio compression artifacts|
|Video & Audio||Content||Boredom, violence, sexuality, harmful content|
|Video & Audio||Delivery||Buffering, packet loss, quality fluctuation|
In online video-sharing services, both user-generated videos and professional videos are shared. In terms of quality, they have significantly different characteristics, especially in filming and editing. Many of the makers of user-generated videos do not have professional knowledge of photography and editing, so quality degradation factors can easily be involved in every step, from the acquisition to the distribution of videos.
Table I presents a list of observable quality degradation factors in online user-generated videos. They are grouped with respect to channels affected by degradation (i.e., video, audio, and audio-video) and steps involving degradation.
Visual factors in the acquisition step consist of problems with equipment, camera skills, and environments. In particular, previous work has reported that typical artifacts in user-generated videos include camera shake, harmful occlusions, and camera misalignment. Visual quality degradation also occurs during video processing and editing, due to either the editor’s lack of knowledge of photography and video making or the editor’s intent. For example, scene transition effects, captions, and frame-in-frame effects, where a small image frame is inserted in the main video frame, may degrade visual quality. Image processing (e.g., color modification, sharpening) applied during editing may not be pleasant to viewers. In the uploading step, the system or the uploader may compress the video or modify the spatial and temporal resolutions of the video, which may introduce compression artifacts, visual information loss, or motion jerkiness.
Audio quality degradation can also occur at each step of acquisition, processing and editing, and uploading. Some of the audio quality factors involved in the acquisition and uploading steps are similar to the visual quality factors (equipment noise and compression, etc.). Moreover, the language used in the recorded speech may have a negative effect on perception when a viewer does not understand the language. In the processing and editing step, inserting inappropriate sound sources, background music, or sound effects may decrease user satisfaction. Loss of the audio track may be a critical issue when the content significantly depends on the sound.
Some quality factors related to the characteristics of the content or communication environment apply to both audio and video channels. First, the content of a video can be a problem. Boring, violent, sexual, and harmful content can spoil the overall experience of watching the video. Second, the communication environment from the server to a viewer is not always guaranteed, so buffering, packet loss, and quality fluctuation may occur, which are critical in streaming multimedia content.
Content in online video-sharing services is usually accompanied by information reflecting uploading and consumption patterns, provided by uploaders and viewers, which is called metadata. The metadata of a video clip, either assigned by the uploader or automatically extracted from the video, include the title, information about the uploader, upload date, duration, video format, and category. Metadata determined by viewers include the viewcount, comments, and ratings. One can analyze metadata to discover the production and consumption patterns of online videos and to improve the quality of service. Moreover, information from metadata (e.g., video links and subscribers) can be used to construct a social network consisting of online videos. Analysis of the social network can be used for content recommendation and for investigating the network topology of online video sharing, especially the evolution of online video communities.
II-C Quality Assessment of Online Content
There are few studies that consider the particular characteristics of online images and videos. One existing method estimates the quality of online videos using motion estimation, temporal factors to evaluate jerkiness and unstableness, spatial factors (including blockiness, blurriness, dynamic range, and intensity contrast), and video editing styles (including shot length distribution, width, height, and black side ratio). Since these features depend on the genre of video content, robustness is not guaranteed, as has been pointed out in the literature. Another work predicted the quality of user-generated images using their social link distribution, the tone of viewer comments, and access logs from other websites. It was discovered that social functionality is more important in determining the quality of user-generated images in online environments than the distortion of the images themselves. Our work deals with videos, which are more challenging for quality assessment than images. In comparison to the aforementioned prior work, we conduct a more comprehensive and thorough analysis of the issue of the quality assessment of online user-generated videos based on state-of-the-art video quality metrics and metadata-driven metrics.
III Video and Subjective Dataset
|ID||Description||Quality degradation factors|
|1||A hand drawing portraits on paper with pencil and charcoal from scratch||Fast motion|
|2||A man drawing on the floor with chalk||Time lapse|
|3||A series of animal couples showing friendship||Blur, compression artifacts|
|4||Procedure to cook cheese sandwich (close-up of the food)||Misfocusing|
|5||Two men imitating animals eating food||Compression artifacts, jerkiness|
|6||A baby swimming in a pool||Camera shaking|
|7||Escaping from a chase (first-person perspective)||Fisheye lens effect|
|8||Nature scenes including mountain, cloud, and sea||Blur, compression artifacts|
|9||Cheering university students (shot by a camera moving around a campus)||Camera shaking, captions|
|10||A red fox in a cage||Blur, camera shaking|
|11||Cats and kittens in a house||Blur, misfocusing|
|12||A crowd dancing at a station||Poor color reproduction|
|13||Seven people creating rhythmic sounds with a car||Camera shaking, misfocusing|
|14||People dancing at a square||Camera shaking|
|15||A group of children learning to cook||Jerkiness, misfocusing|
|16||A slide show of nature landscape pictures||Blur, compression artifacts|
|17||Soldiers patrolling streets||Camera shaking, compression artifacts|
|18||A sleeping baby and a cat (close-up shot)||Camera shaking, compression artifacts|
|19||A baby smiling at the camera||Poor color reproduction, compression artifacts|
|20||A man playing a video game||Frame-in-frame, jerkiness|
|21||Cats having fun with water||Blur, camera shaking, captions|
|22||A man playing a video game||Frame-in-frame, jerkiness|
|23||Twin babies talk to each other with gestures||Camera shaking, compression artifacts|
|24||Walking motion from the walker’s viewpoint||Compression artifacts, camera noise|
|25||A man sitting in a car and singing a song||Compression artifacts, misfocusing|
|26||A man playing with a pet bear and a dog on the grass||Misfocusing, shaking image frame|
|27||Pillow fight on a street||Insufficient lighting, varying illumination|
|28||Kittens playing with each other||Blur, weak compression artifacts|
|29||People dancing and cheering outside||Compression artifacts, packet loss, excessive lighting|
|30||A baby laughing loudly||Camera noise, compression artifacts, captions|
|31||A man breakdancing in a fitness center||Packet loss, compression artifacts, camera noise|
|32||Cheerleading in a basketball court||Compression artifacts, misfocusing|
|33||Microlight flying (first-person perspective)||Blur, compression artifacts, misfocusing, packet loss|
|34||A man participating in parkour and freerunning||Compression artifacts, misfocusing, camera shaking,|
|35||Three people singing in a car||Camera shaking, compression artifacts, blur|
|36||Street posing performance||Camera shaking, occlusion, blur|
|37||A puppy and a kitten playing roughly||Low frame rate, compression artifacts, blur, jerkiness|
|38||Exploring a dormitory building (first-person perspective)||Camera shaking, compression artifacts, misfocusing|
|39||Shopping people at a supermarket||Compression artifacts, captions, blur|
|40||A man working out in a park||Low frame rate, insufficient lighting, blur, varying illumination, poor color reproduction|
|41||People in a hotel lobby||Camera shaking, misfocusing, occlusion|
|42||Bike trick performance||Compression artifacts, blur, misfocusing, jerkiness|
|43||A man cooking and eating chicken||Camera shaking|
|44||A baby walking around with his toy cart at home||Vertical black sides, camera shaking, varying illumination|
|45||Men eating food||Camera shaking, misfocusing, compression artifacts|
|46||An old singing performance clip||Camera noise, poor color reproduction, compression artifacts, blur|
|47||A man doing sandsack training||Compression artifacts, jerkiness, camera noise|
|48||A series of short clips||Poor color reproduction, varying illumination, compression artifacts, camera noise|
|49||Two men practicing martial arts||Black line running, poor color reproduction, jerkiness, blur, packet loss|
|Max resolution (height)||144||1080||-||480|
|Channel description length||0||911||91||0|
Days until the upload date since YouTube was activated (14 Feb. 2005).
In this section, we introduce the video and subjective dataset used in this work. We use an existing dataset, for which 50 user-generated amateur videos and their metadata were collected from YouTube via keyword search, and subjective quality ratings for the videos were obtained from a crowdsourcing-based subjective quality assessment experiment. The description and observed quality degradation factors of the videos are presented in Table II.
Since the metadata originally collected with the dataset were rather dated and limited, we collected more detailed and recent metadata for the videos using the YouTube Data API. For each video, the following metadata were gathered: the maximum spatial resolution, upload date, video length, video viewcount, video likes, video dislikes, video favorites, video comments, video description length, channel viewcount, channel dislikes, channel favorites, channel comments, channel description length, and number of uploaded videos of the uploader. The collection process for metadata was conducted in March 2014. While collecting recent metadata, we found that one video in the original dataset had been deleted, so it was excluded from our experiment. Table III presents the metadata statistics of the videos. It can be seen that the dataset covers a wide range of online user-generated videos from various viewpoints of production and consumption characteristics.
The subjective ratings in the dataset were collected based on the paired comparison methodology. Subjects were recruited from Amazon Mechanical Turk. A web page showing two randomly selected videos from the dataset in a side-by-side manner was used for quality comparison. Subjects were asked to choose the video with the better visual quality. To prevent cheating and false ratings, subjects had to play both videos before entering their ratings. In total, 8,471 paired-comparison results were obtained. Each video was shown 332 times, and each pair was matched 6.78 times on average.
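The paired-comparison results described above can be aggregated into a matrix of winning counts, from which per-pair winning rates follow. The sketch below illustrates this under assumed conventions: votes are given as hypothetical `(preferred, other)` index pairs, and the function names are chosen for illustration only.

```python
def winning_counts(votes, n_videos):
    """Build a[i][j] = number of times video i was preferred to video j."""
    a = [[0] * n_videos for _ in range(n_videos)]
    for winner, loser in votes:
        a[winner][loser] += 1
    return a


def winning_rate(a, i, j):
    """Observed winning rate of video i against video j.

    Returns None when the pair was never compared, since the rate is
    undefined in that case.
    """
    total = a[i][j] + a[j][i]
    return a[i][j] / total if total else None


# Hypothetical votes for three videos indexed 0..2.
votes = [(0, 1), (0, 1), (1, 0), (2, 1), (0, 2)]
a = winning_counts(votes, 3)
print(a)
print(winning_rate(a, 0, 1))
```

Because pairs are drawn at random, some pairs may be compared more often than others or not at all, which is why the count matrix is kept alongside the rates.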
IV Graph-based Subjective Data Analysis
The subjective paired comparison data forms an adjacency matrix representing a set of edges of a graph $G = (V, E)$. Here, $V$ is the set of nodes corresponding to the videos, and $E$ is the set of weighted directed edges, where each weight is the winning count, i.e., the number of times a video is preferred to another one. Therefore, it is possible to apply graph theory to analyze the subjective data, with the aim of obtaining further insight into viewers’ patterns of quality perception. In this section, two techniques are adopted: HodgeRank analysis and graph clustering.
IV-A HodgeRank Analysis
Fig. 1: HodgeRank decomposition of the subjective data: (a) total, (b) global part, (c) curl part, and (d) harmonic part. Each axis represents the videos sorted by the global score of HodgeRank. In each figure, only the lower triangular part below the diagonal axis is shown because the matrices are skew-symmetric.
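The HodgeRank framework decomposes imbalanced and incomplete paired-comparison data into the quality scores of video stimuli and the inconsistency of subjects’ judgments. In HodgeRank analysis, the statistical rank aggregation problem is posed, which finds the global score $s = [s_1, \ldots, s_n]^T \in \mathbb{R}^n$, where $n$ is the number of video stimuli, as the solution of

$$\min_{s \in \mathbb{R}^n} \sum_{i,j} w_{ij} \left( s_i - s_j - \bar{Y}_{ij} \right)^2,$$

where $w_{ij}$ is the number of comparisons between stimuli $i$ and $j$, and $s_i$ and $s_j$ are the quality scores of stimuli $i$ and $j$, respectively, which are considered as mean opinion scores (MOS). $\bar{Y}_{ij}$ is the $(i,j)$th element of $\bar{Y}$, which is the subjective data matrix derived from the winning rates in the original graph of paired comparison.

Here, $\hat{\pi}_{ij}$ is the observed winning rate of stimulus $i$ against stimulus $j$, which is defined as

$$\hat{\pi}_{ij} = \frac{a_{ij}}{a_{ij} + a_{ji}},$$

where $a_{ij}$ is the number of counts where stimulus $i$ is preferred to stimulus $j$. Note that $\bar{Y}$ is skew-symmetric.

The converted subjective data matrix can be uniquely decomposed into three components as follows, which is called HodgeRank decomposition:

$$\bar{Y} = \bar{Y}^{g} + \bar{Y}^{c} + \bar{Y}^{h},$$

where the global part $\bar{Y}^{g}$, the curl part $\bar{Y}^{c}$, and the harmonic part $\bar{Y}^{h}$ satisfy the following conditions:

$$\bar{Y}^{g}_{ij} = \hat{s}_i - \hat{s}_j,$$

$$\bar{Y}^{h}_{ij} + \bar{Y}^{h}_{jk} + \bar{Y}^{h}_{ki} = 0 \quad \text{for every triangle } (i, j, k) \text{ of the graph},$$

$$\sum_{j} w_{ij} \bar{Y}^{h}_{ij} = 0 \quad \text{for every node } i,$$

where $\hat{s}_i$ and $\hat{s}_j$ are the estimated scores for stimuli $i$ and $j$, respectively.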
The global part determines the overall flow of the graph, which is formed by score differences. The curl part indicates the local (triangle) inconsistency (i.e., the situation where stimulus $i$ is preferred to stimulus $j$, stimulus $j$ is preferred to stimulus $k$, and stimulus $k$ is preferred to stimulus $i$ for different $i$, $j$, and $k$). The harmonic part represents the inconsistency caused by cyclic ties involving more than three nodes, which corresponds to the global inconsistency.
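To make the rank aggregation step concrete, the following sketch estimates global scores by weighted least squares from a matrix of winning counts and returns the residual that the global part cannot explain. Several points are assumptions of this sketch rather than details taken from the analysis above: the linear mapping from winning rate to preference value (`2 * rate - 1`), the function name `hodgerank_global`, the requirement that the comparison graph be connected, and the fact that the residual is not split further into curl and harmonic parts.

```python
def hodgerank_global(a):
    """Weighted least-squares global scores from winning counts a[i][j].

    Returns (scores, residual), where residual[i][j] is the part of the
    preference value not explained by the score difference.
    """
    n = len(a)
    w = [[a[i][j] + a[j][i] for j in range(n)] for i in range(n)]
    # Assumed mapping: winning rate in [0, 1] -> skew-symmetric value in [-1, 1].
    y = [[(2 * a[i][j] / w[i][j] - 1) if w[i][j] else 0.0 for j in range(n)]
         for i in range(n)]
    # Normal equations L s = b of min sum_{i,j} w_ij (s_i - s_j - y_ij)^2,
    # where L is the weighted graph Laplacian.
    lap = [[sum(w[i]) if i == j else -w[i][j] for j in range(n)] for i in range(n)]
    b = [sum(w[i][j] * y[i][j] for j in range(n)) for i in range(n)]
    # Scores are defined up to a constant shift: pin the last one to 0.
    m = n - 1
    aug = [lap[i][:m] + [b[i]] for i in range(m)]
    for col in range(m):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(m):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * p for x, p in zip(aug[r], aug[col])]
    s = [aug[i][m] / aug[i][i] for i in range(m)] + [0.0]
    mean = sum(s) / n
    s = [v - mean for v in s]  # re-center so that the scores sum to zero
    residual = [[(y[i][j] - (s[i] - s[j])) if w[i][j] else 0.0
                 for j in range(n)] for i in range(n)]
    return s, residual


# Hypothetical consistent data: video 0 beats 1, and 1 beats 2.
scores, residual = hodgerank_global([[0, 3, 4], [1, 0, 3], [0, 1, 0]])
print([round(v, 3) for v in scores])
```

For a purely cyclic preference pattern (0 over 1, 1 over 2, 2 over 0), the estimated scores are all equal and the whole preference matrix ends up in the residual, which is the triangle inconsistency described above.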
Fig.1 shows the results of the HodgeRank decomposition applied to the subjective data. The overall trend shown in Fig. 1(a) is that the total scores decrease (darker color) for elements closer to the lower left corner, showing that the perceived superiority of one video against another becomes clear as their quality difference increases. This is reflected in the global part in Fig. 1(b), i.e., the absolute values of the matrix elements increase (darker color).
The ratio of total inconsistency, computed from the Frobenius norms of the decomposed matrices, is 67% for the subjective data. That is, the amount of inconsistency, including local inconsistency and global inconsistency, is larger than that of the global flow of the graph. Between the two sources of inconsistency, the amount of the harmonic component is far smaller than that of the curl component, as can be seen from the scale of the color bar in Fig. 1(d). This implies that it is easy for human subjects to determine quality superiority between videos with significantly different ranks (i.e., quality scores), while determining preference for videos where quality is ranked similarly is relatively difficult. This will be discussed further in Section VI.
IV-B Graph Clustering
The HodgeRank analysis showed that videos with similar ranks in the MOS are subjectively ambiguous in terms of quality. Therefore, one may hypothesize that the videos can be grouped in such a way that different groups have distinguishable quality differences, while videos in each group have a similar quality. We attempt to examine if this is the case and, if so, how many groups can be found via graph clustering.
We use a graph clustering algorithm whose objective is to divide the whole graph represented by the adjacency matrix $A$ into $K$ groups, $C_1, \ldots, C_K$, by maximizing the modularity measure $Q$:

$$Q = \frac{1}{W} \sum_{k=1}^{K} \sum_{i,j \in C_k} \left( A_{ij} - \frac{d_i^{\mathrm{out}} d_j^{\mathrm{in}}}{W} \right),$$

where $W$ is the sum of all edge weights in the graph, and $d_i^{\mathrm{in}}$ and $d_i^{\mathrm{out}}$ represent the in-degree and out-degree of the $i$-th node, respectively.

This algorithm is based on the random walk with restart (RWR) model. It first computes the relevance matrix $R$, which is the estimated result matrix of the RWR model, from the transition matrix, which equals $A$:

$$R = \alpha \left( I - (1 - \alpha) \tilde{A} \right)^{-1},$$

where $\alpha$ is the restart probability (an adjustable parameter in the RWR model), $I$ is an identity matrix, and $\tilde{A}$ is the column-normalized version of $A$. The algorithm then consists of two steps. First, starting with an arbitrary node, it repeatedly adds the node with the largest single compactness measure to the local cluster, until the single compactness measure does not increase. The single compactness of a node with respect to a local cluster is computed from the elements of $R$, normalized using the total sum of the elements of $R$ and its row and column sums. Once the construction of a local cluster from a node is finished, the algorithm starts from another node that has not yet been assigned to any local cluster to make another local cluster. This process is repeated until all nodes are assigned to local clusters. After constructing the local clusters, the algorithm merges the compact clusters by maximizing the increase of the total modularity $Q$ of the clusters in a greedy manner until there is no further increase of $Q$, which results in the final clusters.
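Two building blocks of this procedure, the RWR relevance scores and the modularity of a given partition, can be sketched as follows. This is an illustrative sketch and not the clustering algorithm itself: the compactness-based cluster growth and the greedy merging are omitted, the relevance scores are approximated by fixed-point iteration instead of a matrix inverse, and the function names and the toy graph are hypothetical.

```python
def rwr_relevance(adj, restart, iters=200):
    """Approximate RWR relevance; column k holds scores for restarts at node k."""
    n = len(adj)
    col_sum = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    # Column-normalized transition matrix (all-zero columns stay zero).
    t = [[adj[i][j] / col_sum[j] if col_sum[j] else 0.0 for j in range(n)]
         for i in range(n)]
    rel = [[0.0] * n for _ in range(n)]
    for k in range(n):
        r = [1.0 if i == k else 0.0 for i in range(n)]
        for _ in range(iters):
            # One step of the walk, then restart at node k with prob. `restart`.
            r = [(1 - restart) * sum(t[i][j] * r[j] for j in range(n))
                 + (restart if i == k else 0.0) for i in range(n)]
        for i in range(n):
            rel[i][k] = r[i]
    return rel


def modularity(adj, labels):
    """Directed, weighted modularity of a partition (one label per node)."""
    n = len(adj)
    total = sum(map(sum, adj))
    out_deg = [sum(adj[i]) for i in range(n)]
    in_deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - out_deg[i] * in_deg[j] / total
    return q / total


# Hypothetical graph: two tightly linked pairs joined by a weak link.
adj = [[0, 5, 0, 0], [5, 0, 1, 0], [0, 1, 0, 5], [0, 0, 5, 0]]
rel = rwr_relevance(adj, restart=0.3)
print(modularity(adj, [0, 0, 1, 1]), modularity(adj, [0, 0, 0, 0]))
```

Assigning every node to a single cluster always yields a modularity of zero, so a positive value indicates that a partition captures more within-group weight than expected by chance.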
We apply the aforementioned algorithm to the subjective data graph for various restart probability values. Fig. 2 shows the final modularity value with respect to the restart probability, ranging from 0.01 to 0.99. The number of final clusters differs when the restart probability differs. It can be seen that the graph is clustered into one or two groups in all cases. In particular, the results with zero modularity correspond to the case in which all nodes in the graph are assigned to one cluster. That is, it is difficult to divide the nodes into groups with a clearly distinguished subjective preference. Fig. 3 shows examples of final clustering results that have high modularity. Two clusters are formed in these examples, and the cluster containing high-quality videos (marked with blue) is much bigger than that containing low-quality videos (marked with red). It seems that, in the video dataset used, discriminating between videos with high and medium quality was difficult, whereas videos with medium and low quality were more easily distinguished.
Fig. 3: Three examples of final clustering results with high modularity (high-quality cluster in blue, low-quality cluster in red).
V Quality Estimation
In this section, we investigate the problem of the objective quality assessment of online videos. First, the performance of the state-of-the-art objective quality metrics is evaluated. Second, quality estimation using metadata-driven metrics is investigated.
V-A No-Reference Objective Metrics
The performance of three state-of-the-art NR quality metrics, namely, V-BLIINDS, BRISQUE, and BLIINDS-II, is examined using the video and subjective data described in the previous section. Some videos have title scenes at the beginning (e.g., title text shown on a black background for a few seconds), which are excluded for quality evaluation because they are usually irrelevant for judging video quality.
The performance of the metrics is shown in Table IV. We adopt the Spearman rank-order correlation coefficient (SROCC) as the performance index because the relationship between the metrics’ outputs and MOS is nonlinear. Statistical test results reveal that only the SROCC of BRISQUE is statistically significant. Interestingly, V-BLIINDS, which is a video quality metric, appears to be inferior to BRISQUE and BLIINDS-II, which are image quality metrics, meaning that the way V-BLIINDS incorporates temporal information is not very effective for the user-generated videos. Overall, the performance of the metrics is far inferior to that shown in existing studies using other databases. The SROCC values of BLIINDS-II and BRISQUE on the LIVE IQA database, which contains images corrupted by blurring, compression, random noise, etc., were reported to be as high as 0.9250 and 0.9395, respectively. The SROCC of V-BLIINDS on the LIVE VQA database, which contains videos degraded by compression, packet loss, etc., was reported to be 0.759. This implies that the problems of online video quality assessment are very different from those of traditional quality assessment. The reasons why the NR metrics fail are discussed further in Section VI with examples.
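For reference, SROCC is the Pearson correlation computed on ranks, so it depends only on the ordering of the metric outputs and not on their scale. The following is a minimal sketch with tied values receiving their average rank; the function names and the sample values are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / (vx * vy) ** 0.5


def srocc(x, y):
    """Spearman rank-order correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical metric outputs and MOS values with a monotonic, nonlinear relation.
metric = [1, 4, 9, 16, 25]
mos = [0.10, 0.20, 0.25, 0.90, 1.00]
print(srocc(metric, mos), pearson(metric, mos))
```

In this example the ordering of the two sequences agrees perfectly, so the rank correlation reaches its maximum even though the linear correlation does not.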
V-B Metadata-driven Metrics
|Metric||SROCC|
|#like / #view||0.5347|
|#subscribe / #channel video||0.4408|
|#like / date||0.3558|
|Channel description length||0.3482|
|#channel viewcount / #channel video||0.3061|
|#view / date||0.2727|
|#comment / #view||0.1861|
|#channel comment / #channel video||0.1414|
Metadata-driven metrics are defined as either the original values of the metadata listed in Table III or the values obtained by combining them (e.g., #like divided by #view for normalization). Table V shows the performance of the metadata-driven metrics for quality prediction. It is observed that the performance of the metrics significantly varies, from fairly high to almost no correlation with the MOS. It is worth noting that several metadata-driven metrics show better performance than the NR quality metrics. Generally, the video-specific metrics (e.g., video viewcount, the number of video comments) show better performance than the channel-specific metrics (e.g., channel viewcount, number of channel comments).
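Combined metrics of this kind are simple ratios of raw metadata values. The sketch below computes a few of them for one video; the dictionary keys, the function name, and the sample values are hypothetical, and "date" is taken as the number of days between 14 Feb. 2005 and the upload date.

```python
from datetime import date

YOUTUBE_START = date(2005, 2, 14)  # reference day for the "date" value


def metadata_metrics(meta):
    """Derive normalized metadata-driven metrics for a single video."""
    def ratio(num, den):
        # Guard against empty denominators (e.g., a video with no views).
        return num / den if den else 0.0

    days = (meta["upload_date"] - YOUTUBE_START).days
    return {
        "like_per_view": ratio(meta["likes"], meta["views"]),
        "comment_per_view": ratio(meta["comments"], meta["views"]),
        "like_per_date": ratio(meta["likes"], days),
        "view_per_date": ratio(meta["views"], days),
    }


# Hypothetical metadata record.
video = {"views": 2000, "likes": 50, "comments": 10,
         "upload_date": date(2005, 2, 24)}
print(metadata_metrics(video))
```

Each resulting value can then be correlated with the MOS across videos in the same way as the output of a signal-based metric.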
SROCC of (a) linear regression and (b) SVR using metadata and objective quality assessment algorithms.
The metadata-driven metric showing the highest SROCC is the description length. A possible explanation for this is that a video with good quality would have significant visual and semantic information, and the uploader may want to provide a detailed description about the video faithfully. The second-best metadata-driven metric is the ratio of the like count to the viewcount. This metric inherently contains information about the satisfaction of the viewers, which is closely related to the video quality. The third-ranked metadata-driven metric is the maximum spatial resolution. The availability of high resolution for a video means that it was captured with a high-performance camera or that it did not undergo video processing, which would reduce the spatial resolution and possibly degrade video quality. The number of subscribers of an uploader is ranked fourth, which indicates the popularity of the uploader. A popular uploader’s videos are also popular, and their visual quality plays an important role.
Other metrics related to video popularity, including the numbers of likes and dislikes, the viewcount, and the number of comments, show statistically significant but only moderate correlations with the MOS. This indicates that popularity is a meaningful but not perfect predictor of quality. The age of a video (“date” in the table) has a moderate correlation with perceptual quality, which means that newer videos tend to have better quality. This can be understood by considering that recent recording devices usually produce better-quality videos than older devices and that users are more experienced in video production than before.
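Apparently, each of the metadata-driven metrics represents a distinct partial characteristic of the videos. Therefore, it can be expected that combining multiple metrics will yield improved results due to their complementarity. For the integration of the metrics, we employ two techniques, namely, linear regression and nonlinear support vector regression (SVR) using Gaussian kernels. The former is written as:
$$\hat{q} = \mathbf{w}^{\top}\mathbf{x} + b \tag{12}$$
where $\mathbf{x}$ is a vector composed of metadata-driven metrics, and $\mathbf{w}$ and $b$ are tunable parameters of the linear regression model. The latter is given as:
$$\hat{q} = \sum_{i=1}^{N} (\alpha_i - \alpha_i^{*})\, K(\mathbf{x}_i, \mathbf{x}) + b \tag{13}$$
where $\alpha_i$, $\alpha_i^{*}$, $\mathbf{x}_i$ ($i = 1, \ldots, N$), and $b$ are parameters of the SVR model, and $K(\cdot, \cdot)$ is a Gaussian kernel expressed by
$$K(\mathbf{x}_i, \mathbf{x}) = \exp\left(-\frac{\lVert \mathbf{x}_i - \mathbf{x} \rVert^{2}}{2\sigma^{2}}\right) \tag{14}$$
Here, $\sigma^{2}$ is the variance of the Gaussian kernel.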
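The two prediction rules described above can be sketched as follows. This is only an illustration under assumptions: it uses the standard dual form of SVR, in which each support vector carries a single coefficient (the difference of its two dual variables), and all parameter values below are made up rather than learned from the dataset.

```python
import math


def gaussian_kernel(x, z, variance):
    """Gaussian kernel exp(-||x - z||^2 / (2 * variance))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * variance))


def linear_predict(x, weights, bias):
    """Linear regression prediction: weighted sum of the metrics plus a bias."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def svr_predict(x, support_vectors, dual_coefs, bias, variance):
    """SVR prediction: kernel-weighted sum over support vectors plus a bias.

    dual_coefs[i] plays the role of (alpha_i - alpha_i^*) in the standard
    dual formulation (an assumption about the notation).
    """
    return sum(
        coef * gaussian_kernel(sv, x, variance)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + bias


# Made-up parameters for a two-metric input vector.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [0.5, -0.25]
svr_score = svr_predict([0.0, 0.0], support_vectors, dual_coefs, bias=3.0, variance=1.0)
linear_score = linear_predict([2.0, 3.0], weights=[0.5, -1.0], bias=1.0)
```

In practice the coefficients, bias, and kernel variance would be obtained by training, for example with an off-the-shelf SVR solver, rather than set by hand.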
Fig. 5. Histograms of rank differences between the MOS and the quality scores estimated by V-BLIINDS, BRISQUE, BLIINDS-II, and the SVR model combining metadata-driven metrics.
We train the models by changing the number of metadata-driven metrics, i.e., the length of the input vector in (12) and (13), by adding metrics one by one in descending order of SROCC in Table V. The models are trained and evaluated using all the data without separating the training and test data to obtain insight into the performance potential of the integrated models to estimate quality. In addition, to further incorporate the information of visual signals within the regression models, we also test models combining metadata-driven metrics and V-BLIINDS, BRISQUE, and BLIINDS-II, respectively, to examine the synergy between the two modalities (i.e., the visual signal and metadata).
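The incremental procedure can be sketched as below for the linear regression case. The sketch is an assumption-laden illustration rather than the actual experimental code: the feature names and values are made up, the least-squares fit uses the normal equations with a small Gauss-Jordan solver, and the SROCC helper assumes tie-free data. As in the text, the model is fitted and evaluated on the same data.

```python
def spearman(a, b):
    """SROCC via the rank-difference formula (assumes no tied values)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for position, index in enumerate(order, start=1):
            result[index] = position
        return result
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def fit_linear(rows, targets):
    """Least-squares fit of targets ~ w.x + b (assumes a nonsingular system)."""
    design = [list(row) + [1.0] for row in rows]  # append a bias column
    k = len(design[0])
    # Augmented normal equations [A^T A | A^T y].
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * t for d, t in zip(design, targets))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting, then eliminate the column from all other rows.
        pivot = max(range(col, k), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [u - factor * v for u, v in zip(aug[row], aug[col])]
    coefs = [aug[i][k] / aug[i][i] for i in range(k)]
    return coefs[:-1], coefs[-1]


def incremental_srocc(features, mos):
    """Add metrics one by one in descending order of individual SROCC."""
    order = sorted(features, key=lambda name: spearman(features[name], mos), reverse=True)
    results = []
    for count in range(1, len(order) + 1):
        names = order[:count]
        rows = [[features[name][i] for name in names] for i in range(len(mos))]
        weights, bias = fit_linear(rows, mos)
        predictions = [sum(w * v for w, v in zip(weights, row)) + bias for row in rows]
        results.append((count, spearman(predictions, mos)))
    return order, results


# Made-up data for six videos; MOS is constructed as the sum of two metrics.
features = {
    "like_per_view": [1.5, 3.0, 0.5, 2.0, 4.0, 2.5],
    "comment_per_view": [1.0, 0.4, 0.6, 2.2, 0.8, 0.2],
    "view_per_day": [0.3, 0.9, 0.1, 0.5, 0.7, 1.1],
}
mos = [2.5, 3.4, 1.1, 4.2, 4.8, 2.7]
order, results = incremental_srocc(features, mos)
```

Since no data are held out, the resulting SROCC values indicate the performance potential of the combined model, not its generalization to unseen videos.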
Fig. 4 shows the results of the linear regression and SVR model evaluation. In both cases, the SROCC improves as the number of metadata-driven metrics increases. When 20 metadata-driven metrics are used, the SROCC reaches nearly 0.8 for linear regression and 0.85 for SVR, which is a substantial improvement over the highest SROCC achieved by a single metric (0.54 by the description length). The effect of incorporating an NR objective quality metric is negligible (for linear regression) or marginal (for SVR). This limited contribution of the video signal-based metrics seems to be due to their limited quality predictability, as shown in Table IV.
To compare the quality prediction performance of metadata and NR quality metrics in more depth, we analyze differences between the predicted ranks and MOS ranks. Fig. 5 shows histograms of rank differences between the MOS and estimated quality scores by the NR objective quality metrics or SVR combining 20 metadata-driven metrics. It is observed that the SVR model of metadata shows smaller rank differences than the objective quality metrics. From an ANOVA test, it is confirmed that the mean locations of the rank differences for the four cases are significantly different at a significance level of (). Additionally, Duncan’s multiple range tests reveal that the mean location of the rank differences of the SVR-based metadata model is significantly different from those of the signal-based metrics at a significance level of . These results demonstrate that metadata-based quality prediction is a powerful way of dealing with the online video quality assessment problem by overcoming the limitations of the conventional objective metrics based on the visual signal.
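A minimal sketch of the rank-difference computation is given below. Whether signed or absolute differences were used for the histograms is not stated, so the signed difference (predicted rank minus MOS rank) is an assumption here, and the scores are made up.

```python
from collections import Counter


def rank_descending(scores):
    """Rank 1 is the highest score (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_differences(predicted, mos):
    """Signed difference between the predicted rank and the MOS rank per video."""
    predicted_ranks = rank_descending(predicted)
    mos_ranks = rank_descending(mos)
    return [p - m for p, m in zip(predicted_ranks, mos_ranks)]


# Made-up scores for four videos.
mos = [4.5, 3.0, 2.0, 1.0]
predicted = [0.9, 0.2, 0.5, 0.1]
differences = rank_differences(predicted, mos)
histogram = Counter(differences)  # counts per rank-difference value
mean_abs_difference = sum(abs(d) for d in differences) / len(differences)
```

A histogram concentrated around zero indicates that the predicted ordering of the videos is close to the MOS ordering.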
In the previous sections, it was shown that the quality evaluation of online user-generated videos is difficult both subjectively and objectively. In this section, we discuss these issues in more detail with representative examples.
In Section IV, it was observed that viewers have difficulty in determining quality superiority among certain videos. As shown in Table I, there are many different factors of quality degradation in user-generated videos. In many cases, therefore, users are required to compare quality across different factors, which is not only difficult, but also very subjective depending on personal taste. For example, video #15 has jerkiness and misfocusing, video #16 has blur, and video #17 has camera shaking. In the subjective test results, video #16 is preferred to video #15, video #15 is preferred to video #17, and video #17 is preferred to video #16. As a result, the match results among these three videos form a circular (intransitive) triad, which contributes to the local inconsistency shown in Fig. 1(c).
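The intransitivity in this example can be checked mechanically. The sketch below enumerates circular triads in a set of paired-comparison outcomes; the representation of outcomes as (preferred, less preferred) pairs is an assumption made for illustration and is not the HodgeRank analysis itself.

```python
from itertools import combinations


def circular_triads(preferences):
    """Return triples of items whose pairwise outcomes form a cycle.

    preferences is a set of (preferred, less_preferred) pairs.
    """
    items = sorted({item for pair in preferences for item in pair})
    triads = []
    for a, b, c in combinations(items, 3):
        clockwise = {(a, b), (b, c), (c, a)} <= preferences
        counterclockwise = {(b, a), (c, b), (a, c)} <= preferences
        if clockwise or counterclockwise:
            triads.append((a, b, c))
    return triads


# Outcomes described in the text: #16 over #15, #15 over #17, #17 over #16.
observed = {(16, 15), (15, 17), (17, 16)}
cycles = circular_triads(observed)

# A consistent (transitive) set of outcomes for comparison.
consistent = {(1, 2), (2, 3), (1, 3)}
no_cycles = circular_triads(consistent)
```

No single ranking of the three videos can agree with all three outcomes in a circular triad, which is why such triads appear as local inconsistency.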
In Section V, we showed that the state-of-the-art NR objective metrics fail to predict the perceptual quality of user-generated videos. This is largely due to the fact that the quality degradation conditions targeted during the development of the metrics are significantly different from those involved in user-generated videos. When the NR metrics were developed, it was normally assumed that original video sequences have perfect quality. However, as shown in Table I, the user-generated videos are already subject to various types of quality degradation in the production stage (e.g., insufficient lighting, hand shaking). Furthermore, NR metrics are usually only optimized for a limited number of typical quality factors, such as compression artifacts, packet loss, blurring, and random noise, while there are many more quality factors that can be involved during editing, processing, and distribution, some of which are even related to aesthetic aspects.
Editing effects are particularly difficult for the NR metrics to assess, not only because many of them are not considered by the metrics, but also because some of them may be wrongly treated as artifacts. Videos #1 and #2 are examples containing unique editing effects in the temporal domain. They are a fast-playing video and a time-lapse video, respectively, which are interesting to viewers and thus ranked 1st and 2nd in MOS, respectively. However, as V-BLIINDS regards them as having poor motion consistency, it gives them undesirably low ranks (i.e., 40th and 44th, respectively).
In Section V-B, metadata were shown to be useful to extract quality information, showing better quality evaluation performance than the NR objective metrics. A limitation of metadata is that some are sensitive to popularity, users’ preference, or other video-unrelated factors, which may not perfectly coincide with the perceived quality of videos. Video #25 is such an example. It is a music video made by a fan of a musician. It has moderate visual quality and thus is ranked 25th in MOS. However, since the main content of this video is music, its popularity (the viewcount, the number of likes and comments, etc.) is mainly determined by the audio content rather than the visual quality. Moreover, it has the longest video description (listing the former works of the musician) in the dataset, according to which it would be ranked 14th (note that the description length is the best-performing metric in Table V).
A way to alleviate the limitation of each metadata-driven metric is to combine several metadata-driven metrics and expect them to compensate for the limited information of each of them, which was shown to be effective in our results. For instance, the available maximum resolution was shown to be highly correlated with MOS in Table V, so video #29 would be ranked highly since a high-resolution (1080p) version of this video is available. However, it is ranked only 29th in MOS due to compression artifacts and packet loss artifacts. When multiple metadata-driven metrics are combined using SVR, the rank of the video becomes 24th, which is much closer to the MOS rank.
We have presented our work on investigating the issue of the subjective and objective visual quality assessment of online user-generated video content. First, we examined users’ patterns of quality evaluation of online user-generated videos via the HodgeRank decomposition and graph clustering techniques. A large amount of local inconsistency in the paired-comparison results was found by the HodgeRank analysis, which implies that it is difficult for human viewers to determine quality superiority between videos ranked similarly in MOS, mainly due to the difficulty of comparing quality across different factors. Consequently, subjective distinction between different quality levels is only clear at a large cluster level, which was shown by the graph clustering results. We then benchmarked the performance of existing state-of-the-art NR objective metrics, and explored the potential of metadata-driven metrics for the quality estimation of user-generated video content. It was shown that the existing NR metrics do not yield satisfactory performance, whereas metadata-driven metrics perform significantly better than the NR metrics. In particular, as each of the metadata-driven metrics covers only limited information on visual quality, combining them significantly improved the performance. Finally, based on the results and examples, we provided a detailed discussion on why the subjective and objective quality assessment of user-generated videos is difficult.
Our results demonstrated that the problem of quality assessment of user-generated videos is very different from the conventional video quality assessment problem dealt with in the prior work. At the same time, our results have significant implications for future research: The existence of diverse quality factors involved in user-generated videos and the failure of existing metrics may suggest that the problem of user-generated video quality assessment is too large to be conquered by a single metric; thus, metrics specialized to different factors should be applied separately (and then be combined later, if needed). Since many factors, such as editing effects, are not covered by existing metrics, developing reliable metrics for them would be necessary. Moreover, the problem is highly subjective and depends on personal taste, so personalized quality assessment may be an effective method in the future. Therefore, proper ways to collect personalized data as ground truths would be required, where big data analysis techniques may be helpful.
While our results are based on a dataset with a limited number of video sequences, it is still reasonable to expect that most of them, particularly those related to the differences between user-generated and professional videos, apply generally, although the degree may vary. Nevertheless, larger-scale experiments with more videos of more diverse characteristics will be desirable in the future.
-  M. A. Saad, A. C. Bovik, and C. Charrier, “Blind prediction of natural video quality,” IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1352–1365, 2014.
-  Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
-  Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers, vol. 2, 2003, pp. 1398–1402.
-  E. C. Larson and D. M. Chandler, “Most apparent distortion: Full-reference image quality assessment and the role of strategy,” Journal of Electronic Imaging, vol. 19, no. 1, pp. 011006-1–011006-21, 2010.
-  H. R. Sheikh and A. C. Bovik, “A visual information fidelity approach to video quality assessment,” in Proceedings of the 1st International Workshop on Video Processing and Quality Metrics for Consumer Electronics, 2005, pp. 23–25.
-  K. Seshadrinathan and A. C. Bovik, “Motion tuned spatio-temporal quality assessment of natural videos,” IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335–350, 2010.
-  P. V. Vu, C. T. Vu, and D. M. Chandler, “A spatiotemporal most-apparent-distortion model for video quality assessment,” in Proceedings of the 18th IEEE International Conference on Image Processing, 2011, pp. 2505–2508.
-  R. Soundararajan and A. C. Bovik, “Video quality assessment by reduced reference spatio-temporal entropic differencing,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 4, pp. 684–694, 2013.
-  M. H. Pinson and S. Wolf, “A new standardized method for objectively measuring video quality,” IEEE Transactions on Broadcasting, vol. 50, no. 3, pp. 312–322, 2004.
-  M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, 2012.
-  A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.
-  S. Wilk and W. Effelsberg, “The influence of camera shakes, harmful occlusions and camera misalignment on the perceived quality in user generated video,” in Proceedings of IEEE International Conference on Multimedia and Expo, 2014, pp. 1–6.
-  T. Hoßfeld, R. Schatz, and U. R. Krieger, “QoE of YouTube video streaming for current Internet transport protocols,” in Measurement, Modelling, and Evaluation of Computing Systems and Dependability and Fault Tolerance. Springer, 2014, pp. 136–150.
-  J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston et al., “The YouTube video recommendation system,” in Proceedings of the 4th ACM Conference on Recommender Systems, 2010, pp. 293–296.
-  M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon, “Analyzing the video popularity characteristics of large-scale user generated content systems,” IEEE/ACM Transactions on Networking, vol. 17, no. 5, pp. 1357–1370, 2009.
-  F. Figueiredo, J. M. Almeida, M. A. Gonçalves, and F. Benevenuto, “On the dynamics of social media popularity: A YouTube case study,” ACM Transactions on Internet Technology, vol. 14, no. 4, pp. 24:1–24:23, 2014.
-  T. Xia, T. Mei, G. Hua, Y.-D. Zhang, and X.-S. Hua, “Visual quality assessment for web videos,” Journal of Visual Communication and Image Representation, vol. 21, no. 8, pp. 826–837, 2010.
-  Y. Yang, X. Wang, T. Guan, J. Shen, and L. Yu, “A multi-dimensional image quality prediction model for user-generated images in social networks,” Information Sciences, vol. 281, pp. 601–610, 2014.
-  C.-H. Han and J.-S. Lee, “Quality assessment of on-line videos using metadata,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 1385–1388.
-  J.-S. Lee, “On designing paired comparison experiments for subjective multimedia quality assessment,” IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 564–571, 2014.
-  Q. Xu, Q. Huang, T. Jiang, B. Yan, W. Lin, and Y. Yao, “HodgeRank on random graphs for subjective video quality assessment,” IEEE Transactions on Multimedia, vol. 14, no. 3, pp. 844–857, 2012.
-  D. Duan, Y. Li, Y. Jin, and Z. Lu, “Community mining on dynamic weighted directed graphs,” in Proceedings of the 1st ACM International Workshop on Complex Networks Meet Information & Knowledge Management, 2009, pp. 11–18.
-  H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, 2006.
-  K. Gu, G. Zhai, X. Yang, and W. Zhang, “Using free energy principle for blind image quality assessment,” IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50–63, 2015.
-  K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, “Study of subjective and objective quality assessment of video,” IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427–1441, 2010.
-  H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik, “Support vector regression machines,” Advances in Neural Information Processing Systems, vol. 9, pp. 155–161, 1997.
-  A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.