In traditional video production chains, video content was historically created and provided by a limited number of media producers, such as licensed broadcasters and production companies. With the development of technologies such as multimedia and mobile networks, recent years have witnessed an explosion of user-generated content (UGC) and related services. UGC videos created by end users have several unique characteristics. First, the lack of professional capture equipment and proper shooting skills makes the quality of UGC videos perceptually worse, which has been widely criticized by end users. Second, the low barriers to video production make UGC content extremely diverse, and special effects are often incorporated to enhance the user experience. Third, when compressed UGC videos are uploaded to a sharing platform, they often undergo another round of compression depending on the requirements of the hosting platform. As such, artifacts from multiple rounds of compression are induced, while pristine videos are absent from the hosting platform. Nevertheless, progress on UGC video quality assessment in the literature is quite limited, and traditional VQA measures may not accommodate these distinct properties.
To comprehensively evaluate state-of-the-art VQA measures and promote further development and comparative analysis, extensive subjective quality evaluations of UGC videos are important, as they serve as the prerequisite for research on objective quality evaluation. Several video databases for quality assessment are publicly available. For example, LIVE [1, 2] and LIVE Mobile [3, 4] collect pristine source videos, which subsequently undergo various types of distortion. MCL-JCV is a compressed video quality assessment database created based on the just noticeable difference (JND) model. Apparently, these databases with high-quality source videos may not align with the UGC application scenarios mentioned above. There are also databases that are more realistic for the UGC application scenario. LIVE-Qualcomm Mobile In-Capture contains videos with a variety of complex distortions introduced during acquisition, and KoNViD-1k is a subjectively annotated VQA database consisting of 1,200 public-domain video sequences. Moreover, the quality assessment of images undergoing multiple distortion stages has also been studied. Though these databases are more relevant to UGC videos, they do not suffice to simulate the UGC production chain from acquisition to processing on the hosting platform. Motivated by the above observations, in this work we create a new UGC video database that reflects realistic scenarios: UGC-VIDEO. This database contains 50 source videos from TikTok with a variety of video content. Regarding the distorted videos, we used two different compression standards to simulate the compression on the hosting platforms: H.264/AVC and H.265/HEVC. A subjective test was conducted to evaluate the visual quality, based on which comprehensive evaluations of objective models were further performed.
II UGC Database
The source videos in the database are intended to represent typical videos that users capture with their mobile phones and upload to TikTok.
II-A Video collection
To cover diverse content representing typical UGC videos, we began with a large-scale UGC video database containing 10,000 videos randomly selected from videos uploaded to TikTok. These videos are diverse in terms of shooting equipment, content, quality, resolution, duration, frame rate, etc. Subsequently, from this large-scale database, we selected videos according to the following principles:
A duration longer than 10 seconds
A resolution of 720×1280 (W×H)
A frame rate of around 30 frames per second (FPS)
The selected videos share the same spatial resolution, and 720p is one of the most dominant UGC video formats on mobile phones. Considering the diversity of video content, we further classified the videos into four categories, including selfie, indoor, outdoor and screen content. For selfie videos, most of the areas are occupied by one or more people, and some of them have special effects. Indoor and outdoor are common scenes, and indoor videos are usually shot close-up while outdoor videos are acquired from a distant view. Screen content videos include recorded game video, recorded animation, etc. The filtered subset was further randomly sampled, leading to around 100 videos per category.
II-B Statistical study
To sample from these 400 videos in a scientifically sound way, we statistically study the content of the videos based on three attributes: spatial information, temporal information and blur.
As suggested in ITU-T Rec. P.910, the spatial and temporal information of a scene are critical factors for the level of impairment suffered when the scene is transmitted through a lossy channel.
Spatial information: The spatial information (SI) indicates the amount of spatial detail in a frame. Each frame is filtered with a Sobel filter, and the maximum standard deviation over all Sobel-filtered frames is taken as the SI of the video, as formulated in Eqn. (1):

SI = max_n { std_space [ Sobel(F_n) ] },  (1)

where F_n denotes the n-th frame and std_space is the standard deviation over the pixels of a frame.
Temporal information: The temporal information (TI) indicates the amount of temporal change in a video. TI is derived from the motion difference M_n(i, j) = F_n(i, j) − F_{n−1}(i, j), which represents the difference of pixel values between successive frames. As such, the maximum standard deviation of the motion difference over time is taken as the TI of the video, as given by Eqn. (2):

TI = max_n { std_space [ M_n(i, j) ] }.  (2)
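The SI and TI computations above can be sketched in plain Python. This is a minimal sketch under stated assumptions: frames are 2-D lists of luma values, and `sobel_magnitude`, `spatial_information` and `temporal_information` are illustrative helper names, not functions from the paper.

```python
from statistics import pstdev

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(frame):
    """Gradient magnitude of a 2-D luma frame, valid (border-free) region only."""
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * frame[y - 1 + i][x - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * frame[y - 1 + i][x - 1 + j]
                     for i in range(3) for j in range(3))
            row.append((gx * gx + gy * gy) ** 0.5)
        out.append(row)
    return out

def spatial_information(frames):
    # SI = max over time of the spatial std of the Sobel-filtered frame, Eqn. (1)
    return max(pstdev(v for row in sobel_magnitude(f) for v in row) for f in frames)

def temporal_information(frames):
    # TI = max over time of the spatial std of successive-frame differences, Eqn. (2)
    # (requires at least two frames)
    return max(
        pstdev(a - b for ra, rb in zip(fn, fp) for a, b in zip(ra, rb))
        for fn, fp in zip(frames[1:], frames[:-1])
    )
```

A real implementation would operate on decoded luma planes (e.g. obtained via OpenCV or FFmpeg) rather than nested lists; the per-pixel loops here are for clarity, not speed.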
The acquired UGC videos are usually accompanied by varying degrees of blurring artifacts, which also significantly affect the video quality. Herein, blur is assessed by the cumulative probability of blur detection (CPBD) metric. We evenly extract one frame per second from each video, and the average CPBD value over these frames is regarded as the blur measurement.
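The temporal sampling and averaging step can be sketched as follows; `score_fn` stands in for the CPBD metric (no implementation of CPBD itself is assumed here), and `mean_blur_score` is an illustrative name:

```python
def mean_blur_score(frames, fps, score_fn):
    """Average a per-frame blur score over one frame sampled per second.

    `score_fn` is any callable mapping a frame to a scalar; in the paper's
    setup it would be the CPBD metric.
    """
    step = max(1, round(fps))      # one frame per second of video
    sampled = frames[::step]
    return sum(score_fn(f) for f in sampled) / len(sampled)
```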
II-C Video sampling
From Fig. 1 we can see that these 400 videos are scattered over most of the feature space spanned by SI and TI, and the most densely populated area is that in which both SI and TI are low. To make the selected videos uniformly distributed over the computed SI, TI and blur features, we use the sampling strategy introduced by Vonikakis et al. In particular, let D denote the original database, and let F and N denote the number of features and videos, respectively. We aim to select a subset S of M videos with a uniform target distribution P ∈ R^{B×F}, each of whose columns denotes the probability mass function (PMF) across one feature dimension, quantized into B bins. Let A = {a_{n,b,f}} be a set of binary indicators, where a_{n,b,f} denotes whether or not the n-th item of D belongs to the b-th interval of the target PMF for the f-th dimension. We further introduce a binary vector x = [x_1, …, x_N], where x_n is a decision variable determining whether the n-th item of D belongs to the subset S. This problem can be formulated as in Eqn. (3):

min_x Σ_f Σ_b | Σ_n a_{n,b,f} x_n − M · P_{b,f} |,  s.t.  Σ_n x_n = M,  x_n ∈ {0, 1}.  (3)
By finding the optimal solution x*, we can sample a subset from the original database that is nearly uniformly distributed over each feature. We performed this operation within each video content category, and finally selected 12, 13, 13 and 12 videos from the selfie, indoor, outdoor and screen content categories, respectively. We then extracted the first 10 seconds of each video and removed the audio track. The 50 selected videos are well scattered in the feature space of SI and TI, as shown in Fig. 1. Snapshots of some selected videos are shown in Fig. 2.
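The integer program of Eqn. (3) can be approximated with a simple greedy heuristic. The sketch below is not the solver used in the paper; it picks, one at a time, the item whose feature bins are currently most under-filled relative to a uniform per-bin target:

```python
def sample_uniform_subset(features, m, bins=4):
    """Greedily pick m items so per-feature histograms stay close to uniform.

    features: list of per-item feature vectors (e.g. [SI, TI, blur] per video).
    """
    n, f = len(features), len(features[0])
    lo = [min(v[j] for v in features) for j in range(f)]
    hi = [max(v[j] for v in features) for j in range(f)]

    def bin_of(v, j):
        # equal-width quantization of feature j into `bins` intervals
        if hi[j] == lo[j]:
            return 0
        return min(int((v - lo[j]) / (hi[j] - lo[j]) * bins), bins - 1)

    idx = [[bin_of(features[i][j], j) for j in range(f)] for i in range(n)]
    target = m / bins                       # uniform target count per bin
    counts = [[0.0] * bins for _ in range(f)]
    chosen, remaining = [], set(range(n))
    for _ in range(m):
        # greedily favour items falling into under-filled bins
        best = max(remaining,
                   key=lambda i: sum(target - counts[j][idx[i][j]] for j in range(f)))
        chosen.append(best)
        remaining.remove(best)
        for j in range(f):
            counts[j][idx[best][j]] += 1
    return sorted(chosen)
```

Unlike the exact formulation, the greedy pass does not guarantee the optimal histogram match, but it illustrates the bin-filling objective behind Eqn. (3).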
II-D Encoding configuration
Our primary goal is to investigate the quality assessment of UGC videos in order to improve video coding/transcoding performance. For each source, 10 corresponding sequences were generated by further encoding the video using two different codecs and five QPs. Both the H.264/AVC encoder x264 and the H.265/HEVC encoder x265 were used to encode each of the 50 sequences. The quality of the coded videos was determined by QPs with values of 22, 27, 32, 37 and 42 for both codecs. As such, with the inclusion of the source videos, we have 50×11 = 550 video sequences in the UGC-VIDEO database.
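The encoding grid (2 codecs × 5 QPs per source) can be sketched by building FFmpeg command lines. The exact encoder invocation is not given in the paper; the constant-QP flags below (`-qp` for libx264, `-x265-params qp=…` for libx265) and the output naming are one plausible setup:

```python
from itertools import product

QPS = [22, 27, 32, 37, 42]

def encode_commands(src):
    """Return the 10 ffmpeg command lines for one source video."""
    cmds = []
    for codec, qp in product(("x264", "x265"), QPS):
        out = f"{src.rsplit('.', 1)[0]}_{codec}_qp{qp}.mp4"
        if codec == "x264":
            cmds.append(["ffmpeg", "-i", src,
                         "-c:v", "libx264", "-qp", str(qp), out])
        else:
            cmds.append(["ffmpeg", "-i", src,
                         "-c:v", "libx265", "-x265-params", f"qp={qp}", out])
    return cmds
```

Each command list could then be run with `subprocess.run`; with 50 sources this yields the 500 compressed sequences, plus the 50 originals, of the database.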
III Subjective Evaluations
III-A Subjective testing
After collecting the videos, subjective testing was conducted to obtain the subjective scores. The single-stimulus (SS) method, more specifically an absolute category rating with hidden reference (ACR-HR) paradigm, was adopted to obtain opinion scores for the video database. The subjects rated the quality of each video they watched according to the five-grade quality scale "Excellent", "Good", "Fair", "Poor" and "Bad", corresponding to 5 to 1 points.
All the videos in our study were viewed by each subject. To minimize the effects of viewer fatigue, we conducted the study in three sessions, each lasting about half an hour and containing 16 or 17 source videos as well as all their corresponding compressed versions. All test videos in a session were played one by one in random order. Besides, 10 "dummy presentations" with various levels of distortion were introduced at the beginning of each session to stabilize the opinions of the subjects, and the data from these presentations were not taken into account in the final results of the test.
A program was developed for this study on a Windows PC. The videos were displayed one by one at their native resolution without scaling, and the subject provided a score for each video by clicking the corresponding button within a few seconds after the video was played. A Cathode Ray Tube (CRT) monitor with a display resolution of 1920×1440 was used, and the entire study was conducted on the same monitor. In total, 30 subjects participated in our study.
III-B Data analysis
We followed the procedure for the screening of subjects specified in ITU-R BT.500-13. No subjects were rejected at this stage of our study. Besides the mean opinion score (MOS), the differential mean opinion score (DMOS) can also be computed as the difference between the scores that a subject provides for the source video and the corresponding distorted video.
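The two score aggregations can be sketched directly from their definitions; the sign convention below (larger DMOS means larger perceived degradation) follows the definition above:

```python
def mos(scores):
    """Mean opinion score: average of the raw scores for one video."""
    return sum(scores) / len(scores)

def dmos(ref_scores, dist_scores):
    """Differential MOS: per-subject (reference - distorted) difference,
    averaged over subjects."""
    return sum(r - d for r, d in zip(ref_scores, dist_scores)) / len(ref_scores)
```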
In essence, full-reference quality evaluation algorithms can be greatly influenced by the quality of the reference video. Fig. 3 shows two sets of videos whose reference videos are of different quality. Both distorted videos are compressed by H.265/HEVC, and their objective quality scores from full-reference measures such as PSNR and SSIM are quite close. However, their visual quality is significantly different: noticeable blur can be seen in one compressed video in Fig. 3, while little noticeable distortion can be found in the other.
The curves of DMOS as a function of the bit-rate ratio r in each category when transcoding with HEVC are plotted in Fig. 4, where r = R_c / R_r for each compressed video, and R_r and R_c denote the bit rates of the reference and the compressed video, respectively. The r and DMOS values of each point in a curve are the mean values over all videos with the same category and compression QP. The black dotted line indicates DMOS = 0; points below this line indicate that the quality of the videos was slightly improved after compression. This is due to the fact that, for some source videos with obvious distortion or noise, compression plays a smoothing or denoising role. As the QP increases, the most significant change in r appears in the outdoor videos, while the screen content videos are less affected.
IV Evaluations of Objective Models
IV-A VQA for low quality reference videos
We evaluated the performance of several publicly available objective quality assessment algorithms based on subjective scores from the established database.
Image quality assessment methods were extended to videos by averaging frame-level quality scores. The evaluated full-reference and reduced-reference metrics include PSNR, SSIM, VIF, Multi-Scale SSIM (MS-SSIM), SpEED-QA, ViS3, Video Multi-method Assessment Fusion (VMAF) and its phone-screen viewing model VMAF-phone.
Three performance criteria, including Spearman's rank-ordered correlation coefficient (SROCC), Pearson's linear correlation coefficient (PLCC) and the root-mean-squared error (RMSE), were used in the evaluation. PLCC and RMSE are computed after a non-linear regression using a four-parameter logistic function, defined as

f(x) = β_2 + (β_1 − β_2) / (1 + exp(−(x − β_3) / |β_4|)),

where x is the objective score and β_1, …, β_4 are the fitted parameters.
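The three criteria can be sketched in pure Python from their standard definitions. Note that the reported PLCC and RMSE are computed after the logistic mapping, which this sketch omits for brevity; `_ranks`, `plcc`, `srocc` and `rmse` are illustrative names:

```python
def _ranks(xs):
    """Average ranks with tie handling, as used by Spearman's correlation."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend the tie group
        avg = (i + j) / 2 + 1           # 1-based average rank of the group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def plcc(a, b):
    """Pearson's linear correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def srocc(a, b):
    """Spearman's rank-ordered correlation: Pearson's correlation of ranks."""
    return plcc(_ranks(a), _ranks(b))

def rmse(a, b):
    """Root-mean-squared error between predicted and subjective scores."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5
```

In practice one would fit the logistic parameters (e.g. with `scipy.optimize.curve_fit`) before computing PLCC and RMSE against the subjective scores.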
To further investigate the influence of low-quality references on the VQA algorithms, the source videos were divided according to the 20th percentile of their MOS scores. The 110 compressed videos with low-quality sources were assigned to category 1, and the remaining 390 compressed videos to category 2. Table I tabulates the SROCC, PLCC and RMSE between the algorithm scores and DMOS for each category, as well as across all videos. In principle, a low-quality reference will lead to poor performance of reference-based quality measures. The results of this study support our conjecture: we observe a significant performance degradation on videos with worse-quality references (category 1) compared to videos with good references (category 2) for almost all tested algorithms. Combined with the analysis of Fig. 3 above, we find that when the quality of the reference video is poor, reference-based VQA algorithms cannot achieve the desired performance in terms of correlation with either DMOS or MOS.
IV-B VQA for videos under different content categories
In the actual UGC transfer process, there is no straightforward and effective way to compute the quality of source videos. As such, we can only regard these source videos of various quality as "pristine" references. Herein, we evaluated the performance of the algorithms for each video content category, and their correlations with MOS scores were computed. Besides the above-mentioned reference-based VQA algorithms, the no-reference methods BRISQUE [25, 26], NIQE, VIIDEO and BLIINDS were also evaluated. The SROCC, PLCC and RMSE results for the four separate categories and the whole database are shown in Table II.
We find that the existing algorithms may not provide reasonably accurate predictions on UGC videos. Most algorithms perform worst on screen content videos, which may be attributed to the fact that most algorithms do not consider the characteristics of this particular type of content. The no-reference models BRISQUE and NIQE, trained on natural images, may not be appropriate for unnatural and artificially generated content. The blind quality algorithms VIIDEO and BLIINDS also perform worse than the reference-based algorithms on all categories. This indicates that it is a great challenge to evaluate such diverse videos with various multi-stage distortions without any reference. Besides, by comparing the results on "All data" in Tables I and II, we find that most reference-based algorithms correlate better with DMOS than with MOS.
We have introduced a UGC video database containing diverse UGC videos with subjective ratings. The distinct properties of the database are that the source videos were selected in a scientifically sound way, and that the database is realistic in terms of the distortion process. Full-reference, reduced-reference and no-reference quality assessment algorithms were validated on the proposed database. The low correlations between subjective ratings and objective measures suggest that there is still large room to improve the quality assessment of UGC videos.
-  Kalpana Seshadrinathan, Rajiv Soundararajan, Alan Conrad Bovik, and Lawrence K Cormack, “Study of subjective and objective quality assessment of video,” IEEE transactions on Image Processing, vol. 19, no. 6, pp. 1427–1441, 2010.
-  Kalpana Seshadrinathan, Rajiv Soundararajan, Alan C Bovik, and Lawrence K Cormack, “A subjective study to evaluate video quality assessment algorithms,” in Human vision and electronic imaging XV. International Society for Optics and Photonics, 2010, vol. 7527, p. 75270H.
-  Anush Krishna Moorthy, Lark Kwon Choi, Alan Conrad Bovik, and Gustavo De Veciana, “Video quality assessment on mobile devices: Subjective, behavioral and objective studies,” IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 652–671, 2012.
-  Anush K Moorthy, Lark K Choi, Gustavo De Veciana, and Alan C Bovik, “Subjective analysis of video quality on mobile devices,” in Sixth International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), Scottsdale, Arizona. Citeseer, 2012.
-  Haiqiang Wang, Weihao Gan, Sudeng Hu, Joe Yuchieh Lin, Lina Jin, Longguang Song, Ping Wang, Ioannis Katsavounidis, Anne Aaron, and C-C Jay Kuo, “MCL-JCV: a JND-based H.264/AVC video quality assessment dataset,” in Image Processing (ICIP), 2016 IEEE International Conference on. IEEE, 2016, pp. 1509–1513.
-  Deepti Ghadiyaram, Janice Pan, Alan C Bovik, Anush Krishna Moorthy, Prasanjit Panda, and Kai-Chieh Yang, “In-capture mobile video distortions: A study of subjective behavior and objective algorithms,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2061–2077, 2017.
-  Vlad Hosu, Franz Hahn, Mohsen Jenadeleh, Hanhe Lin, Hui Men, Tamás Szirányi, Shujun Li, and Dietmar Saupe, “The Konstanz natural video database (KoNViD-1k),” in 2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX). IEEE, 2017, pp. 1–6.
-  Shahrukh Athar, Abdul Rehman, and Zhou Wang, “Quality assessment of images undergoing multiple distortion stages,” in Image Processing (ICIP), 2017 IEEE International Conference on. IEEE, 2017, pp. 3175–3179.
-  Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on circuits and systems for video technology, vol. 13, no. 7, pp. 560–576, 2003.
-  Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Transactions on circuits and systems for video technology, vol. 22, no. 12, pp. 1649–1668, 2012.
-  ITU-T Rec. P.910, “Subjective video quality assessment methods for multimedia applications,” International Telecommunication Union, Geneva, vol. 2, 2008.
-  Niranjan D Narvekar and Lina J Karam, “A no-reference image blur metric based on the cumulative probability of blur detection (CPBD),” IEEE Transactions on Image Processing, vol. 20, no. 9, pp. 2678–2683, 2011.
-  Vassilios Vonikakis, Ramanathan Subramanian, and Stefan Winkler, “Shaping datasets: Optimal data selection for specific target distributions across dimensions,” in 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016, pp. 3753–3757.
-  Loren Merritt and Rahul Vanam, “x264: A high performance H.264/AVC encoder,” [online] http://neuron2.net/library/avc/overview_x264_v8_5.pdf, 2006.
-  “x265 HEVC Encoder/H.265 Video Codec,” http://x265.org/.
-  ITU-R Rec. BT. 500-13, “Methodology for the subjective assessment of the quality of television pictures,” 2012.
-  ITU-T Recommendation P.910, “Subjective video quality assessment methods for multimedia applications,” International Telecommunication Union, 1999.
-  Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
-  Hamid R Sheikh and Alan C Bovik, “Image information and visual quality,” in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2004, vol. 3, pp. iii–709.
-  Zhou Wang, Eero P Simoncelli, and Alan C Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. IEEE, 2003, vol. 2, pp. 1398–1402.
-  Christos G Bampis, Praful Gupta, Rajiv Soundararajan, and Alan C Bovik, “SpEED-QA: Spatial efficient entropic differencing for image and video quality,” IEEE Signal Processing Letters, vol. 24, no. 9, pp. 1333–1337, 2017.
-  Phong V Vu and Damon M Chandler, “ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices,” Journal of Electronic Imaging, vol. 23, no. 1, pp. 013016, 2014.
-  Anne Aaron, Zhi Li, Megha Manohara, Joe Yuchieh Lin, Eddy Chi-Hao Wu, and C-C Jay Kuo, “Challenges in cloud based ingest and encoding for high quality streaming media,” in 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015, pp. 1732–1736.
-  Hamid R Sheikh, Muhammad F Sabir, and Alan C Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE Transactions on image processing, vol. 15, no. 11, pp. 3440–3451, 2006.
-  A Mittal, AK Moorthy, and AC Bovik, “Referenceless image spatial quality evaluation engine,” in 45th Asilomar Conference on Signals, Systems and Computers, 2011, vol. 38, pp. 53–54.
-  Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.
-  Anish Mittal, Rajiv Soundararajan, and Alan C Bovik, “Making a “Completely Blind” Image Quality Analyzer,” IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, 2013.
-  Anish Mittal, Michele A Saad, and Alan C Bovik, “A completely blind video integrity oracle,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 289–300, 2015.
-  Michele A Saad, Alan C Bovik, and Christophe Charrier, “Blind prediction of natural video quality,” IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1352–1365, 2014.