PEA265: Perceptual Assessment of Video Compression Artifacts

03/01/2019
by   Liqun Lin, et al.
University of Waterloo

The most widely used video encoders share a common hybrid coding framework that includes block-based motion estimation/compensation and block-based transform coding. Despite their high coding efficiency, the encoded videos often exhibit visually annoying artifacts, denoted as Perceivable Encoding Artifacts (PEAs), which significantly degrade the visual Quality-of-Experience (QoE) of end users. To monitor and improve visual QoE, it is crucial to develop subjective and objective measures that can identify and quantify various types of PEAs. In this work, we make the first attempt to build a large-scale subject-labelled database composed of H.265/HEVC compressed videos containing various PEAs. The database, namely the PEA265 database, includes 4 types of spatial PEAs (i.e. blurring, blocking, ringing and color bleeding) and 2 types of temporal PEAs (i.e. flickering and floating), each with at least 60,000 image or video patches with positive and negative labels. To objectively identify these PEAs, we train Convolutional Neural Networks (CNNs) using the PEA265 database. It appears that the state-of-the-art ResNeXt is capable of identifying each type of PEA with high accuracy. Furthermore, we define PEA pattern and PEA intensity measures to quantify the PEA levels of a compressed video sequence. We believe that the PEA265 database and our findings will benefit the future development of video quality assessment methods and perceptually motivated video encoders.


I Introduction

The last decade has witnessed a boom in High Definition (HD)/Ultra HD (UHD) and 3D/360-degree videos due to the rapid development of video capturing, transmission and display technologies. According to the Cisco Visual Networking Index (VNI) [1], video content already accounts for over two thirds of the bandwidth of current broadband and mobile networks, and its share will grow to 80%-90% in the foreseeable future. To meet such demand, it is necessary to improve network bandwidth and to maximize video quality under a limited bitrate or bandwidth constraint, where the latter is generally achieved by lossy video coding technologies.

The widely used video coding schemes are lossy for two reasons. Firstly, Shannon's theorem sets the limit of lossless coding, which cannot fulfill the practical needs of video compression; secondly, the Human Visual System (HVS) [2] is not uniformly sensitive to visual signals at all frequencies, which allows us to suppress certain frequencies with negligible loss of perceptual quality. State-of-the-art video coding schemes, such as H.264 Advanced Video Coding (H.264/AVC) [3], H.265 High Efficiency Video Coding (H.265/HEVC) [4], Google VP8/VP9 [5, 6] and China's Audio-Video coding Standards (AVS/AVS2) [7, 8], adopt the conventional hybrid video coding structure. This framework, which originated in the 1980s [9], consists of a group of standard procedures including intra-frame prediction, inter-frame motion estimation and compensation, spatial transform, quantization and entropy coding. To process videos of large sizes, the encoder further divides the frames into slices and coding units. Consequently, when the bitrate is not sufficiently high, the compressed video suffers various types of information loss within and across blocks, slices and units, resulting in visually unnatural structure impairments or perceptual artifacts [10]. These Perceivable Encoding Artifacts (PEAs) greatly degrade the visual Quality-of-Experience (QoE) of users [11, 12].

The detection and classification of PEAs are challenging tasks. In video encoders, conventional quality metrics such as the Sum of Absolute Differences (SAD) [13], the Sum of Squared Errors (SSE) [14], the Peak Signal-to-Noise Ratio (PSNR) [15] and the Structural SIMilarity (SSIM) index [15] are weak indicators of PEAs. At the user end, the PEAs are highly visible but not properly measured. Recent developments have pushed forward the 4K/8K era, and user-centric video coding and delivery have become ever more important [16]. Meanwhile, the advancement of computing and networking technologies has enabled deep investigations into PEA recognition and quantification.

(a) Reference frame
(b) Compressed frame with blurring artifact
Fig. 1: An example of blurring artifact.

In [17], the classification of diversified PEAs has been elaborated. In [18], it is observed that these PEAs have significant impacts on the visual quality of H.264/AVC compressed videos; specifically, 96% of the quality variance could be predicted by the intensities of three common PEAs: blurring, blocking and color bleeding. Until now, blocking and blurring artifacts have been extensively investigated; they are caused by spatially inconsistent and high-frequency signal losses, respectively. In many hybrid encoders, de-blocking filters are introduced to prevent severe blocking artifacts, which may, however, introduce high blurriness [19]. Other typical artifacts, such as ringing [20] and color bleeding [21], may be generated by errors in the high frequencies of the luma and chroma signals, respectively. To address these issues, intricate schemes have been developed for PEA removal [22, 23, 24]. However, due to their high complexities, these algorithms are usually deployed at the post-processing stage rather than during video compression. Meanwhile, temporal PEAs have also attracted significant attention. In [25], a simplified robust statistical model and the Huber statistical model were proposed for temporal artifact reduction. Gong et al. [26] analyzed the hierarchical prediction structure to find plausible causes of temporal artifacts, and proposed a metric for just-noticeable temporal artifacts together with an efficient temporal PEA elimination algorithm for video coding. In addition, Zeng et al. [17] presented an algorithm for detecting and locating floating artifacts. Despite these efforts, there is still a lack of subjective and objective approaches to systematic PEA recognition and analysis. Recently, deep learning techniques [27], especially Convolutional Neural Networks (CNNs) [28], have demonstrated their promise in improving video coding performance [29, 30, 31, 32, 33]. This inspires us to introduce CNNs to the recognition of PEAs in hybrid encoding.

(a) Reference frame
(b) Compressed frame with blocking artifact
Fig. 2: An example of blocking artifact.

In this work, we employ the state-of-the-art video encoder H.265/HEVC to develop a PEA database, namely the PEA265 database, for PEA recognition. The contributions of this work are as follows:

(1) A subject-labelled database of compressed videos with PEAs. We select 6 typical types of PEAs based on [17], utilize H.265/HEVC to encode a group of standard sequences, and recruit subjects to mark all 6 types of PEAs. Finally, we cut the marked sequences into image/video patches with positive and negative PEA labels. In total, at least 60,000 positively or negatively labelled patches are provided for each type of PEA.

(2) An objective PEA recognition approach based on CNNs. For each type of PEA, we construct and compare LeNet [34] and ResNeXt [35] models to recognize the PEA. The state-of-the-art ResNeXt outperforms LeNet in PEA recognition, achieving an accuracy of at least 80% for all PEA types.

(3) A PEA intensity measure for a compressed video sequence. By summarizing all PEA recognitions, we obtain an overall PEA intensity measure of a compressed video sequence, which helps characterize the subjective annoyance of PEAs in compressed video.

(a) Reference frame
(b) Compressed frame with ringing artifact
Fig. 3: An example of ringing artifact.
(a) Reference frame
(b) Compressed frame with color bleeding artifact
Fig. 4: An example of color bleeding artifact.

The rest of the paper is organized as follows. In Section II, we discuss diversified PEAs in H.265/HEVC and select 6 types of PEAs to develop our database. In Section III, we elaborate the details of our subjective database including video sequence preparation, subjective testing and data processing. Section IV presents our deep learning-based PEA recognition and the overall PEA intensity measurement. Finally, Section V concludes the paper.

II PEA Classification

In this section, we review the PEA classification in [17] and select typical PEAs to develop our subjective database. According to [17], PEAs are classified into spatial and temporal artifacts, where spatial artifacts include blurring, blocking, color bleeding, ringing and the basis pattern effect, and temporal artifacts include floating, jerkiness and flickering. In this work, we select blurring, blocking, color bleeding and ringing among the spatial artifacts, and floating and flickering among the temporal artifacts, for the development of our database. The basis pattern effect and jerkiness are excluded because: 1) the basis pattern effect has a similar visual appearance and a similar origin to the ringing effect; 2) jerkiness is caused by capturing factors such as frame rate rather than compression. We summarize the characteristics and plausible causes of the 6 typical types of PEAs as follows.

II-A Spatial Artifacts

Block-based video coding schemes create various spatial artifacts due to block partitioning and quantization. The spatial artifacts, with different visual appearances, can be identified without temporal reference.

II-A1 Blurring

Aiming at a higher compression ratio, the HEVC encoder coarsely quantizes the transformed residuals. When the video signal is reconstructed, high-frequency energy may be severely lost, which leads to visible blur. Perceptually, blurring usually appears as a loss of spatial detail or of sharpness at edges and in textured regions of an image. An example is shown in the marked rectangular region in Fig. 1 (b), which displays the loss of spatial detail on the basketball court.

(a) Reference frame
(b) Compressed frame with flickering artifact
Fig. 5: An example of flickering artifact.
(a) Reference frame
(b) Compressed frame with floating artifact
Fig. 6: An example of floating artifact.

II-A2 Blocking

The HEVC encoder is block-based, and all compression processes are performed within non-overlapping blocks. This often results in false discontinuities across block boundaries. The visual appearance of blocking may differ depending on the region where the discontinuities occur. In Fig. 2 (b), a blocking example on the horse's tail is highlighted in the marked rectangular region.

No.  Class  Sequence (Resolution)           Frames  Frame rate
1    A      Traffic (2560×1600)             150     30 fps
2    A      PeopleOnStreet (2560×1600)      150     30 fps
3    A      NebutaFestival (2560×1600)      300     60 fps
4    A      SteamLocomotive (2560×1600)     300     60 fps
5    B      Kimono (1920×1080)              240     24 fps
6    B      ParkScene (1920×1080)           240     24 fps
7    B      Cactus (1920×1080)              500     50 fps
8    B      BQTerrace (1920×1080)           600     60 fps
9    B      BasketballDrive (1920×1080)     500     50 fps
10   C      RaceHorses (832×480)            300     30 fps
11   C      BQMall (832×480)                600     60 fps
12   C      PartyScene (832×480)            500     50 fps
13   C      BasketballDrill (832×480)       500     50 fps
14   D      RaceHorses (416×240)            300     30 fps
15   D      BQSquare (416×240)              600     60 fps
16   D      BlowingBubbles (416×240)        500     50 fps
17   D      BasketballPass (416×240)        500     50 fps
18   E      FourPeople (1280×720)           600     60 fps
19   E      Johnny (1280×720)               600     60 fps
20   E      KristenAndSara (1280×720)       600     60 fps
21   F      BasketballDrillText (832×480)   500     50 fps
22   F      SlideEditing (1280×720)         300     30 fps
23   F      SlideShow (1280×720)            500     20 fps
TABLE I: Testing sequences

II-A3 Ringing

Ringing is caused by the coarse quantization of high-frequency components. When the high-frequency components of an oscillating structure suffer quantization errors, pseudo structures may appear near strong (high-contrast) edges, manifesting as artificial wave-like or ripple structures, denoted as ringing. A ringing example is given in the marked rectangular region in Fig. 3 (b).

II-A4 Color bleeding

Color bleeding is caused by coarse quantization of the chromaticity information. It is related to the presence of strong chroma variations in the compressed images, leading to false color edges, and may result from inconsistent rendering across the luminance and chrominance channels. A color bleeding example is provided in the marked rectangular region in Fig. 4 (b), which exhibits chromatic distortion and inconsistent color spreading.

II-B Temporal Artifacts

Temporal artifacts are manifested as temporal information loss, and can be identified during video playback.

II-B1 Flickering

Flickering usually manifests as frequent brightness or color changes along the time dimension. There are different kinds of flickering, including mosquito noise, fine-granularity flickering and coarse-granularity flickering. Mosquito noise is a high-frequency distortion that reflects coding effects in the time domain; it moves together with objects, like mosquitoes flying around them, and may be caused by the mismatch between the prediction error of the ringing effect and the motion compensation. The most likely cause of coarse-granularity flickering is luminance variation across Groups of Pictures (GOPs). Fine-granularity flickering may be produced by slow motion combined with the blocking effect. An example is given in the marked rectangular region in Fig. 5 (b), where frequent luminance changes on the surface of the water produce flickering artifacts.

II-B2 Floating

Floating refers to the appearance of illusory movements in certain regions as opposed to their surrounding areas. Visually, these regions create a strong illusion of floating on top of the surrounding background. It occurs most often when a scene with a large textured area, such as water or trees, is captured by a slowly moving camera. Floating artifacts may be due to the Skip mode in video coding, which simply copies a block from one frame to another without further updating the image details. Fig. 6 (b) gives a floating example, in which the marked regions appear to float on top of the leaves.

III PEA265 Database

The development of the PEA265 database is composed of four steps: preparation of test video sequences, subjective PEA region identification, patch labeling, and formation of PEA265 database.

III-A Testing Video Sequences

The selection of testing sequences follows the Common Test Conditions (CTC) [36]. These standard test sequences in YUV 4:2:0 format are summarized in Table I. We employ the HEVC reference encoder [37] to compress the video sequences with four Quantization Parameter (Qp) values: 22, 27, 32 and 37. Four types of coding structures are covered: all intra, random access, low delay and low delay P. In total, there are 320 encoded sequences. For consistency, the output bit depth is set to 8.
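For concreteness, the following is a minimal sketch of how such a set of encodes could be generated with the HM reference software. The binary name TAppEncoderStatic, the CTC configuration file names and the per-sequence .cfg files are assumptions about a local HM installation, not details given in the paper.

```python
import itertools
import os
import subprocess

# Assumed HM reference encoder binary and CTC configuration files
# (names may differ between HM releases).
ENCODER = "./TAppEncoderStatic"
CODING_STRUCTURES = {
    "AI": "cfg/encoder_intra_main.cfg",          # all intra
    "RA": "cfg/encoder_randomaccess_main.cfg",   # random access
    "LD": "cfg/encoder_lowdelay_main.cfg",       # low delay
    "LP": "cfg/encoder_lowdelay_P_main.cfg",     # low delay P
}
QP_VALUES = [22, 27, 32, 37]

def encode_all(sequence_cfgs):
    """Run one HM encode per (sequence, coding structure, Qp) combination."""
    for seq_cfg, (tag, struct_cfg), qp in itertools.product(
            sequence_cfgs, CODING_STRUCTURES.items(), QP_VALUES):
        name = os.path.splitext(os.path.basename(seq_cfg))[0]
        cmd = [
            ENCODER,
            "-c", struct_cfg,   # coding structure configuration
            "-c", seq_cfg,      # per-sequence settings (input, size, fps, frames)
            "-q", str(qp),      # quantization parameter
            "-b", f"out/{name}_{tag}_QP{qp}.bin",  # bitstream
            "-o", f"out/{name}_{tag}_QP{qp}.yuv",  # reconstruction (8-bit output assumed in the cfg)
        ]
        subprocess.run(cmd, check=True)

# Example: encode_all(["cfg/per-sequence/BasketballDrill.cfg", ...])
```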

III-B Subjective PEA Region Identification

In order to identify all PEAs, we ask subjects to label all video sequences. Our testing procedure follows the ITU-R BT.500 [38] recommendation and consists of two phases. In the pre-training phase, all subjects are informed of the testing procedure and trained to identify PEAs. In the formal-testing phase, all subjects are asked to watch the sequences and circle PEA regions. The test sequences are presented in random order, and mid-term breaks are scheduled during formal testing to avoid visual fatigue. 30 subjects (14 males and 16 females, aged between 20 and 22) participated in the subjective experiment.

(a) Patch labeling in a compressed video frame
(b) Patch labeling in corresponding reference video frame
Fig. 7: Positive/negative patch labeling for spatial PEAs.
(a) Patch labeling in compressed video frames
(b) Patch labeling in corresponding reference video frames
Fig. 8: Positive/negative patch labeling for temporal PEAs.

III-C Patch Labeling

During the subjective test, the PEA regions circled by subjects (typically elliptical) are saved in binary files, from which we derive positive and negative patches in rectangular or cuboid shapes.

III-C1 Spatial artifacts

For spatial artifacts, we label patches using a sliding window of 32×32 or 72×72 pixels. In a compressed video frame, if at least half of the pixels within the sliding window belong to a circled region, the patch is labeled as positive; otherwise it is labeled as negative. Patches from the corresponding frame of the uncompressed video are randomly selected and categorized as negative, whether or not they are co-located with a circled region. The ratio between the numbers of these two types of negative patches is 1:2. The labeling process is illustrated in Fig. 7.

III-C2 Temporal artifacts

Temporal PEAs appear in a group of successive video frames. When a subject pauses video playback and marks a temporal artifact region, 10 frames starting from the current frame are extracted. The video fragment is then further checked with a spatial sliding window of 32×32 or 72×72: if at least half of the pixels in the window lie within the circled region, the corresponding cuboid is labelled as positive, otherwise negative. As with spatial artifacts, negative temporal patches are also obtained from co-located regions in the uncompressed sequences. This process is illustrated in Fig. 8.
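As an illustration, here is a minimal sketch of the 50%-overlap labeling rule under stated assumptions: the subject annotations are assumed to be available as a per-frame binary mask, and the stride of the sliding window is assumed to equal the patch size.

```python
import numpy as np

def label_patches(circled_mask: np.ndarray, patch: int) -> dict:
    """Slide a patch x patch window over a binary annotation mask (1 = inside a
    circled PEA region) and label each window positive when at least half of
    its pixels are circled, as described in Sec. III-C1. For temporal PEAs the
    same rule is applied to the window of the marked frame, and the positive
    label is attached to the cuboid covering the 10 extracted frames (Sec. III-C2)."""
    h, w = circled_mask.shape
    labels = {}
    # Stride is assumed equal to the patch size (non-overlapping windows).
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            window = circled_mask[y:y + patch, x:x + patch]
            labels[(y, x)] = bool(window.mean() >= 0.5)  # >= 50% circled pixels
    return labels
```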

Fig. 9: The LeNet-5 structure.

III-D Summary of the Database

The PEA265 database covers 6 types of PEAs, including 4 types of spatial PEAs (blurring, blocking, ringing and color bleeding) and 2 types of temporal PEAs (flickering and floating). Each type of PEA contains at least 60,000 image or video patches with positive and negative labels. Patches for three types of PEAs (ringing, color bleeding and flickering) are of size 32×32, and those for the other three (blurring, blocking and floating) are of size 72×72. The patches are stored in binary format, and the total data size is about 28 GB. Each PEA patch is indexed by its video name, frame number and coordinate position.
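For illustration, an index entry consistent with this description might look as follows; the field names and the key convention are hypothetical, since the paper only states which attributes index a patch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatchRecord:
    """Index entry for one PEA265 patch (attributes as described in Sec. III-D)."""
    video_name: str    # source CTC sequence, e.g. "BasketballDrill"
    frame_number: int  # first frame of the patch (10 frames for temporal PEAs)
    x: int             # top-left coordinate of the 32x32 or 72x72 window
    y: int
    pea_type: str      # one of the 6 PEA types
    label: int         # 1 = positive (PEA present), 0 = negative

    def key(self) -> str:
        # Hypothetical naming convention for locating the stored binary patch.
        return f"{self.video_name}_f{self.frame_number}_x{self.x}_y{self.y}_{self.pea_type}"
```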

IV CNN-based PEA Recognition

In this section, we utilize the PEA265 database to train a deep-learning-based PEA recognition model. We also propose two metrics, PEA pattern and PEA intensity, which can be further employed in vision-based video processing and coding.

IV-A Subjective recognition with CNN

We choose two popular CNN architectures, LeNet [34] and ResNeXt [35], in this study. For each type of PEA, we randomly select 50,000 ground-truth samples from the PEA265 database. These samples are further split into training and testing sets at a ratio of 75:25.
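A minimal sketch of this sampling and split, assuming the patches of one PEA type are available as NumPy arrays (the array names and the random seed are hypothetical):

```python
import numpy as np

def sample_and_split(patches, labels, n_samples=50_000, train_frac=0.75, seed=0):
    """Randomly draw n_samples ground-truth patches and split them 75:25 into
    training and testing sets, as described in Sec. IV-A."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(patches), size=n_samples, replace=False)
    n_train = int(train_frac * n_samples)
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return (patches[train_idx], labels[train_idx]), (patches[test_idx], labels[test_idx])
```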

PEAs            LeNet-5 (Training / Testing)   ResNeXt (Training / Testing)
Blurring        0.6833 / 0.6768                0.9352 / 0.8176
Blocking        0.7154 / 0.7162                0.9514 / 0.9281
Ringing         0.6946 / 0.6917                0.8524 / 0.8356
Color bleeding  0.7172 / 0.7200                0.8706 / 0.8494
Flickering      0.6572 / 0.6496                0.8108 / 0.8019
Floating        0.7096 / 0.7087                0.8228 / 0.8051
TABLE II: Training/testing recognition accuracy

IV-A1 LeNet-5 network

The LeNet architecture is a classic CNN classifier. In our work, we use eight layers (including the input), with the structure given in Fig. 9. The conv1 layer learns 20 convolution filters of size 5×5. We apply a ReLU activation function followed by 2×2 max-pooling in both the x and y directions with a stride of 2. The conv2 layer learns 50 convolution filters. Finally, a softmax classifier returns a list of probabilities, and the class label with the largest probability is chosen as the final classification. The input samples are of size 32×32 or 72×72 and are stored in binary format. To obtain a higher accuracy, we augment the training data by rotation, width scaling, height scaling, shear, zoom, horizontal flip and fill mode. After data augmentation, the accuracy improves by about 10%, to around 70%, as shown in Table II.
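A minimal sketch of a LeNet-5-style binary classifier consistent with this description (20 filters of size 5×5 in conv1, 2×2/stride-2 max-pooling, 50 filters in conv2, softmax over two classes) and of the listed augmentations. The framework choice (tf.keras), the fully connected layer size, the optimizer and the numeric augmentation ranges are assumptions, not values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def build_lenet5(patch_size: int = 32) -> tf.keras.Model:
    """LeNet-5-style binary PEA classifier as described in Sec. IV-A1."""
    model = models.Sequential([
        layers.Input(shape=(patch_size, patch_size, 1)),
        layers.Conv2D(20, kernel_size=5, activation="relu"),  # conv1: 20 filters, 5x5
        layers.MaxPooling2D(pool_size=2, strides=2),          # 2x2 max-pool, stride 2
        layers.Conv2D(50, kernel_size=5, activation="relu"),  # conv2: 50 filters
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        layers.Dense(500, activation="relu"),                 # assumed FC width
        layers.Dense(2, activation="softmax"),                # PEA present / absent
    ])
    model.compile(optimizer="adam",                           # optimizer not stated in the paper
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Augmentations listed in the text; the numeric ranges are assumptions.
augmenter = ImageDataGenerator(
    rotation_range=15, width_shift_range=0.1, height_shift_range=0.1,
    shear_range=0.1, zoom_range=0.1, horizontal_flip=True, fill_mode="nearest")
```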

Fig. 10: A block of ResNeXt with cardinality = 32.

IV-A2 ResNeXt network

ResNeXt [35] is a variant of ResNet [39] whose building block is shown in Fig. 10. This block is very similar to the Inception module [40]: both follow the split-transform-merge paradigm. Our models are realized in the form of Fig. 10. In the 3×3 layer of the first block of each stage, downsampling of conv3, conv4 and conv5 is performed by stride-2 convolutions, as suggested in [39]. SGD is used with a mini-batch size of 256, a momentum of 0.9 and a weight decay of 0.0001. The learning rate starts at 0.1 and is divided by 10 three times, following the schedule in [39]. The weight initialization of [39] is adopted, Batch Normalization (BN) [41] is applied right after each convolution, and ReLU is performed right after each BN.
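The following is a minimal PyTorch sketch of an aggregated-transformation (ResNeXt-style) bottleneck block with cardinality 32, together with the optimization settings stated above. The channel widths, network depth and learning-rate milestones are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNeXt bottleneck: 1x1 reduce -> grouped 3x3 (cardinality parallel branches)
    -> 1x1 expand, with BN after every convolution and ReLU after every BN,
    plus a residual shortcut."""
    def __init__(self, in_ch, bottleneck_ch, out_ch, cardinality=32, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, bottleneck_ch, 1, bias=False),
            nn.BatchNorm2d(bottleneck_ch), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, bottleneck_ch, 3, stride=stride, padding=1,
                      groups=cardinality, bias=False),   # 32 parallel transformations
            nn.BatchNorm2d(bottleneck_ch), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

def make_optimizer(model):
    """SGD settings stated in the text; the epoch milestones are assumptions."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[30, 60, 90], gamma=0.1)
    return opt, sched
```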

CNN           LeNet-5     ResNeXt
Elapsed time  1966m12s    655m17s
TABLE III: Elapsed time (m: minutes, s: seconds) of training

By training a recognition model for each type of PEA with LeNet and ResNeXt, we aim to predict whether or not a given type of PEA exists in an image/video patch. Note that we do not use a single multi-class classifier because PEAs are not mutually exclusive (different types of PEAs may coexist within one patch). Based on the two CNN architectures above, we individually train 6 PEA identification models. Let TP, FP, TN and FN denote the numbers of true positive, false positive, true negative and false negative samples, respectively; the training and testing accuracy is defined as Accuracy = (TP + TN) / (TP + FP + TN + FN). The cross-entropy loss function is adopted for training.

Table II lists the classification performance on the PEA265 datasets. For each individual PEA, the recognition performance of ResNeXt is significantly better than that of LeNet. For example, the blocking PEA recognition model based on ResNeXt yields a testing accuracy of 92.81%, nearly 20 percentage points higher than that of LeNet (71.62%). Similar results are observed for the other PEA recognition models.

Algorithms   Fig. 11 (b)   Fig. 11 (f)   3000 test images
Ref. [17]    96.10%        54.92%        65.17%
Proposed     95.85%        88.23%        85.46%
TABLE IV: Performance comparison of floating PEA recognition algorithms

Compared with LeNet, ResNeXt has more layers and can learn more complex, high-dimensional image features. ResNeXt is constructed by repeating a building block that aggregates a set of transformations with the same topology, so only a few hyper-parameters need to be set in this homogeneous, multi-branch architecture. Meanwhile, its bottleneck layers reduce the number of feature channels and thus the computation of each layer. As a result, the computational complexity is greatly reduced while the speed and accuracy of the algorithm improve.

Fig. 11: An example of floating PEA detection.

The elapsed training times of LeNet and ResNeXt are summarized in Table III. ResNeXt trains much faster than LeNet because of its bottleneck layers, although the training process still requires a large number of iterations and is relatively time-consuming.

(a) PEA pattern with spatial PEA(s)
(b) PEA pattern with temporal PEA(s)
Fig. 12: The PEA pattern of image patches.

IV-B Comparison with other benchmarks

To better illustrate the advantages of the proposed recognition models, we compare them with the floating PEA detection method in [17], in which low-level coding features are extracted to estimate the spatial distribution of floating. Fig. 11 (a) and (e) are two original frames, and Fig. 11 (b) and (f) are the corresponding compressed frames, coded by HEVC with Qp = 42, in which the visible floating regions are marked manually. Fig. 11 (c) is the floating map generated by [17], where black regions indicate floating artifacts, and Fig. 11 (d) is the result of the proposed PEA recognition model. In this case, both methods perform reasonably well in floating detection. However, the algorithm in [17] requires content-dependent parameter adjustment and does not generalize consistently: in Fig. 11 (g), for example, it fails to detect the actual floating region. Comparing Fig. 11 (g) with Fig. 11 (h), the proposed floating PEA recognition algorithm clearly performs better. The floating detection accuracy is given in Table IV. In addition, we randomly select 3000 test images; the comparison results on this set are given in the last column of Table IV. The proposed floating PEA recognition model consistently outperforms [17].

(a) A compressed frame
(b) Blocking artifact
(c) Blurring artifact
(d) Ringing artifact
(e) Color bleeding artifact
(f) Flickering artifact
(g) Floating artifact
(h) Combined artifacts
Fig. 13: The individual and overall PEA distributions of a frame.

IV-C The overall PEA intensity

By combining the 6 PEA recognition models, we obtain two hybrid PEA metrics: a local metric, namely the PEA pattern, and a holistic metric, namely the PEA intensity. A PEA pattern is represented as a 6-bin value, where each bin is a binary indicator of the existence of blurring, blocking, ringing, color bleeding, flickering and floating artifacts, respectively: a bin is set to 1 if the corresponding PEA exists and to 0 otherwise. Two examples are presented in Fig. 12. In Fig. 12 (a), blurring, blocking and ringing artifacts exist in the patch, so its PEA pattern is 111000; in Fig. 12 (b), only the floating artifact exists, so its PEA pattern is 000001. This pattern serves as a PEA feature vector of a video patch and can be further utilized in vision-based video processing. In addition, we summarize the distributions of all types of PEAs in Fig. 13. For a given video frame, the distributions of different PEAs differ from one another, and all types of PEAs are rarely observed simultaneously. Therefore, only a combination of PEAs, as in Fig. 13 (h), shows the overall impact of PEAs on visual quality. To quantify this overall impact, we introduce a new metric, the PEA intensity, defined as the fraction of positive bins (i.e. bins with value 1) within a patch. For example, the PEA patterns 111000 and 000111 have the same PEA intensity because both contain 3 positive bins.
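A minimal sketch of the PEA pattern and PEA intensity computations as defined above, together with the sequence-level average introduced in the next paragraph. The six per-PEA classifiers are assumed to be callables that return a binary decision per patch (a hypothetical interface, not the paper's API).

```python
# Bin order of the 6-bin PEA pattern (Sec. IV-C): blurring, blocking, ringing,
# color bleeding, flickering, floating.
PEA_TYPES = ["blurring", "blocking", "ringing", "color_bleeding", "flickering", "floating"]

def pea_pattern(patch, classifiers) -> str:
    """Return the 6-bin PEA pattern of a patch, e.g. '111000'.
    `classifiers` maps each PEA type to a binary detector (1 = PEA present)."""
    return "".join(str(int(classifiers[t](patch))) for t in PEA_TYPES)

def pea_intensity(pattern: str) -> float:
    """PEA intensity of a patch: fraction of positive bins in its pattern."""
    return pattern.count("1") / len(pattern)

def sequence_pea_intensity(patches, classifiers) -> float:
    """Sequence-level PEA intensity: average patch intensity over all
    non-overlapping patches of the sequence."""
    patterns = [pea_pattern(p, classifiers) for p in patches]
    return sum(pea_intensity(p) for p in patterns) / len(patterns)
```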

(a) Blocking intensity
(b) Blurring intensity
(c) Color Bleeding intensity
(d) Ringing intensity
(e) Flickering intensity
(f) Floating intensity
Fig. 14: The individual PEA intensity for each type of PEA.

For a video sequence, its PEA intensity is defined as the average PEA intensity over all non-overlapping patches. In Fig. 14, the overall PEA intensities of all CTC sequences are measured and presented. Several conclusions can be drawn. Firstly, the overall PEA intensity is, in general, positively correlated with the Qp value: for almost all types of PEAs and videos, the PEA intensity grows with a higher Qp. This highlights the importance of quantization and information loss in the generation mechanism of PEAs. As discussed above, the potential origins of spatial artifacts are interpreted as the loss of high-frequency signals, the loss of chrominance signals and inconsistent information loss across block boundaries, while temporal artifacts are likely produced by inconsistent information loss between frames. The fact that Qp influences PEA intensity is therefore compatible with these interpretations and provides guidance for more detailed explorations of the generation mechanism of PEAs.

Secondly, the PEA intensity is content-dependent, as it varies with the video content. For example, the sequences SlideEditing (1280×720, No. 22) and SlideShow (1280×720, No. 23) have lower PEA intensities in terms of blocking, blurring and floating, while more color bleeding, ringing and flickering artifacts are identified in them. The sequence Kimono (1920×1080, No. 5) has high intensities for almost all types of PEAs, while the sequence BQSquare (416×240, No. 15) has low intensities for almost all PEAs. This implies that video characteristics, including texture and motion, may affect the PEA intensity after compression, and it may also provide useful guidance for content-aware video coding optimization.

Thirdly, the frequency of PEAs differs across PEA types. In this database, the intensities of blocking, color bleeding and flickering are significant compared with the other PEAs, including blurring, ringing and floating. Furthermore, different types of PEAs affect visual quality differently: not all PEAs have the same impact on the HVS, and the perceived quality may be dominated by a subset of PEAs, as concluded in [18]. We leave it to future work to explore how PEA detections should be combined to best evaluate their impact on visual quality.

To further investigate the differences between spatial and temporal PEAs, we present the average PEA intensities of spatial and temporal artifacts in Fig. 15. The aforementioned conclusions can also be verified in this figure.

(a) Spatial PEA intensity
(b) Temporal PEA intensity
Fig. 15: The spatial and temporal PEA intensities on average.

V Conclusion

We construct PEA265, a first-of-its-kind large-scale subject-labelled database of PEAs produced by H.265/HEVC video compression. The database covers 6 spatial and temporal PEA types, namely blurring, blocking, ringing, color bleeding, flickering and floating, each with at least 60,000 positively or negatively labelled samples. Using the database, we train CNNs to recognize PEAs, and the results show that the state-of-the-art ResNeXt achieves high accuracy in PEA detection. Moreover, we define a PEA intensity measure to assess the overall severity of PEAs in compressed videos. This work will benefit the future development of video quality assessment algorithms, and can also be used to optimize hybrid video encoders for improved perceptual quality and to design perceptually motivated video encoding schemes.

References

  • [1] Cisco visual networking index : Forecast and methodology, 2016-2021. [Online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/complete-white-paper-c11-481360.html
  • [2] J. P. Allebach, “Human vision and image rendering: Is the story over, or is it just beginning,” in Proc. SPIE 3299, Human Vision and Electronic Imaging (HVEI) III, Jan. 1998, pp. 26–27.
  • [3] H.264: Advanced video coding for generic audiovisual services, ITU-T Rec, Mar. 2005.
  • [4] G. J. Sullivan, J. R. Ohm, and W. J. Han, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
  • [5] J. Bankoski and P. Wilkins, “Technical overview of vp8, an open source video codec for the web,” in Proc. IEEE International Conference on Multimedia and Expo (ICME), Jul. 2011, pp. 1–6.
  • [6] D. Mukherjee, J. Han, J. Bankoski, and R. Bultje, “The latest open-source video codec vp9 - An overview and preliminary results,” in Picture Coding Symposium (PCS), Feb. 2013, pp. 390–393.
  • [7] F. Liang, S. Ma, and W. Feng, “Overview of AVS video standard,” in IEEE International Conference on Multimedia and Expo (ICME), Jun. 2004, pp. 423–426.
  • [8] Z. He, L. Yu, X. Zheng, S. Ma, and Y. He, “Framework of AVS2-video coding,” in IEEE International Conference on Image Processing (ICIP), Sep. 2014, pp. 1515–1519.
  • [9] M. Yuen and H. R. Wu, “A survey of hybrid MC/DPCM/DCT video coding distortions,” Signal Process., vol. 70, no. 3, pp. 247–278, Oct. 1998.
  • [10] A. Leontaris, P. C. Cosman, and A. R. Reibman, “Quality evaluation of motion-compensated edge artifacts in compressed video,” IEEE Trans. Image Process., vol. 16, no. 4, pp. 943–956, Mar. 2007.
  • [11] A. Eden, “No-reference image quality analysis for compressed video sequences,” IEEE Trans. Broadcast., vol. 54, no. 3, pp. 691–697, Sep. 2008.
  • [12] K. Zhu, C. Li, V. Asari, and D. Saupe, “No-reference video quality assessment based on artifact measurement and statistical analysis,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 4, pp. 533–546, Apr. 2015.
  • [13] K. H. Dang, D. Le, and N. T. Dzung, “Efficient determination of disparity map from stereo images with modified sum of absolute differences (SAD) algorithm,” in Proc. International Conference on Advanced Technologies for Communications (ATC), Jan. 2014, pp. 657–660.
  • [14] H. Kibeya, N. Bahri, M. Ayed, and N. Masmoudi, “SAD and SSE implementation for HEVC encoder on DSP TMS320C6678,” in Proc. International Image Processing, Applications and Systems (IPAS), Mar. 2017, pp. 1–6.
  • [15] A. Horé and D. Ziou, “Image quality metrics: PSNR vs. SSIM,” in Proc. International Conference on Pattern Recognition (ICPR), Oct. 2010, pp. 2366–2369.
  • [16] S. Sakaida, Y. Sugito, H. Sakate, and A. Minezawa, “Video coding technology for 8k/4k era,” Inst. Electron. Inf. Commun. Eng., vol. 98, pp. 218–224, Mar. 2015.
  • [17] K. Zeng, T. S. Zhao, A. Rehman, and Z. Wang, “Characterizing perceptual artifacts in compressed video streams,” IST/SPIE Human Vision and Electronic Imaging XIX (invited paper), vol. 9014, no. 10, pp. 2458–2462, Feb. 2014.
  • [18] J. Xia, Y. Shi, K. Teunissen, and I. Heynderickx, “Perceivable artifacts in compressed video and their relation to video quality,” Signal Process-Image., vol. 24, no. 7, pp. 548–556, Aug. 2009.
  • [19] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, “Adaptive deblocking filter,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 614–619, Jul. 2003.
  • [20] S. B. Yoo, K. Choi, and J. B. Ra, “Blind post-processing for ringing and mosquito artifact reduction in coded videos,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 5, pp. 721–732, May 2014.
  • [21] F. X. Coudoux, M. Gazalet, and P. Corlay, “Reduction of color bleeding for 4:1:1 compressed video,” IEEE Trans. Broadcast., vol. 51, no. 4, pp. 538–542, Dec. 2005.
  • [22] S. Yang, Y. H. Hu, T. Q. Nguyen, and D. L. Tull, “Maximum-likelihood parameter estimation for image ringing-artifact removal,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 8, pp. 963–973, Sep. 2001.
  • [23] S. H. Oguz, Y. H. Hu, and T. Q. Nguyen, “Image coding ringing artifact-reduction using morphological post-filtering,” in Proc.IEEE Second Workshop on Multimedia Signal Processing (MMSP), Jan. 1999, pp. 628–633.
  • [24] H. S. Kong, A. Vetro, and H. Sun, “Edge map guided adaptive postfilter for blocking and ringing artifacts removal,” in International Symposium on Circuits and Systems (ISCAS), Sep. 2004, pp. 929–932.
  • [25] J. X. Yang and H. R. Wu, “Robust filtering technique for reduction of temporal fluctuation in H.264 video sequences,” IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 3, pp. 458–462, Mar. 2010.
  • [26] Y. C. Gong, S. Wan, K. Yang, F. Z. Yang, and L. Cui, “An efficient algorithm to eliminate temporal pumping artifact in video coding with hierarchical prediction structure,” J. VIS. COMMUN. IMAGE. R, vol. 25, no. 7, pp. 1528–1542, Oct. 2014.
  • [27] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
  • [28] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Convolutional neural network committees for handwritten character classification,” in proc. International Conference on Document Analysis and Recognition (ICDAR), Nov. 2011, pp. 1135–1139.
  • [29] C. Dong, Y. Deng, C. L. Chang, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in Proc. IEEE International Conference on Computer Vision (ICCV), Feb. 2016, pp. 576–584.
  • [30] C. Dong, C. L. Chen, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Proc. European Conference on Computer Vision (ECCV), Jan. 2014, pp. 184–199.
  • [31] Y. Y. Dai, D. Liu, and F. Wu, “A convolutional neural network approach for post-processing in HEVC Intra Coding,” in International Conference on Multimedia Modeling (MMM), Jan. 2017, pp. 28–39.
  • [32] W. S. Park and M. Kim, “CNN-based in-loop filtering for coding efficiency improvement,” in Proc. Image, Video, and Multidimensional Signal Processing (IVMSP), Aug. 2016, pp. 1–5.
  • [33] N. Yan, D. Liu, H. Li, and F. Wu, “A convolutional neural network approach for half-pel interpolation in video coding,” in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Mar. 2017, pp. 1–4.
  • [34] Z. H. Zhao, S. P. Yang, and M. Qiang, “License plate character recognition based on convolutional neural network lenet-5,” J. Syst. Simul., vol. 22, no. 3, pp. 638–641, Mar. 2010.
  • [35] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nov. 2017, pp. 5987–5995.
  • [36] F. Bossen, “HM 10 common test conditions and software reference configurations,” in Proc. Joint Collaborative Team on Video Coding Meeting (JCT-VC), Jan. 2013, pp. 1–3.
  • [37] (2016) Fraunhofer institute for telecommunications, heinrich hertz institute. High Efficiency Video Coding (HEVC) reference software HM. [Online]. Available: https://hevc.hhi.fraunhofer.de/
  • [38] Methodology for the subjective assessment of the quality of television pictures, Recommendation ITU-R BT. 500-13, Geneva, Switzerland:International Telecommunication Union, 2012.
  • [39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Aug. 2016, pp. 770–778.
  • [40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Dec. 2016, pp. 2818–2826.
  • [41] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. International Conference on Machine Learning (ICML), Mar. 2015, pp. 1–10.