
Quality-Driven Dynamic VVC Frame Partitioning for Efficient Parallel Processing

12/29/2020
by   Thomas Amestoy, et al.

VVC is the next generation video coding standard, offering coding capability beyond the HEVC standard. The high computational complexity of the latest video coding standards requires high-level parallelism techniques in order to achieve real-time and low-latency encoding and decoding. HEVC and VVC include tile grid partitioning, which allows rectangular regions of a frame to be processed simultaneously by independent threads. The tile grid may be further partitioned into a horizontal sub-grid of Rectangular Slices (RSs), increasing the partitioning flexibility. The dynamic Tile and Rectangular Slice (TRS) partitioning solution proposed in this paper benefits from this flexibility. The TRS partitioning is carried out at the frame level, taking into account both the spatial texture of the content and the encoding times of previously encoded frames. The proposed solution searches for the partitioning configuration that offers the best trade-off between multi-thread encoding time and encoding quality loss. Experiments show that the proposed solution, compared to uniform TRS partitioning, significantly decreases multi-thread encoding time, with slightly better encoding quality.


1 Introduction

In recent years, the democratization of multimedia applications, coupled with the emergence of high resolutions and new video formats (8K, 360°), has led to a drastic increase in the volume of exchanged video content [1]. This increasing need for higher compression rates prompted the Joint Video Experts Team (JVET) to develop a new video coding standard called Versatile Video Coding (VVC), with coding capability beyond High Efficiency Video Coding (HEVC) [2]. The bit-rate savings brought by VVC [3] are however coupled with a considerable increase in encoding computational complexity, estimated at 10 and 27 times the HEVC computational complexity in Inter and Intra coding configurations, respectively [4]. Intense parallel processing will therefore be mandatory for VVC codec implementations to achieve real-time encoding and decoding.

Video parallel processing techniques essentially operate at three levels of parallelism: data level, frame level and high level. Data-level parallelism techniques are applied to elementary operations, and no encoding quality is lost compared to sequential encoding. They include, among others, techniques relying on Single Instruction Multiple Data (SIMD) architectures [5]. Frame-level and high-level parallelism operate at thread level. Frame-level techniques encode a group of frames in parallel, each thread being assigned to a single frame [6]. The encoding time of a single frame is not reduced with frame-level techniques, i.e. the latency is not reduced. In high-level parallelism techniques, the threads operate on contiguous regions of the frame, such as tiles or slices [7]. Tiles and slices are independently encodable and decodable, allowing several threads to process the same frame simultaneously. These techniques improve both speed-up and latency equally. However, enabling independent processing of frame regions breaks prediction dependencies across region boundaries and requires the entropy coding state to be reinitialized for each region. These restrictions lead to an encoding quality loss compared to the encoding of the non-partitioned sequence. The loss grows with the number of independent regions in the frame, as measured for HEVC by Chi et al. [8].

In the HEVC and VVC standards, only grid-shaped tile partitioning is allowed, as shown in Fig. 1(a). The tiles are delimited by the continuous black lines, and the dashed lines correspond to the Coding Tree Unit (CTU) boundaries. The tile partitioning forms a 2x2 grid and the tiles are labeled from 0 to 3. In order to increase the partitioning opportunities, VVC combines the tile partitioning with the new concept of Rectangular Slice (RS). The partitioning combining tiles and RSs is hereafter called Tile and Rectangular Slice (TRS) partitioning. Fig. 1(b) shows the TRS partitioning of a frame into the same 2x2 tile grid as in Fig. 1(a), combined with 4 RSs. The RSs are delimited by the continuous red lines and are labeled from A to D. An RS may contain one or several complete tiles, together forming a rectangular region of the frame. Moreover, as Fig. 1(b) also illustrates, an RS may be a rectangular sub-region of a tile, composed of a number of complete and consecutive CTU rows of that tile. In this latter case, the RSs further partition the tile grid into a horizontal sub-grid, greatly improving the flexibility of the tile grid partitioning.

(a) Grid of 4 tiles. Tiles labeled from 0 to 3.
(b) Tiles combined with 4 RSs. RSs labeled from A to D.
Figure 1: Illustration of tile partitioning in HEVC and TRS partitioning in VVC.

The partitioning of a frame into tiles and RSs raises two distinct optimization issues: on the one hand, the minimization of multi-thread encoding time (or speed-up maximization); on the other hand, the minimization of the encoding quality loss caused by the partitioning. In the literature, both issues have been addressed for HEVC tile partitioning. The multi-thread encoding time minimization is investigated by Storch et al. [9] and Koziri et al. [10]. They observe that the encoding time does not vary significantly from a CTU to the co-located CTU in the closest temporal frame. Relying on this temporal stability, the authors use the encoding times of previous frames to determine the tile partitioning that minimizes the multi-thread encoding time. In [11], the time estimator of each CTU is computed from statistics of the previously encoded frame (the number of Skip, Inter and Intra blocks, for instance). The authors of [12, 13] minimize the encoding quality loss induced by the tile partitioning by analyzing the CTU luminance variances of the frame. The technique proposed in [14] focuses on the particular case of a variable number of available cores: the encoding loss is lowered in some cases by setting a number of tiles lower than the number of available cores. However, these related works on HEVC tile partitioning address the minimization of encoding time and of encoding quality loss only independently.

In this work, we take advantage of the increased flexibility offered by RSs in VVC to propose a dynamic TRS partitioning solution under the VVC Test Model software, version 6.2 (VTM-6.2). Prior to the encoding of a frame, the TRS partitioning stage uses the spatial information of the frame and the encoding times of previously encoded CTUs to optimize the TRS partitioning. The proposed solution optimizes a trade-off between encoding time and encoding quality, which is a novel approach compared to related works. Moreover, to the best of our knowledge, this is the first work that implements a multi-thread VVC reference encoder, generating baseline results for future related works.

The rest of the paper is organized as follows. Section 2 describes the proposed solution, which establishes the trade-off between encoding time and encoding quality. Section 3 presents and analyzes the experimental results on VTM-6.2. Finally, Section 4 concludes this paper.

2 Dynamic Frame Partitioning for Parallel Processing

As mentioned in Section 1, the proposed TRS partitioning solution simultaneously addresses the minimization of encoding time and the limitation of encoding quality loss. This section first describes the encoding time minimization of the current frame, using the times of previously encoded co-located CTUs. The second subsection introduces the clustering of spatial information into RSs to limit the encoding quality loss. The last subsection describes the proposed solution, which establishes a trade-off between encoding time and encoding quality.

2.1 Encoding Time Minimization

Let $P$ be the partitioning of the current frame into $N$ RSs: $P = \{RS_1, \dots, RS_N\}$. In the following, $T(P)$ is the encoding time of the current frame partitioned with $P$ and simultaneously processed by $N$ threads in parallel (each thread being entirely dedicated to the encoding of a single RS). In this case, $T(P)$ is equal to the time required by the slowest thread to encode its RS. Eq. 1 formally establishes $T(P)$, with $t_c$ the encoding time of CTU $c$ and $T(RS_i)$ the encoding time of the RS $RS_i$.

$$T(P) = \max_{i \in \{1, \dots, N\}} T(RS_i), \quad \text{with} \quad T(RS_i) = \sum_{c \in RS_i} t_c \tag{1}$$

Eq. 1 shows that $T(P)$ is directly determined by the CTU encoding times $t_c$. However, these values are not available during the TRS partitioning stage, since this stage takes place before the encoding of the current frame. To overcome this lack of information, the values $t_c$ are replaced during the TRS partitioning stage by estimated values, noted $\hat{t}_c$.
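The relation between CTU times and the multi-thread frame time can be illustrated with a short Python sketch. The rectangle representation, the function names and the toy time map below are assumptions made for illustration only; they are not taken from VTM.

```python
from typing import List, Tuple

# A rectangular slice as (first_col, first_row, width, height), in CTU units.
# This representation is a hypothetical choice made for this sketch.
Rect = Tuple[int, int, int, int]


def rs_time(ctu_times: List[List[float]], rs: Rect) -> float:
    """Sum the encoding times of all CTUs covered by one rectangular slice."""
    x0, y0, w, h = rs
    return sum(ctu_times[y][x] for y in range(y0, y0 + h) for x in range(x0, x0 + w))


def frame_time(ctu_times: List[List[float]], partitioning: List[Rect]) -> float:
    """Multi-thread frame time: with one thread per RS, the slowest RS decides."""
    return max(rs_time(ctu_times, rs) for rs in partitioning)


# Toy 4x4 CTU time map (arbitrary units); the top-left area is the most costly.
times = [
    [4.0, 4.0, 1.0, 1.0],
    [4.0, 4.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
uniform_grid = [(0, 0, 2, 2), (2, 0, 2, 2), (0, 2, 2, 2), (2, 2, 2, 2)]
row_bands = [(0, 0, 4, 1), (0, 1, 4, 1), (0, 2, 4, 1), (0, 3, 4, 1)]

print(frame_time(times, uniform_grid))  # the costly top-left RS dominates
print(frame_time(times, row_bands))     # the load is spread more evenly
```

On this toy map, the 2x2 grid concentrates all costly CTUs in one RS, whereas the row bands share them between two RSs, which lowers the time of the slowest thread.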

Several related works [9, 10] define the estimated CTU time $\hat{t}_c$ as the encoding time of the co-located CTU (located at the same spatial coordinates) in the closest previously encoded temporal frame. This choice is motivated by the temporal continuity of video content. In the Random Access (RA) configuration, the authors of [15] have shown that the CTU encoding time is more correlated with the time of the co-located CTU in the co-TL frame than with that of the co-located CTU in the closest temporal frame. The co-TL frame refers to the previously encoded frame belonging to the same temporal layer. This is explained by the coding parameters shared by frames of the same temporal layer in the group of pictures structure defined by the Common Test Conditions (CTC) [16]. Following the results of [15], the selected estimator $\hat{t}_c$ is defined as the encoding time of the co-located CTU in the co-TL frame. The encoding time minimization technique consists in searching for the TRS partitioning $P$ that minimizes the estimated frame time $\hat{T}(P)$, computed with the $\hat{t}_c$ values as input.
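A minimal sketch of such a co-TL estimator is given below, assuming that the encoder can report a per-CTU time map and a temporal layer index for each encoded frame. The class name, its methods and the fallback to the most recently encoded frame when a layer has no history yet are hypothetical choices of this sketch, not part of the described solution.

```python
from typing import Dict, List, Optional

TimeMap = List[List[float]]  # measured per-CTU encoding times of one frame


class CoTLEstimator:
    """Keeps the latest measured CTU time map of each temporal layer."""

    def __init__(self) -> None:
        self._latest_by_layer: Dict[int, TimeMap] = {}
        self._latest_any: Optional[TimeMap] = None

    def record(self, temporal_layer: int, measured: TimeMap) -> None:
        """Store the CTU times measured while encoding a frame of this layer."""
        self._latest_by_layer[temporal_layer] = measured
        self._latest_any = measured

    def estimate(self, temporal_layer: int) -> Optional[TimeMap]:
        """Return the co-located CTU times of the co-TL frame.

        Assumption of this sketch: if the layer has no history yet, fall back
        to the most recently encoded frame (None if nothing was encoded).
        """
        return self._latest_by_layer.get(temporal_layer, self._latest_any)


estimator = CoTLEstimator()
estimator.record(0, [[5.0, 5.0], [4.0, 4.0]])  # frame of temporal layer 0
estimator.record(2, [[1.0, 1.5], [1.0, 0.5]])  # frame of temporal layer 2
print(estimator.estimate(0))  # times of the last layer-0 frame
print(estimator.estimate(1))  # no layer-1 history: latest frame is used
```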

2.2 Limitation of Encoding Quality Losses

Figure 2: TRS partitioning of BQTerrace frame #4, computed with slice clustering.

As mentioned in Section 1, prediction dependencies across RS boundaries are disabled and the entropy coding state is reinitialized at each RS. In order to limit the encoding quality loss induced by these restrictions, the optimal TRS partitioning gathers similar spatial information inside the same RS. This corresponds to a K-means clustering [17] of the spatial information into the RSs, hereafter called RS clustering. The RS clustering searches for the TRS partitioning that minimizes the sum of the luminance variances over all RSs. Eq. 2 computes this partitioning $P_{clust}$, where $l$ is the value of a luminance sample and $\mu_i$ is the mean of the luminance samples of $RS_i$.

$$P_{clust} = \arg\min_{P} \sum_{i=1}^{N} \sum_{l \in RS_i} (l - \mu_i)^2 \tag{2}$$

Fig. 2 shows the partitioning into 8 RSs obtained by solving Eq. 2 for frame #4 of the sequence BQTerrace. In Fig. 2, regions of the frame with similar spatial information tend to be clustered into the same RS. The dark water of the river is almost entirely contained in RSs 6 and 7, and the light homogeneous regions of the frame are mainly included in RSs 0, 3 and 5. On the other hand, RSs 1, 2 and 4 contain the regions with more complex spatial information.
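The RS clustering criterion can be sketched in Python as follows. The cost used here is the K-means objective, i.e. the sum over all RSs of the squared deviations of the luminance samples from their RS mean; taking this as the exact form of the criterion is an assumption, as are the rectangle representation and the toy frame.

```python
from typing import List, Tuple

# A rectangular slice as (x0, y0, width, height), in luminance samples.
# Hypothetical representation chosen for this sketch.
Rect = Tuple[int, int, int, int]


def rs_clustering_cost(luma: List[List[int]], partitioning: List[Rect]) -> float:
    """Sum, over all RSs, of squared deviations from the RS luminance mean."""
    total = 0.0
    for x0, y0, w, h in partitioning:
        samples = [luma[y][x] for y in range(y0, y0 + h) for x in range(x0, x0 + w)]
        mean = sum(samples) / len(samples)
        total += sum((s - mean) ** 2 for s in samples)
    return total


# Toy 4x4 frame: dark upper half, bright lower half.
frame = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]
horizontal_split = [(0, 0, 4, 2), (0, 2, 4, 2)]  # follows the content boundary
vertical_split = [(0, 0, 2, 4), (2, 0, 2, 4)]    # mixes dark and bright samples

print(rs_clustering_cost(frame, horizontal_split))
print(rs_clustering_cost(frame, vertical_split))
```

The split that follows the content boundary yields homogeneous RSs and therefore a much lower cost than the split that mixes dark and bright samples inside each RS.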

2.3 Two Steps Slice Partitioning Search

The TRS partitioning in Fig. 2 gathers similar spatial information inside the same RS, but is far from optimal regarding encoding time minimization. For instance, the encoding of RS #1 is 12 times slower than the encoding of RS #3, due among other factors to the greater area and spatial complexity of RS #1. The encoding time of the considered frame is therefore sub-optimal, since it is dictated by the high encoding time of RS #1. In order to reduce such imbalances between RS encoding times, the proposed solution combines the RS clustering (Section 2.2) with the encoding time minimization technique (Section 2.1).

The proposed solution is represented as a flowchart in Fig. 3. The TRS partitioning stage, enclosed in the blue dashed box, is applied prior to the parallel encoding of the current frame, enclosed in the red dashed box. The TRS partitioning stage is divided into 2 distinct steps. The first step is called the encoding time minimization step. It computes the minimum estimated encoding time, defined by Eq. 3 and noted $\hat{T}_{min}$.

$$\hat{T}_{min} = \min_{P} \hat{T}(P) \tag{3}$$

Here $\hat{T}(P)$ is the estimated multi-thread encoding time of the current frame partitioned with $P$. The encoding time minimization step takes the CTU times of the co-TL frame as input.

Figure 3: Proposed solution flowchart.

The second step of the TRS partitioning stage computes the RS clustering of the current frame under an encoding time constraint. This step takes as inputs the minimum time $\hat{T}_{min}$ estimated during the previous step, the luminance samples of the current frame, and a Lagrangian parameter $\lambda$ that manages the trade-off between encoding time and encoding quality. The possible values for the estimated time $\hat{T}(P)$ of a candidate partitioning $P$ are bounded by Eq. 4.

$$\hat{T}_{min} \leq \hat{T}(P) \leq (1 + \lambda)\,\hat{T}_{min} \tag{4}$$

When $\lambda = 0$, only the partitioning that minimizes the estimated time is considered, since $\hat{T}(P) = \hat{T}_{min}$. When $\lambda$ increases, more partitioning opportunities are offered to the RS clustering, and therefore a higher weight is given to encoding quality compared to encoding time minimization. The parameter $\lambda$ is therefore a means for the encoder to manage the trade-off according to its requirements.

The aim of this paper is to show the relevance of a solution combining the 2 complementary steps previously presented. For this reason, a near-exhaustive search is conducted to compute both the minimum estimated encoding time and the RS clustering. As shown in Fig. 3, the only constraint given to the search algorithm involves $A_{min}$ and $A_{max}$, the areas of the smallest and the largest RS, respectively, together with a constant. This constant is fixed in this work in order to contain the search complexity. The choice of less complex heuristics for the TRS partitioning stage is a distinct issue that will be part of future works. The global complexity overhead induced by the TRS partitioning stage is nonetheless measured and discussed further in this paper.
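The two-step search can be sketched in Python under strong simplifying assumptions. The sketch only considers partitionings made of horizontal bands of consecutive CTU rows, ignores the constraint on RS areas, and assumes that the time constraint of the second step takes the form "estimated time at most (1 + lambda) times the minimum estimated time". All names and toy values are hypothetical.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple

# One RS per band, each band given as (first_row, last_row_exclusive).
Bands = Tuple[Tuple[int, int], ...]


def band_partitionings(n_rows: int, n_rs: int) -> Iterator[Bands]:
    """Yield every split of n_rows consecutive rows into n_rs non-empty bands."""
    for cuts in combinations(range(1, n_rows), n_rs - 1):
        edges = (0, *cuts, n_rows)
        yield tuple((edges[i], edges[i + 1]) for i in range(n_rs))


def estimated_time(row_times: Sequence[float], bands: Bands) -> float:
    """Estimated multi-thread time: the slowest band decides."""
    return max(sum(row_times[a:b]) for a, b in bands)


def clustering_cost(row_luma: Sequence[Sequence[float]], bands: Bands) -> float:
    """Sum over bands of squared deviations from the band luminance mean."""
    cost = 0.0
    for a, b in bands:
        samples = [s for row in row_luma[a:b] for s in row]
        mean = sum(samples) / len(samples)
        cost += sum((s - mean) ** 2 for s in samples)
    return cost


def two_step_search(row_times, row_luma, n_rs: int, lam: float) -> Bands:
    candidates = list(band_partitionings(len(row_times), n_rs))
    # Step 1: minimum estimated encoding time over all candidates.
    t_min = min(estimated_time(row_times, p) for p in candidates)
    # Step 2: RS clustering restricted to candidates within the time budget
    # (assumed form of the constraint: T <= (1 + lam) * t_min).
    budget = (1.0 + lam) * t_min + 1e-12
    admissible = [p for p in candidates if estimated_time(row_times, p) <= budget]
    return min(admissible, key=lambda p: clustering_cost(row_luma, p))


# Toy frame: 6 CTU rows, estimated time per row and two luminance samples per row.
row_times = [2.0, 2.0, 1.0, 1.0, 1.0, 1.0]
row_luma = [[10, 10]] * 3 + [[200, 200]] * 3

print(two_step_search(row_times, row_luma, n_rs=2, lam=0.0))  # time-balanced split
print(two_step_search(row_times, row_luma, n_rs=2, lam=0.3))  # content-aligned split
```

With a zero parameter the search returns the time-balanced split, while a larger value admits the slightly slower split that follows the content boundary, which mirrors the trade-off described above.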

3 Experimental Results

This section presents the experimental setup, as well as the performance of the proposed TRS partitioning solution.

3.1 Experimental Setup

The following experiments are conducted with the VTM-6.2 software, built with the gcc compiler version 7.4.0, under Linux version 4.15.0-74-generic as distributed in Ubuntu-18.04.1. The platform is composed of Intel(R) Xeon(R) E5-2690 v3 CPUs clocked at 2.60 GHz, each of them featuring 12 cores. Each CPU has 768 KB of L1 cache, 3 MB of L2 cache and 30 MB of L3 cache.

The high-level parallelism structures included in the VVC standard make it possible to tackle the complexity increase on multi-core processors. This complexity increase is mainly critical for high-resolution video sequences. For this reason, the test sequences selected in this work contain 4 Ultra High Definition (UHD) and 5 Full High Definition (FHD) sequences included in the CTC [16]: CatRobot1, DaylightRoad2, FoodMarket4, Tango2 (UHD), and BQTerrace, Cactus, MarketPlace, RitualDance (FHD). The test sequences are encoded under the RA configuration at four Quantization Parameter (QP) values: 22, 27, 32, 37. The performance of our TRS partitioning solution is assessed by measuring the trade-off between the encoding quality, using the Bjøntegaard Delta rate (BD-rate) [18], and the multi-thread speed-up $\sigma$, defined by Eq. 5.

$$\sigma = \frac{T_O}{T_R} \tag{5}$$

$T_O$ and $T_R$ are the original time (encoded with 1 RS and 1 single thread) and the reduced time (encoded with $N$ RSs and $N$ threads) spent to encode the video sequence, respectively. The overhead induced by the TRS partitioning stage is also measured, as a percentage of the encoding time.

3.2 Performance of the Proposed Solution

The theoretical upper bound in terms of speed-up for the proposed solution, noted $\sigma_{max}$, is computed with Amdahl's law [19]. Let $S$ be the sequential part of an application, expressed as a fraction of its execution time. The upper bound obtainable with $N$ threads is expressed by Eq. 6.

$$\sigma_{max}(N) = \frac{1}{S + \dfrac{1 - S}{N}} \tag{6}$$

In our case, the sequential portion of the VTM-6.2 encoder contains the data initialization, entropy coding, in-loop filtering and bitstream writing stages. The share of the encoding time that these stages represent is averaged across test sequences and QP values, and Eq. 6 is then evaluated for each tested number of threads to obtain the corresponding upper bounds.
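As an illustration of how the measured speed-up and the Amdahl upper bound are computed, a minimal Python sketch follows. The sequential fraction used in the example is a hypothetical value chosen for illustration; it is not the fraction measured on VTM-6.2, and the thread counts are arbitrary.

```python
def speed_up(t_original: float, t_reduced: float) -> float:
    """Speed-up: single-thread encoding time over multi-thread encoding time."""
    return t_original / t_reduced


def amdahl_upper_bound(sequential_fraction: float, n_threads: int) -> float:
    """Amdahl's law: best achievable speed-up with n_threads threads."""
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must lie in [0, 1]")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n_threads)


# Hypothetical sequential fraction, NOT the value measured in the paper.
S_EXAMPLE = 0.05
for n in (2, 4, 8):
    print(n, round(amdahl_upper_bound(S_EXAMPLE, n), 2))
```

The bound equals the number of threads only when the sequential fraction is zero, and it saturates at the inverse of the sequential fraction as the number of threads grows.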

As mentioned in Section 2.3, the Lagrangian parameter $\lambda$ manages the trade-off between encoding quality and encoding time minimization induced by the TRS partitioning. Three values of $\lambda$ (0, 0.1 and 0.3) are tested, and the one offering the best trade-off is selected according to the number of threads and the resolution. Table 1 presents the average results obtained with the selected $\lambda$ values, according to the resolution and the number of threads $N$. Moreover, the results of the uniform TRS partitioning applied to the test sequences are also presented, in order to evaluate the performance of the proposed solution. The uniform TRS partitioning is a usual and straightforward technique that partitions the frame into a grid of RSs of the same dimensions.
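A uniform partitioning of this kind can be sketched as follows. The rule used to distribute the remaining CTU rows and columns when the frame size is not a multiple of the grid size is a simple choice made for this illustration; it is not claimed to match the spacing rule of the standard or of VTM. The example assumes 128x128 CTUs on a 1920x1080 frame, i.e. 15 CTU columns and 9 CTU rows.

```python
from typing import List, Tuple

# A rectangular slice as (first_col, first_row, width, height), in CTU units.
Rect = Tuple[int, int, int, int]


def split_evenly(length: int, parts: int) -> List[int]:
    """Sizes of `parts` segments covering `length`, differing by at most one."""
    base, extra = divmod(length, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def uniform_partitioning(ctu_cols: int, ctu_rows: int,
                         grid_cols: int, grid_rows: int) -> List[Rect]:
    """Partition the CTU grid into grid_cols x grid_rows near-equal rectangles."""
    rects: List[Rect] = []
    y = 0
    for h in split_evenly(ctu_rows, grid_rows):
        x = 0
        for w in split_evenly(ctu_cols, grid_cols):
            rects.append((x, y, w, h))
            x += w
        y += h
    return rects


# Assumed example: 15x9 CTUs split into a 2x2 grid of RSs.
for rect in uniform_partitioning(15, 9, 2, 2):
    print(rect)
```

Such a grid ignores both the content and the CTU encoding times, which is why it serves as the baseline in the comparison.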

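| Metric | FHD, Unif | FHD, Proposed | UHD, Unif | UHD, Proposed |
|---|---|---|---|---|
| BD-rate (%) | 1.62 | 1.57 | 1.31 | 1.27 |
| Speed-up | 2.68 | 3.10 | 2.91 | 3.27 |
| Overhead (%) | - | 0.0 | - | 0.0 |
| BD-rate (%) | 2.69 | 2.80 | 2.39 | 2.33 |
| Speed-up | 4.27 | 5.07 | 4.55 | 5.34 |
| Overhead (%) | - | 0.01 | - | 0.08 |
| BD-rate (%) | 4.31 | 3.90 | 3.26 | 3.20 |
| Speed-up | 5.57 | 6.44 | 6.13 | 7.09 |
| Overhead (%) | - | 0.54 | - | 1.84 |

Table 1: Average speed-up, BD-rate and overhead obtained by both uniform and proposed TRS partitioning, according to the resolution and the number of threads (one group of three rows per tested number of threads).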

Table 1 shows that the proposed TRS partitioning solution achieves a better speed-up than uniform TRS partitioning, regardless of the resolution and the number of threads. The speed-up increase ranges from 0.36 to 0.96, both extremes being obtained for UHD content (from 2.91 to 3.27 and from 6.13 to 7.09, respectively). The proposed TRS partitioning solution therefore significantly reduces the distance to the upper bounds computed with Amdahl's law, compared to uniform TRS partitioning. This significant increase demonstrates the efficiency of the encoding time minimization step presented in Section 2.1. It is important to note that the encoding time of every frame is reduced. Therefore, both speed-up and latency are improved equally by the proposed solution.

In terms of BD-rate, the results of the proposed solution with the selected $\lambda$ values are slightly better (by around 0.05%) than those of uniform TRS partitioning. Two exceptions are however noticeable, both for FHD content. The BD-rate decrease is substantial in one configuration (3.90% against 4.31%), and the only case in which the BD-rate is slightly higher is another configuration (2.80% against 2.69%). The related works in HEVC minimizing the BD-rate [12], [15] also report average BD-rate decreases with 8 threads on FHD and UHD content. Our results in terms of BD-rate are close to those of the previously mentioned works, even though these works minimize the BD-rate without taking the speed-up optimization into consideration.

The conclusion drawn from Table 1 is that the proposed solution is able to keep the BD-rate increase close to that of uniform RS partitioning. The variation of the $\lambda$ value is however not sufficient to decrease the BD-rate significantly, except in one FHD configuration. On the other hand, the proposed solution is highly effective at increasing the speed-up offered by the TRS partitioning in VVC. Regarding the overhead, half of it is induced by the encoding time minimization step and half by the encoding quality loss limitation step. The overhead is negligible in the first two thread configurations of Table 1 (at most 0.08%). In the last configuration it becomes larger (0.54% for FHD and 1.84% for UHD content) due to the almost exhaustive search implemented (see Section 2.3). We are confident that the investigation of simple heuristics in future works will greatly reduce the overhead without degrading the results presented in Table 1.

8 threads, UHD sequences

| Sequence | BD-rate (%), lower λ | Speed-up, lower λ | BD-rate (%), higher λ | Speed-up, higher λ |
|---|---|---|---|---|
| CatRobot1 | 1.38 | 5.24 | 1.14 | 5.19 |
| DaylightRoad | 1.82 | 5.79 | 1.70 | 5.70 |
| FoodMarket | 4.09 | 5.16 | 3.85 | 5.10 |
| Tango2 | 2.67 | 5.54 | 2.61 | 5.40 |
| Average | 2.49 | 5.43 | 2.33 | 5.34 |

Table 2: Proposed solution with two values of $\lambda$, encoded with 8 threads, according to the UHD sequence.

Table 2 shows the performance of the proposed solution for two values of $\lambda$ when running with 8 threads, according to the UHD sequence. As explained in Section 2.3, the higher $\lambda$, the more importance is given to encoding quality with regard to the speed-up. The results of Table 2 are consistent with this explanation. Indeed, for every sequence the proposed solution with the higher $\lambda$ value achieves a better BD-rate but a lower speed-up than with the lower $\lambda$ value. On average, the BD-rate is 0.16% better when selecting the higher $\lambda$ value, without significantly degrading the speed-up (-0.09). The results are particularly noticeable for the sequence FoodMarket. For this sequence, the BD-rate is 0.24% better and the speed-up only decreases by 0.06 when selecting the higher $\lambda$ value.
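The per-sequence differences discussed above can be recomputed from the values of Table 2 with a few lines of Python. The assignment of the first pair of columns to the lower parameter value and of the second pair to the higher one is inferred from the text, and the variable names are hypothetical. Since the inputs are already rounded to two decimals, recomputed averages may differ from the reported ones in the last digit.

```python
# (BD-rate in %, speed-up) per UHD sequence, copied from Table 2.
# Column assignment (lower vs higher lambda) is inferred from the text.
lower_lambda = {
    "CatRobot1": (1.38, 5.24),
    "DaylightRoad": (1.82, 5.79),
    "FoodMarket": (4.09, 5.16),
    "Tango2": (2.67, 5.54),
}
higher_lambda = {
    "CatRobot1": (1.14, 5.19),
    "DaylightRoad": (1.70, 5.70),
    "FoodMarket": (3.85, 5.10),
    "Tango2": (2.61, 5.40),
}


def deltas(sequence: str):
    """Return (BD-rate gain, speed-up loss) when moving to the higher lambda."""
    bd_low, su_low = lower_lambda[sequence]
    bd_high, su_high = higher_lambda[sequence]
    return bd_low - bd_high, su_low - su_high


for name in lower_lambda:
    bd_gain, su_loss = deltas(name)
    print(f"{name:13s} BD-rate gain {bd_gain:.2f}  speed-up loss {su_loss:.2f}")

mean_bd_gain = sum(deltas(n)[0] for n in lower_lambda) / len(lower_lambda)
mean_su_loss = sum(deltas(n)[1] for n in lower_lambda) / len(lower_lambda)
print(f"mean BD-rate gain {mean_bd_gain:.3f}, mean speed-up loss {mean_su_loss:.3f}")
```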

4 Conclusion

In this paper, a dynamic TRS partitioning is proposed for the next generation video coding standard VVC. The proposed solution combines two techniques that minimize multi-thread encoding time and encoding quality loss, respectively. A Lagrangian parameter is applied, making it possible to select a trade-off between encoding time and encoding quality. The experiments show that the proposed solution significantly decreases multi-thread encoding time, with slightly better encoding quality, compared to uniform RS partitioning. Future works will focus, among other points, on the improvement of the CTU time estimator used in the encoding time minimization step. Instead of simply relying on the co-located CTU times of the co-TL frame, future solutions will rely on CTUs deduced from motion information. The investigation of lightweight heuristics for the TRS partitioning stage will also be part of future works. We are confident they will drastically reduce the overhead, especially for 12-thread encodings of UHD content.

References

  • [1] CISCO, “Global_2021_forecast_highlights,” p. 6, 2016.
  • [2] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
  • [3] Naty Sidaty, Wassim Hamidouche, Olivier Deforges, Pierrick Philippe, and Jerome Fournier, “Compression Performance of the Versatile Video Coding: HD and UHD Visual Quality Monitoring,” in 2019 Picture Coding Symposium (PCS), Ningbo, China, Nov. 2019, pp. 1–5, IEEE.
  • [4] Frank Bossen, Karsten Suehring, and Xiang Li, “JVET-P0003: AHG report: Test model software development (AHG3),” 2019.
  • [5] Benjamin Bross, Mauricio Alvarez-Mesa, Valeri George, Chi Ching Chi, Tobias Mayer, Ben Juurlink, and Thomas Schierl, “HEVC real-time decoding,” San Diego, California, United States, Sept. 2013, p. 88561R.
  • [6] Wassim Hamidouche, Mickael Raulet, and Olivier Deforges, “4K Real-Time and Parallel Software Video Decoder for Multilayer HEVC Extensions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 1, pp. 169–180, Jan. 2016.
  • [7] Kiran Misra, Andrew Segall, Michael Horowitz, Shilin Xu, Arild Fuldseth, and Minhua Zhou, “An Overview of Tiles in HEVC,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 969–977, Dec. 2013.
  • [8] Chi Ching Chi, M. Alvarez-Mesa, B. Juurlink, G. Clare, F. Henry, S. Pateux, and T. Schierl, “Parallel Scalability and Efficiency of HEVC Parallelization Approaches,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1827–1838, Dec. 2012.
  • [9] Iago Storch, Daniel Palomino, Bruno Zatt, and Luciano Agostini, “Speedup-aware history-based tiling algorithm for the HEVC standard,” in 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, Sept. 2016, pp. 824–828, IEEE.
  • [10] Maria Koziri, Panos K. Papadopoulos, Nikos Tziritas, Nikos Giachoudis, Thanasis Loukopoulos, Samee U. Khan, and Georgios I. Stamoulis, “Heuristics for tile parallelism in HEVC,” in 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, Aug. 2017, pp. 1514–1518, IEEE.
  • [11] Yong-Jo Ahn, Tae-Jin Hwang, Dong-Gyu Sim, and Woo-Jin Han, “Complexity model based load-balancing algorithm for parallel tools of HEVC,” in 2013 Visual Communications and Image Processing (VCIP), Kuching, Malaysia, Nov. 2013, pp. 1–5, IEEE.
  • [12] Cauane Blumenberg, Daniel Palomino, Sergio Bampi, and Bruno Zatt, “Adaptive content-based Tile partitioning algorithm for the HEVC standard,” in 2013 Picture Coding Symposium (PCS), San Jose, CA, USA, Dec. 2013, pp. 185–188, IEEE.
  • [13] Xin Jin and Qionghai Dai, “Clustering-Based Content Adaptive Tiles Under On-chip Memory Constraints,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2331–2344, Dec. 2016.
  • [14] Giovani Malossi, Daniel Palomino, Claudio Diniz, Altamiro Susin, and Sergio Bampi, “Adjusting video tiling to available resources in a per-frame basis in High Efficiency Video Coding,” in 2016 14th IEEE International New Circuits and Systems Conference (NEWCAS), Vancouver, BC, Canada, June 2016, pp. 1–4, IEEE.
  • [15] Chia-Hsin Chan, Chun-Chuan Tu, and Wen-Jiin Tsai, “Improve load balancing and coding efficiency of tiles in high efficiency video coding by adaptive tile boundary,” Journal of Electronic Imaging, vol. 26, no. 1, pp. 013006, Jan. 2017.
  • [16] Jill Boyce, Karsten Suehring, and Xiang Li, “JVET-J1010: JVET common test conditions and software reference configurations,” 2018.
  • [17] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A K-Means Clustering Algorithm,” Applied Statistics, vol. 28, no. 1, pp. 100, 1979.
  • [18] Gisle Bjontegaard, “Calculation of average PSNR differences between RD-Curves,” Apr. 2001, VCEG-M33 ITU-T.
  • [19] Mark D Hill and Michael R Marty, “Amdahl’s Law in the Multicore Era,” p. 6.