
Chroma Intra Prediction with attention-based CNN architectures

by   Marc Gorriz, et al.

Neural networks can be used in video coding to improve chroma intra-prediction. In particular, usage of fully-connected networks has enabled better cross-component prediction with respect to traditional linear models. Nonetheless, state-of-the-art architectures tend to disregard the location of individual reference samples in the prediction process. This paper proposes a new neural network architecture for cross-component intra-prediction. The network uses a novel attention module to model spatial relations between reference and predicted samples. The proposed approach is integrated into the Versatile Video Coding (VVC) prediction pipeline. Experimental results demonstrate compression gains over the latest VVC anchor compared with state-of-the-art chroma intra-prediction methods based on neural networks.



1 Introduction

In the multimedia streaming era, efficient video compression has become an essential asset for tackling the increasing demand for higher-quality video content and its consumption on multiple devices. New compression techniques have been developed with the aim of compacting the representation of video data by identifying and removing spatial, temporal and statistical redundancies within the signal. This results in smaller bitstreams, enabling more efficient storage and transmission as well as distribution of content at higher quality with reduced resources.

Among the fundamental blocks of typical video coding schemes, intra prediction exploits spatial redundancies within a frame by predicting samples of the current block from already reconstructed samples in its close surroundings. The latest draft of the Versatile Video Coding standard [4] (referred to as VVC in the rest of this paper) allows a large number of possible intra modes to be used on the luma component, including up to 67 directional modes and other advanced methods, at the cost of a considerable amount of signalling data. Conversely, to limit the impact of mode signalling on compression performance, a reduced number of modes is employed to intra-predict chroma samples: the Planar, DC, pure horizontal and pure vertical modes, and the Derived Mode (DM, which reuses the mode used to predict the collocated luma block). In addition, VVC introduced the Cross-Component Linear Model (CCLM, or simply LM in this paper) intra modes. When using CCLM, the chroma component is predicted from the already-reconstructed luma samples using a linear model. LM prediction is effective in improving the efficiency of chroma intra-prediction. Nonetheless, the effectiveness of simple linear predictions can be limiting, and improved performance can be achieved using more sophisticated Machine Learning (ML) mechanisms [8, 9].

Differently from these previous methods, where neighbouring references are used regardless of their location, this paper proposes a new ML-based cross-component intra-prediction method which is capable of learning the spatial relations between reference and predicted samples.

A new attention module is proposed to control the contribution of each neighbouring reference sample when computing the prediction for each chroma sample location in the current block, effectively modelling the spatial information in the cross-component prediction process. As a result, the proposed scheme better captures the relationship between the luma and chroma components, resulting in more accurate prediction samples.

2 Background

The recent emergence of deep learning methodologies, and their impact on computer vision and image processing, have enabled the development of novel intelligent algorithms that outperform state-of-the-art methods in many video compression tasks. In the context of intra-prediction in particular, a new algorithm [9] was introduced based on fully-connected layers and Convolutional Neural Networks (CNNs) to map reconstructed neighbouring samples to the samples of the prediction block, achieving BD-rate (Bjontegaard Delta rate) [1] savings of up to 3.0% on average over HEVC, for about a 200% increase in decoding time. The successful integration of CNN-based methods for luma intra-prediction into existing codec architectures has motivated research into alternative methods for chroma prediction, exploiting cross-component redundancies similarly to existing LM methods.

A novel hybrid neural network for chroma intra prediction [8] was recently introduced. A first CNN was designed to extract features from the reconstructed luma samples. This was combined with a fully-connected network used to extract cross-component correlations between neighbouring luma and chroma samples. The resulting architecture was able to derive complex non-linear mappings for end-to-end prediction of the Cb and Cr channels. However, such approaches typically disregard the spatial location of the boundary samples when predicting specific locations within the current block. To address this, an improved cross-component intra-prediction model based on neural networks is proposed, as illustrated in the rest of this paper.

3 Proposed method

Similarly to the model in [8], the proposed method adopts a scheme based on three network branches that are combined to produce the prediction samples. The first two branches work concurrently to extract features from the available reconstructed samples, including the already reconstructed luma block as well as the neighbouring luma and chroma reference samples. The first branch (referred to as the cross-component boundary branch) aims at extracting cross-component information from the neighbouring reconstructed samples, using an extended reference array on the left of, and above, the current block, as illustrated in Fig. 2. The second branch (referred to as the luma convolutional branch) extracts spatial patterns over the collocated reconstructed luma block by applying convolutional operations. The features from the two branches are fused together by means of an attention model, as detailed in the rest of this section. The output of the attention model is finally fed into the third network branch, to produce the resulting Cb and Cr predictions.

An illustration of the proposed network architecture is presented in Fig. 1. Without loss of generality, only square blocks are considered in the rest of this section. After intra-prediction and reconstruction of a luma block, its samples can be used for the prediction of the collocated chroma components. In this discussion, the size of a luma block is assumed to be $N \times N$ samples, the same as the size of the collocated chroma block. This may require conventional downsampling of the reconstructed luma samples, for instance when chroma-subsampled picture formats are used.
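As an illustration of the downsampling step mentioned above, the following sketch downsamples a luma block to the collocated chroma resolution using plain 2x2 averaging. This is a simplified stand-in: VVC defines specific downsampling filters, and the function name is illustrative, not taken from the VTM implementation.

```python
import numpy as np

def downsample_luma(luma):
    """Downsample an N x N luma block to (N/2) x (N/2) chroma resolution
    (e.g. for 4:2:0 content) by averaging each 2x2 group of samples.
    Plain averaging is an illustrative stand-in for the VVC filters."""
    h, w = luma.shape
    return luma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```

For example, an 8x8 reconstructed luma block becomes a 4x4 array matching the collocated 4:2:0 chroma block size.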

For the chroma prediction process, the reference samples used include the collocated luma block $B_Y$, and the arrays of reference samples $b_c$ on the top-left of the current block, where $c \in \{Y, Cb, Cr\}$ refers to the three colour components, respectively. Each array $b_c$ is constructed from the samples on the left boundary (starting from the bottom-most sample), then the corner sample is added, and finally the samples on top are added (starting from the left-most sample). In case some reference samples are not available, these are padded using a predefined value. Finally, $B$ is the cross-component volume obtained by concatenating the three reference arrays $b_Y$, $b_{Cb}$ and $b_{Cr}$.
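The construction of a reference array described above can be sketched as follows. This is a minimal illustration under the stated ordering (left boundary bottom-most first, then corner, then top boundary left-most first); the function name and the padding value of 0.5 (the mid-range of normalised samples) are assumptions, not taken from the paper.

```python
import numpy as np

def build_boundary_array(left, corner, top, pad_value=0.5):
    """Assemble one reference array b_c: left boundary samples (bottom-most
    first), then the corner sample, then the top samples (left-most first).
    Unavailable samples are marked as None and padded with pad_value."""
    samples = list(left) + [corner] + list(top)
    return np.array([pad_value if s is None else s for s in samples],
                    dtype=np.float32)
```

The cross-component volume $B$ would then be obtained by stacking the three arrays built this way for the Y, Cb and Cr references.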

3.1 Cross-component boundary branch

The cross-component boundary branch extracts cross-component features from $B$ by applying consecutive $1 \times 1$ convolutional layers to obtain the output feature maps $X$. By applying $1 \times 1$ convolutions, the boundary input dimensions are preserved, resulting in a $D$-dimensional vector of cross-component information for each boundary location. $X$ can be expressed in neural network form as:

$$X^{l} = \mathrm{ReLU}\left(W^{l} \ast X^{l-1} + w^{l}\right),$$

where $W^{l}$ and $w^{l}$ are the $l$-th layer weights and bias respectively, $X^{0} = B$, and $\mathrm{ReLU}(\cdot)$ is the Rectified Linear Unit non-linear activation function.
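Since a $1 \times 1$ convolution applies the same weights independently at every boundary location, one layer of this branch reduces to a per-location matrix multiplication followed by a ReLU. The sketch below shows this equivalence; shapes and names are illustrative, not from the paper's implementation.

```python
import numpy as np

def boundary_branch_layer(x, W, b):
    """One 1x1 convolutional layer over the boundary array: at each of the
    L boundary locations, the C_in-dim feature vector is mapped to C_out
    dims with shared weights, followed by a ReLU.
    x: (L, C_in), W: (C_in, C_out), b: (C_out,)."""
    return np.maximum(x @ W + b, 0.0)
```

Stacking two such layers with the widths from Table 1 (e.g. 16 then 32 channels for 4x4 blocks) yields the $D$-dimensional feature vector per boundary location.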

3.2 Luma convolutional branch

In parallel with the extraction of the cross-component features, the reconstructed luma block is fed to a different CNN to produce feature map volumes which represent the spatial patterns present in the luma block. The luma convolutional branch is defined by consecutive $3 \times 3$ convolutional layers with a stride of $1$, to obtain the output feature maps $Y$ from the input luma samples. Similarly to the previous branch, a bias and a ReLU activation are applied after each convolution operation:

$$Y^{l} = \mathrm{ReLU}\left(W^{l} \ast Y^{l-1} + w^{l}\right),$$

where $W^{l}$ and $w^{l}$ are the $l$-th layer weights and bias, respectively, and $Y^{0}$ is the input luma block.
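A single layer of this branch can be sketched as a naive $3 \times 3$ convolution with stride 1 and zero padding, so the spatial dimensions of the luma block are preserved. This is an illustrative reference implementation, not the paper's code; a real implementation would use an optimised convolution routine.

```python
import numpy as np

def conv3x3_same(x, W, b):
    """Naive 3x3 convolution, stride 1, zero padding ('same'), plus ReLU.
    x: (H, W_, C_in); W: (3, 3, C_in, C_out); b: (C_out,)."""
    H, Wd, _ = x.shape
    Cout = W.shape[-1]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))   # zero-pad spatial dims
    out = np.zeros((H, Wd, Cout), dtype=x.dtype)
    for i in range(H):
        for j in range(Wd):
            patch = xp[i:i + 3, j:j + 3, :]     # (3, 3, C_in) receptive field
            out[i, j] = np.tensordot(patch, W, axes=([0, 1, 2], [0, 1, 2])) + b
    return np.maximum(out, 0.0)
```

Two such layers with the widths from Table 1 (e.g. 32 and 32 channels for 4x4 blocks) produce the luma feature maps $Y$ fed to the attention module.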

Figure 2: Attention visualisation when predicting a block. One axis represents the block sample locations and the other the boundary reference positions.

3.3 Attention-based fusion module

The concept of "attention-based" learning is a well-known idea used in deep learning frameworks, to improve the performance of trained networks in complex prediction tasks. The idea behind attention models is to reduce complex tasks by predicting smaller "areas of attention" that are processed sequentially, in order to encourage more efficient learning. In particular, self-attention (or intra-attention) is used to assess the impact of particular input variables on the outputs, whereby the prediction is computed focusing on the most relevant elements of the same sequence [5, 11]. Extending this concept to chroma intra-prediction, this paper combines the features from the two aforementioned network branches in order to assess the impact of each input variable with respect to its spatial location. This addresses the limitation of previous cross-component prediction techniques, which generally discard the spatial relation between the neighbouring reference samples and the predicted samples. The feature maps ($X$ and $Y$, from Eq. 1 and Eq. 2) from the first two network branches are each convolved using a $1 \times 1$ kernel, to project them into two corresponding reduced feature spaces. Specifically, $X$ is convolved with a filter $W_F$ to obtain the $h$-dimensional feature matrix $F$. Similarly, $Y$ is convolved with a filter $W_G$ to obtain the $h$-dimensional feature matrix $G$. The two matrices are multiplied together to obtain the pre-attention map $M$. Finally, the attention matrix $A$ is obtained by applying a softmax operation to each element of $M$, to generate the probability of each boundary location being able to predict a sample location in the block. Each value $\alpha_{i,j}$ in $A$ is obtained as:

$$\alpha_{i,j} = \frac{\exp\left(m_{i,j}/T\right)}{\sum_{n} \exp\left(m_{n,j}/T\right)},$$

where $j$ represents the sample location in the predicted block, $i$ represents a reference sample location, and $T$ is the softmax temperature parameter controlling the smoothness of the generated probabilities, with $T > 0$. Notice that the smaller the value of $T$, the more localised the obtained attention areas, resulting in correspondingly fewer boundary samples contributing to a given prediction location, as further illustrated in Section 4.
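The temperature-controlled softmax described above can be sketched as follows, applying the normalisation over the boundary axis so that, for each block location, the contributions of all boundary positions sum to one. The function name and default temperature are illustrative assumptions.

```python
import numpy as np

def attention_matrix(M, T=1.0):
    """Column-wise softmax with temperature T over the pre-attention map M
    (rows index boundary locations, columns index block sample locations).
    Each column of the result is a probability distribution over the
    boundary locations for that block sample."""
    E = np.exp((M - M.max(axis=0, keepdims=True)) / T)  # shift for stability
    return E / E.sum(axis=0, keepdims=True)
```

Lowering `T` sharpens each column of the result towards the single strongest boundary location, matching the localisation behaviour described in the text.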

The weighted sum of the contributions of each reference sample in predicting an output sample at a specific location is obtained by computing the dot product between the cross-component boundary features $X$ (Eq. 1) and the attention matrix $A$ (Eq. 3), or formally $O = X \cdot A$, where $\cdot$ is the dot product. In order to further refine $O$, this weighted sum can be multiplied by the output of the luma branch. To do so, the output of the luma branch is first transformed to change its dimensions by means of a $1 \times 1$ convolution using a matrix $W$, obtaining the transformed representation $\tilde{Y} = W Y$, as in:

$$\tilde{O} = \left(X \cdot A\right) \odot \tilde{Y},$$

where $\odot$ is the element-wise product.
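The fusion step can be sketched with plain matrix operations, treating the boundary features, attention matrix and (flattened) luma features as 2-D arrays. Shapes and the function name are illustrative assumptions chosen to make the dimensional bookkeeping explicit.

```python
import numpy as np

def fuse(X, A, Y, W):
    """Attention-weighted fusion of the two branches.
    X: (C, L)  boundary features, C channels at L boundary locations
    A: (L, P)  attention, columns sum to 1 over the L boundary locations
    Y: (F, P)  luma features at the P block sample locations (flattened)
    W: (C, F)  1x1-convolution matrix transforming Y to C channels
    Returns (C, P): (X . A) refined element-wise by the transformed luma."""
    weighted = X @ A          # weighted sum of boundary features per location
    return weighted * (W @ Y) # element-wise product with transformed luma
```

The result feeds the prediction head branch described in the next subsection.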

3.4 Prediction head branch

The output of the attention model is then fed into the third network branch, to compute the predicted chroma samples. In this branch, a final CNN is used to map the fused features from the first two branches, as combined by means of the attention model, into the output Cb and Cr predicted components. The prediction head branch is defined by two convolutional layers: a first layer of $D$-dimensional convolutional filters, followed by a final layer with two output channels producing the predicted values. As can be noticed, both the Cb and Cr components are obtained at once by this operation. The benefit of the first convolutional layer is evaluated in Table 4, observing an increase in prediction accuracy when it is applied.

4 Experimental Results

The proposed method is integrated in the VVC test model VTM 7.0 [6]. Only $4 \times 4$, $8 \times 8$ and $16 \times 16$ square blocks are supported. The resulting module was implemented as a separate mode whose usage can be signalled in the bitstream, complementing the existing VVC chroma intra-prediction methods on the supported block sizes. Moreover, 4:2:0 chroma sub-sampling is assumed, where the same downsampling filters implemented in VVC are used to downsample collocated luma blocks to the size of the corresponding chroma block.

Training examples were extracted from the DIV2K dataset [10], which contains high-definition, high-resolution content of large diversity. This dataset contains 800 training samples and 100 samples for validation, providing lower-resolution versions obtained by downsampling by factors of 2, 3 and 4 with bilinear and unknown filters. For each data instance, one resolution version was randomly selected, and blocks of the supported sizes were then extracted, balancing the sets between block sizes and ensuring uniform spatial selection within each image. All samples were converted to the YCbCr colour space and further normalised to the range $[0, 1]$. Networks for all targeted block sizes were trained from scratch using a mean squared error loss between the predicted colour components and the ground-truth data, using the Adam optimiser [7].

Several constraints were considered during the implementation process. The proposed models handle variable block sizes by adapting their architecture capacity based on a trade-off between model complexity and prediction performance. As observed in the state-of-the-art hybrid method based on CNNs [8], given a fixed network structure, the depth of the convolutional layers is the most predominant factor when dealing with variable input sizes. Table 1 shows the chosen hyperparameters with respect to the input block size. On the other hand, the dimension parameter $h$ within the attention module was set to $16$ for all the trained models, following a trade-off between performance and complexity. Finally, the softmax temperature $T$ was cross-validated, ensuring a suitable balance between informative samples and noisy ones from the boundary locations. The trained models were plugged into VVC as a new chroma prediction mode, competing with the traditional modes for the supported $4 \times 4$, $8 \times 8$ and $16 \times 16$ block sizes. Then, for each prediction unit, the encoder chooses between the traditional angular modes, the LM models or the proposed neural network mode by minimising a rate-distortion cost criterion.
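The mode decision described above follows the standard Lagrangian rate-distortion formulation $J = D + \lambda R$. The sketch below illustrates this selection; the candidate names, cost values and function name are hypothetical, not taken from the VTM code.

```python
def best_mode(candidates, lam):
    """Pick the prediction mode minimising the rate-distortion cost
    J = D + lambda * R, as an encoder does when choosing between the
    traditional modes, the LM models and the proposed network mode.
    candidates: dict mapping mode name -> (distortion, rate_in_bits)."""
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])
```

With a small $\lambda$ the encoder favours low distortion (e.g. an accurate but costly-to-signal network mode); as $\lambda$ grows, cheaper-to-signal modes win.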

              4×4         8×8         16×16
CC Boundary   16, 32      32, 64      64, 96
Luma Conv     32, 32      64, 64      96, 96
Attention     16, 16, 32  16, 16, 64  16, 16, 96
Output        32, 2       64, 2       96, 2

Table 1: Model hyperparameters (layer output dimensions) per block size

The proposed methodology is tested under the Common Test Conditions (CTC) [3], using the suggested all-intra configuration for VVC with QP values of 22, 27, 32 and 37. BD-rate is adopted to evaluate the relative compression efficiency with respect to the latest VVC anchor. In addition, a joint cross-component metric (YCbCr) [1] is considered to evaluate the influence of the chroma gains in conjunction with the luma component. The test set includes 26 video sequences of different resolutions, known as Classes A, B, C, D and E. Due to the nature of the training set, only natural-content sequences were considered, and screen-content sequences (Class F in the CTC) were excluded from the tests. It is worth mentioning that in these tests, all block sizes were allowed to be used by the VVC encoder, including all rectangular shapes as well as larger blocks that are not supported by the proposed method. As such, the potential of the algorithm is highly limited, given that it is only applied to a limited range of blocks. Nonetheless, the algorithm is capable of providing consistent compression gains. The overall results are summarised in Table 2, showing average BD-rate reductions of -0.15%, -0.68% and -0.53% for the Y, Cb and Cr components respectively, and an average joint YCbCr BD-rate (calculated as in [2]) reduction of -0.20%.

Moreover, in order to further evaluate the performance of the scheme, a constrained test is also performed whereby the VVC partitioning process is limited to using only the supported square blocks of $4 \times 4$, $8 \times 8$ and $16 \times 16$ sizes. A corresponding anchor was generated for this test. Table 3 summarises the results for the constrained test, showing a considerable improvement over the constrained VVC anchor. Average BD-rate reductions of -0.22%, -1.84% and -1.78% are reported for the Y, Cb and Cr components respectively, as well as an average joint YCbCr reduction of -0.43%. In terms of complexity, even though several simplifications were considered during the integration process, the proposed solution significantly impacts the encoder and decoder run times, by up to 120% and 947% on average, respectively. Further simplifications will have to be adopted in order to increase the computational efficiency of the scheme. Finally, the trained models were compared with the state-of-the-art hybrid architecture [8] with the aim of evaluating the influence of the proposed attention module. Table 4 summarises the prediction accuracy results on the DIV2K test set in terms of average PSNR.

          Y        Cb       Cr       YCbCr
Class A1  -0.18%   -0.84%   -0.58%   -0.23%
Class A2  -0.13%   -0.57%   -0.38%   -0.19%
Class B   -0.15%   -0.65%   -0.67%   -0.21%
Class C   -0.17%   -0.63%   -0.41%   -0.22%
Class D   -0.17%   -0.63%   -0.61%   -0.21%
Class E   -0.08%   -0.80%   -0.47%   -0.16%
Overall   -0.15%   -0.68%   -0.53%   -0.20%

Table 2: BD-rate results anchoring to VTM-7.0

          Y        Cb       Cr       YCbCr
Class A1  -0.26%   -2.17%   -1.96%   -0.53%
Class A2  -0.22%   -2.37%   -1.64%   -0.50%
Class B   -0.23%   -2.00%   -2.17%   -0.45%
Class C   -0.26%   -1.64%   -1.41%   -0.44%
Class D   -0.25%   -1.55%   -1.67%   -0.42%
Class E   -0.03%   -1.35%   -1.77%   -0.24%
Overall   -0.22%   -1.84%   -1.78%   -0.43%

Table 3: BD-rate results for constrained test

Model              4×4     8×8     16×16
Hybrid CNN [8]     28.61   31.47   33.36
Ours without head  29.87   32.68   35.77
Ours               30.23   33.13   36.13

Table 4: Prediction performance evaluation (PSNR, dB)

5 Conclusions

This paper proposed an improvement over existing neural network-based approaches for chroma intra-prediction, introducing a new attention module capable of learning the spatial relations between the neighbouring reference samples and the predicted block samples when extracting correlational features. The proposed architecture was integrated into the latest VVC anchor, signalled as a new chroma intra-prediction mode that competes with the traditional modes for predicting the chroma component samples. Experimental results show the effectiveness of the proposed method, achieving notable compression efficiency gains. As future work, a complete set of network models covering all VVC block sizes will be implemented, in order to ensure full usage of the proposed approach and deliver the promising results shown in the constrained experiment.


  • [1] G. Bjontegaard (2001) Calculation of average PSNR differences between RD-curves. VCEG-M33. Cited by: §2, §4.
  • [2] F. Bossen (Geneva, Switzerland, March 2019) On reporting combined YUV BD rates. Document JVET-N0341. Cited by: §4.
  • [3] J. Boyce, K. Suehring, X. Li, and V. Seregin (Ljubljana, Slovenia, July 2018) JVET common test conditions and software reference configurations. Document JVET-J1010. Cited by: §4.
  • [4] B. Bross, J. Chen, and S. Liu (Geneva, Switzerland, October 2019) Versatile Video Coding (VVC) draft 7. Cited by: §1.
  • [5] J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. Cited by: §3.3.
  • [6] J. Chen, Y. Ye, and S. Kim (Geneva, October 2019) Algorithm description for Versatile Video Coding and Test Model 7 (VTM 7). Document JVET-P2002. Cited by: §4.
  • [7] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Cited by: §4.
  • [8] Y. Li, L. Li, Z. Li, J. Yang, N. Xu, D. Liu, and H. Li (2018) A hybrid neural network for chroma intra prediction. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1797–1801. Cited by: §1, §2, §3, Table 4, §4, §4.
  • [9] J. Pfaff, P. Helle, D. Maniry, S. Kaltenstadler, B. Stallenberger, P. Merkle, M. Siekmann, H. Schwarz, D. Marpe, and T. Wiegand (2018) Intra prediction modes based on neural networks. Doc. JVET-J0037-v2, Joint Video Exploration Team of ITU-T VCEG and ISO/IEC MPEG. Cited by: §1, §2.
  • [10] R. Timofte, E. Agustsson, L. Van Gool, M. Yang, and L. Zhang (2017) NTIRE 2017 challenge on single image super-resolution: methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 114–125. Cited by: §4.
  • [11] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. Cited by: §3.3.