1 Introduction
In the multimedia streaming era, efficient video compression has become essential for tackling the increasing demand for higher-quality video content and its consumption on multiple devices. New compression techniques have been developed with the aim of compacting the representation of video data by identifying and removing spatial, temporal and statistical redundancies within the signal. This results in smaller bitstreams, enabling more efficient storage and transmission, as well as distribution of content at higher quality with reduced resources.
Among the fundamental blocks of typical video coding schemes, intra prediction exploits spatial redundancies within a frame by predicting samples of the current block from already-reconstructed samples in its close surroundings. The latest draft of the Versatile Video Coding (VVC) standard [4] (referred to as VVC in the rest of this paper) allows a large number of possible intra modes to be used on the luma component, including up to 67 directional modes and other advanced methods, at the cost of a considerable amount of signalling data. Conversely, to limit the impact of mode signalling on compression performance, a reduced number of modes is employed to intra-predict chroma samples, including the Planar, DC, pure horizontal and pure vertical modes, and the Derived Mode (DM, corresponding to using the same mode used to predict the collocated luma block). In addition, VVC introduced the Cross-Component Linear Model (CCLM, or simply LM in this paper) intra modes. When using CCLM, the chroma component is predicted from the already-reconstructed luma samples using a linear model. LM prediction is effective in improving the efficiency of chroma intra-prediction. Nonetheless, the expressiveness of simple linear predictions can be limiting, and as such improved performance can be achieved using more sophisticated Machine Learning (ML) mechanisms
[8, 9]. Unlike these previous methods, where neighbouring references are used regardless of their location, this paper proposes a new ML-based cross-component intra-prediction method which is capable of learning the spatial relations between reference and predicted samples. A new attention module is proposed to control the contribution of each neighbouring reference sample when computing the prediction at each sample location of the current chroma block, effectively modelling the spatial information in the cross-component prediction process. As a result, the proposed scheme better captures the relationship between the luma and chroma components, resulting in more accurate prediction samples.
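As context for the LM modes discussed above, the sketch below shows a CCLM-style linear prediction using a simplified min/max parameter derivation; the actual VVC derivation uses averaged extrema and integer arithmetic, and all names here are illustrative:

```python
import numpy as np

def cclm_predict(luma_ds, nbr_luma, nbr_chroma):
    # Fit chroma = alpha * luma + beta from the two extreme neighbouring
    # luma samples and their collocated chroma values, then apply the
    # model to the downsampled reconstructed luma block.
    i_max, i_min = int(np.argmax(nbr_luma)), int(np.argmin(nbr_luma))
    denom = float(nbr_luma[i_max]) - float(nbr_luma[i_min])
    if denom == 0.0:
        # Degenerate neighbourhood: fall back to the mean chroma value.
        return np.full(np.shape(luma_ds), float(np.mean(nbr_chroma)))
    alpha = (float(nbr_chroma[i_max]) - float(nbr_chroma[i_min])) / denom
    beta = float(nbr_chroma[i_min]) - alpha * float(nbr_luma[i_min])
    return alpha * np.asarray(luma_ds, dtype=float) + beta
```

With neighbouring luma values {10, 20} mapped to chroma {30, 50}, the model resolves to alpha = 2, beta = 10, so a luma sample of 15 predicts a chroma value of 40.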
2 Background
The recent emergence of deep learning methodologies, and the impact of these new techniques on computer vision and image processing, have enabled the development of novel intelligent algorithms that outperform the state of the art in many video compression tasks. In particular, in the context of intra-prediction, a new algorithm [9] was introduced, based on fully-connected layers and Convolutional Neural Networks (CNNs) to map the prediction block positions from the reconstructed neighbouring samples, achieving Bjontegaard Delta rate (BD-rate) [1] savings of up to 3.0% on average over HEVC, for about a 200% increase in decoding time. The successful integration of CNN-based methods for luma intra-prediction into existing codec architectures has motivated the research of alternative methods for chroma prediction, exploiting cross-component redundancies similarly to existing LM methods. A novel hybrid neural network for chroma intra prediction [8] was recently introduced. A first CNN was designed to extract features from reconstructed luma samples. This was combined with another fully-connected network used to extract cross-component correlations between neighbouring luma and chroma samples. The resulting architecture was able to derive complex non-linear mappings for end-to-end prediction of the Cb and Cr channels. On the other hand, such approaches typically disregard the spatial location of the boundary samples when predicting specific locations within the current block. To this end, an improved cross-component intra-prediction model based on neural networks is proposed, as illustrated in the rest of this paper.
3 Proposed method
Similarly to the model in [8], the proposed method adopts a scheme based on three network branches that are combined to produce prediction samples. The first two branches work concurrently to extract features from the available reconstructed samples, including the already-reconstructed luma block as well as the neighbouring luma and chroma reference samples. The first branch (referred to as the cross-component boundary branch) aims at extracting cross-component information from neighbouring reconstructed samples, using an extended reference array on the left of, and above, the current block, as illustrated in Fig. 2. The second branch (referred to as the luma convolutional branch) extracts spatial patterns over the collocated reconstructed luma block by applying convolutional operations. The features from the two branches are fused together by means of an attention model, as detailed in the rest of this section. The output of the attention model is finally fed into a third network branch to produce the resulting Cb and Cr predictions.
An illustration of the proposed network architecture is presented in Fig. 1. Without loss of generality, only square blocks are considered in the rest of this section. After intra-prediction and reconstruction of a luma block, its samples can be used for prediction of the collocated chroma components. In this discussion, the size of a luma block is assumed to be N × N samples, the same as the size of the collocated chroma block. This may require the usage of conventional downsampling operations, for instance when chroma-subsampled picture formats are used.
For the chroma prediction process, the reference samples used include the collocated luma block X, and the arrays of reference samples b_c on the top-left of the current block, where c = Y, Cb or Cr refers to the three components, respectively. Each array b_c is constructed from the samples on the left boundary (starting from the bottom-most sample), then the corner sample is added, and finally the samples on top are added (starting from the left-most sample). In case some reference samples are not available, these are padded using a predefined value. Finally, B is the cross-component volume obtained by concatenating the three reference arrays b_Y, b_Cb and b_Cr.
3.1 Cross-component boundary branch
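The reference-array construction just described can be sketched as follows; marking unavailable samples as `None` and the default padding value (mid-range after normalisation) are illustrative assumptions:

```python
import numpy as np

def build_boundary(left, corner, top, pad_value=0.5):
    # Scan the left boundary bottom-up, append the corner sample, then
    # the top boundary left-to-right; unavailable samples (None) are
    # replaced by a predefined padding value.
    ordered = list(left[::-1]) + [corner] + list(top)
    return np.array([pad_value if s is None else s for s in ordered],
                    dtype=np.float32)

def cross_component_volume(b_y, b_cb, b_cr):
    # Stack the three per-component arrays into a (2N + 1) x 3 volume B.
    return np.stack([b_y, b_cb, b_cr], axis=-1)
```

For an N = 2 block, each boundary array has 2N + 1 = 5 entries, and the stacked volume has shape (5, 3).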
The cross-component boundary branch extracts cross-component features from B by applying consecutive 1 × 1 convolutional layers to obtain the output feature maps. By applying 1 × 1 convolutions, the boundary input dimensions are preserved, resulting in a D-dimensional vector of cross-component information for each boundary location. The branch output F_B can be expressed in a neural network form as:

F_B^(l) = ReLU(W^(l) F_B^(l-1) + β^(l)),  F_B^(0) = B,  (1)

where W^(l) and β^(l) are the layer weights and bias respectively, l = 1, …, L indexes the layers, and ReLU(x) = max(0, x) is the Rectified Linear Unit non-linear activation function.
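Since a 1 × 1 convolution over the boundary volume is equivalent to a dense layer shared across all boundary positions, the branch can be sketched as below; the layer widths and random weights are illustrative, not trained parameters:

```python
import numpy as np

def conv1x1_relu(x, w, b):
    # A 1x1 convolution over the boundary volume is a dense layer shared
    # across boundary positions: x is (L, C_in), w is (C_in, C_out).
    return np.maximum(x @ w + b, 0.0)

def boundary_branch(B, layers):
    # Apply consecutive 1x1 conv + ReLU layers (Eq. 1); the boundary
    # length 2N + 1 is preserved, only the channel depth changes.
    f = B
    for w, b in layers:
        f = conv1x1_relu(f, w, b)
    return f
```

For a 4 × 4 block the boundary has 9 positions; two layers of widths 16 and 32 turn a (9, 3) input into a (9, 32) feature map, one D-dimensional vector per boundary location.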
3.2 Luma convolutional branch
In parallel with the extraction of the cross-component features, the reconstructed luma block X is fed to a different CNN to produce feature map volumes which represent the spatial patterns present in the luma block. The luma convolutional branch is defined by consecutive 3 × 3 convolutional layers with a stride of 1, to obtain the output feature maps F_X from the N × N input samples. Similarly to the previous branch, a bias and a ReLU activation are applied after each convolution operation:

F_X^(l) = ReLU(W'^(l) ∗ F_X^(l-1) + β'^(l)),  F_X^(0) = X,  (2)

where W'^(l) and β'^(l) are the layer weights and bias, respectively, and X is the input luma block.
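A minimal NumPy sketch of one such convolutional layer, assuming 3 × 3 kernels with stride 1 and zero padding so that the N × N spatial size is preserved:

```python
import numpy as np

def conv3x3_relu(x, w, b):
    # One 3x3 convolution with stride 1 and zero ("same") padding,
    # followed by bias and ReLU, as in Eq. (2).
    # x: (H, W, C_in), w: (3, 3, C_in, C_out), b: (C_out,).
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.empty((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            # Contract the 3x3xC_in patch against the kernel.
            out[i, j] = np.tensordot(xp[i:i + 3, j:j + 3, :], w, axes=3) + b
    return np.maximum(out, 0.0)
```

A kernel that is 1 at its centre tap and 0 elsewhere acts as the identity, which is a convenient sanity check for the padding logic.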
3.3 Attention-based fusion module
The concept of "attention-based" learning is a well-known idea used in deep learning frameworks, to improve the performance of trained networks in complex prediction tasks. The idea behind attention models is to reduce complex tasks by predicting smaller "areas of attention" that are processed sequentially, in order to encourage more efficient learning. In particular, self-attention (or intra-attention) is used to assess the impact of particular input variables on the outputs, whereby the prediction is computed focusing on the most relevant elements of the same sequence [5, 11]. Extending this concept to chroma intra-prediction, this paper combines the features from the two aforementioned network branches in order to assess the impact of each input variable with respect to its spatial location. This addresses previous limitations of similar cross-component prediction techniques, which generally discard the spatial relation between the neighbouring references and the predicted samples. The feature maps F_B and F_X (from Eq. 1 and Eq. 2) from the first two network branches are each convolved using a 1 × 1 kernel, to project them into two corresponding reduced feature spaces. Specifically, F_B is convolved with a filter to obtain the (2N + 1) × h-dimensional feature matrix G. Similarly, F_X is convolved with a filter to obtain the N² × h-dimensional feature matrix S. The two matrices are multiplied together to obtain the pre-attention map M = S Gᵀ. Finally, the attention matrix A is obtained by applying a softmax operation to each element of M, to generate the probability of each boundary location being able to predict a sample location in the block. Each value α_{i,j} in A is obtained as:

α_{i,j} = exp(m_{i,j} / T) / Σ_{k=1}^{2N+1} exp(m_{i,k} / T),  (3)

where i represents the sample location in the predicted block, j represents a reference sample location, and T is the softmax temperature parameter controlling the smoothness of the generated probabilities, with T > 0. Notice that the smaller the value of T, the more localised are the obtained attention areas, resulting in correspondingly fewer boundary samples contributing to a given prediction location, as further illustrated in Section 4.
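The temperature softmax of Eq. 3 can be sketched as follows; the default temperature value here is illustrative, not the cross-validated one:

```python
import numpy as np

def attention(S, G, T=0.5):
    # Pre-attention map M = S @ G.T, then a temperature softmax over the
    # boundary axis (Eq. 3). S: (N*N, h) projected luma features,
    # G: (2N+1, h) projected boundary features.
    M = S @ G.T
    # Subtract the row-wise max for numerical stability; this does not
    # change the softmax output.
    E = np.exp((M - M.max(axis=1, keepdims=True)) / T)
    return E / E.sum(axis=1, keepdims=True)
```

Each row of the result sums to 1, and lowering T sharpens the distribution so fewer boundary locations dominate each prediction, matching the localisation behaviour described above.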
The weighted sum of the contributions of each reference sample in predicting a given output sample at a specific location is obtained by computing the dot product between the cross-component boundary features F_B (Eq. 1) and the attention matrix A (Eq. 3), or formally O = A · F_B, where · is the dot product. In order to further refine O, this weighted sum can be multiplied by the output of the luma branch. To do so, the output of the luma branch must be transformed to change its dimensions, by means of a 1 × 1 convolution using a matrix W_t to obtain a transformed representation F'_X, as in:

O' = O ⊙ F'_X,  (4)

where ⊙ is the element-wise product.
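Putting the two fusion steps together, a sketch under the shapes assumed above (attention weights, boundary features, transformed luma features):

```python
import numpy as np

def fuse(A, F_B, F_X_t):
    # O = A @ F_B: for every block location, a weighted sum of the
    # boundary features under the attention weights; then refine by the
    # element-wise product with the transformed luma features (Eq. 4).
    # A: (N*N, 2N+1), F_B: (2N+1, D), F_X_t: (N*N, D).
    O = A @ F_B
    return O * F_X_t
```

With one-hot attention rows and all-ones luma features, each output row simply copies the selected boundary feature vector, which is a direct check of the weighted-sum interpretation.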
3.4 Prediction head branch
The output of the attention model is further fed into the third network branch, to compute the predicted chroma samples. In this branch, a final CNN is used to map the fused features from the first two branches, as combined by means of the attention model, into the output Cb and Cr predicted components. The prediction head branch is defined by two convolutional layers: a first layer extracting intermediate features, followed by a second layer producing the output predicted values. As can be noticed, both the Cb and Cr components are obtained at once following this operation. The use of the first convolutional layer is evaluated in Table 4, observing an increase in prediction accuracy when it is applied.
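A sketch of such a two-layer head, simplified to per-location (1 × 1-style) layers; the actual kernel sizes and layer widths may differ from this assumption:

```python
import numpy as np

def prediction_head(F_fused, w1, b1, w2, b2):
    # First layer: intermediate feature extraction with ReLU.
    h = np.maximum(F_fused @ w1 + b1, 0.0)
    # Second layer: project to two channels, predicting Cb and Cr
    # jointly for all N*N block locations.
    return h @ w2 + b2
```

Given fused features of shape (N², D), the head returns an (N², 2) array holding both chroma predictions at once, as described above.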
4 Experimental Results
The proposed method is integrated in the VVC test model VTM 7.0 [6]. Only 4 × 4, 8 × 8 and 16 × 16 square blocks are supported. The resulting module was implemented as a separate mode whose usage can be signalled in the bitstream, complementing the existing VVC chroma intra-prediction methods on the supported block sizes. Moreover, 4:2:0 chroma subsampling is assumed, where the same downsampling filters implemented in VVC are used to downsample collocated luma blocks to the size of the corresponding chroma block.
Training examples were extracted from the DIV2K dataset [10], which contains high-definition, high-resolution content of large diversity. This database contains 800 training samples and 100 samples for validation, providing lower-resolution versions obtained by downsampling by factors of 2, 3 and 4 with bilinear and unknown filters. For each data instance, one resolution version was randomly selected and blocks of the supported sizes were then extracted, producing sets balanced between block sizes with uniform spatial selection within each image. All samples were converted to the YCbCr colour space and further normalised to the range [0, 1]. Networks for all targeted block sizes were trained from scratch using a mean squared error loss between the predicted colour components and the ground-truth data, using the Adam optimiser [7] with a fixed learning rate.
Several constraints were considered during the implementation process. The proposed models handle variable block sizes by adapting their architecture capacity, based on a trade-off between model complexity and prediction performance. As observed in the state-of-the-art hybrid method based on CNNs [8], given a fixed network structure, the depth of the convolutional layers is the predominant factor when dealing with variable input sizes. Table 1 shows the chosen hyperparameters with respect to the input block size. On the other hand, the dimension parameter h within the attention module was set to 16 for all the trained models, following a trade-off between performance and complexity. Finally, the softmax temperature T was chosen by cross-validation, ensuring a suitable balance between informative samples and noisy ones from the boundary locations. Trained models were plugged into VVC as a new chroma prediction mode, competing with the traditional modes for the supported 4 × 4, 8 × 8 and 16 × 16 block sizes. Then, for each prediction unit, the encoder chooses between the traditional angular modes, the LM models or the proposed neural network mode by minimising a rate-distortion cost criterion.

Branch        4 × 4        8 × 8        16 × 16
CC Boundary   16, 32       32, 64       64, 96
Luma Conv     32, 32       64, 64       96, 96
Attention     16, 16, 32   16, 16, 64   16, 16, 96
Output        32, 2        64, 2        96, 2
The proposed methodology is tested under the Common Test Conditions (CTC) [3], using the suggested all-intra configuration for VVC with QP values of 22, 27, 32 and 37. BD-rate is adopted to evaluate the relative compression efficiency with respect to the latest VVC anchor. Besides, a joint cross-component metric (YCbCr) [1] is considered to evaluate the influence of the chroma gains when signalling the luma component. Test sequences include 26 video sequences of different resolutions, known as Classes A, B, C, D and E. Due to the nature of the training set, only natural content sequences were considered, and screen content sequences (Class F in the CTC) were excluded from the tests. It is worth mentioning that in these tests, all block sizes were allowed to be used by the VVC encoder, including all rectangular shapes as well as larger blocks that are not supported by the proposed method. As such, the algorithm's potential is highly limited, given that it is only applied to a restricted range of blocks. Nonetheless, the algorithm is capable of providing consistent compression gains. The overall results are summarised in Table 2, showing average BD-rate reductions for the Y, Cb and Cr components, and an average joint YCbCr BD-rate reduction (calculated as in [2]).
Moreover, in order to further evaluate the performance of the scheme, a constrained test was also performed, whereby the VVC partitioning process is limited to using only the supported square block sizes. A corresponding anchor was generated for this test. Table 3 summarises the results for the constrained test, showing a considerable improvement over the constrained VVC anchor, with average BD-rate reductions reported for the Y, Cb and Cr components, as well as an average joint YCbCr reduction. In terms of complexity, even though several simplifications were considered during the integration process, the proposed solution significantly impacts the encoder and decoder time, by up to 120% and 947% on average, respectively. Future simplifications will have to be adopted in order to increase the computational efficiency of the scheme. Finally, the trained models were compared with the state-of-the-art hybrid architecture [8], with the aim of evaluating the influence of the proposed attention module. Table 4 summarises the results for prediction accuracy over the DIV2K test set by means of averaged PSNR.
5 Conclusions
This paper proposed an improvement over existing neural network-based approaches for chroma intra-prediction, introducing a new attention module which is capable of learning the spatial relations between the neighbouring reference samples and the predicted block samples when extracting correlational features. The proposed architecture was integrated into the latest VVC anchor, signalled as a new chroma intra-prediction mode working alongside the traditional modes to predict the chroma component samples. Experimental results show the effectiveness of the proposed method, achieving remarkable compression efficiency. As future work, a complete set of network models for all VVC block sizes will be implemented in order to ensure full usage of the proposed approach, leading to the promising results shown in the constrained experiment.
References
 [1] (2001) Calculation of average PSNR differences between RD-curves. VCEG-M33. Cited by: §2, §4.
 [2] (Geneva, Switzerland, March 2019) On reporting combined YUV BD rates. Document JVET-N0341. Cited by: §4.
 [3] (Ljubljana, Slovenia, July 2018) JVET common test conditions and software reference configurations. Document JVET-J1010. Cited by: §4.
 [4] (Geneva, Switzerland, October 2019) Versatile Video Coding (VVC) draft 7. Cited by: §1.
 [5] (2016) Long short-term memory-networks for machine reading. Cited by: §3.3.
 [6] (Geneva, October 2019) Algorithm description for Versatile Video Coding and test model 7 (VTM 7). Document JVET-P2002. Cited by: §4.
 [7] (2014) Adam: a method for stochastic optimization. Cited by: §4.
 [8] (2018) A hybrid neural network for chroma intra prediction. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1797–1801. Cited by: §1, §2, §3, Table 4, §4, §4.
 [9] (2018) Intra prediction modes based on neural networks. Doc. JVET-J0037-v2, Joint Video Exploration Team of ITU-T VCEG and ISO/IEC MPEG. Cited by: §1, §2.

 [10] (2017) NTIRE 2017 challenge on single image super-resolution: methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 114–125. Cited by: §4.
 [11] (2018) Self-attention generative adversarial networks. Cited by: §3.3.