Modern video coding standards, such as H.265/High Efficiency Video Coding (HEVC), make use of complex mechanisms to provide remarkable compression efficiency. For distribution, frames are encoded using so-called random-access configurations, in which most frames are inter-predicted, while a few intra frames are inserted periodically in the sequence (the number of frames between two intra frames is referred to as the intra-period). Intra frame coding uses prediction to decrease spatial redundancies, transform coding of residual signals, quantisation, and entropy coding to reduce statistical redundancies. Due to the inherent complexity of these modules, it is generally difficult to estimate the effects of an encoder on a given frame, in terms of the number of bits and the distortion, without actually encoding it. At the same time, rate-control mechanisms typically work by allocating the available number of bits per second among the frames in an intra-period, and then setting the encoder parameters to meet this allocation. Allocating the correct number of bits to intra frames is crucial, since such frames typically need significantly more bits than inter frames (due to the reduced efficiency of the encoding scheme). However, they should also be encoded at the highest quality, as they are used for reference by subsequent inter frames. As such, schemes to accurately predict the number of bits and the distortion generated by an intra frame encoder are highly beneficial.
A method based on deep learning to estimate distortion and number of bits needed to encode an intra frame is proposed in this paper. A first CNN is modelled to estimate the compressed frame size, measured as bits-per-pixel (bpp), and the average distortion, measured using the Peak Signal-to-Noise Ratio (PSNR) between original and compressed frames, obtained using different Quantisation Parameters (QPs). An additional CNN is also proposed to estimate distortion maps, namely pixel-wise maps of absolute differences between original and reconstructed frames, which may be used for block-wise rate-control or adaptive-quantisation schemes. The CNN computes the maps based on the original frame and an input QP.
II Related Work
Methods based on deep learning have been shown to be very successful in different estimation tasks. In particular, Convolutional Neural Networks (CNNs) have attracted a lot of attention in recent years due to their good performance, and have been extensively used for classification and segmentation, noise removal, and depth estimation.
Deep learning has also been used in video coding for various applications, including frame partitioning, intra mode selection, arithmetic coding, compressed frame size and distortion modelling, and post-processing. Laude and Ostermann introduced a CNN-based classifier for intra mode decision. The CNN takes an input block and outputs the predicted intra mode to be used. Training uses original samples to avoid dependencies on other encoder decisions and reconstructed data, which allows several blocks to be processed in parallel. Li et al. proposed a learning-based classifier to determine the partitioning of coding tree units (CTUs). Three CNNs are modelled to learn the split decision of CTUs at different depth levels, following the maximum and minimum CTU sizes in HEVC. Song et al. introduced a two-fold CNN-based arithmetic coding scheme. First, a CNN is used to predict the distribution of the intra modes, taking as input the Most Probable Modes (MPMs) of the current block and the reconstructed neighbouring blocks. Subsequently, the predicted distributions are used in a multi-level arithmetic coding engine. Zhou et al. proposed a CNN to replace the deblocking filter and Sample Adaptive Offset (SAO).
An approach was presented by Xu et al., where CNNs are used to estimate distortion maps and compressed frame sizes. Firstly, distortion maps are calculated with respect to the Structural Similarity Index (SSIM) between the original frame and its reconstruction. Secondly, compressed frame sizes are estimated, in the form of a vector of bits obtained after encoding a frame using different QPs. Both CNNs only use linear activations and can therefore be modelled as a combination of linear functions.
III Proposed Approach
Whereas the approach of Xu et al. models distortion in terms of SSIM, most video encoders rely on Mean Square Error (MSE) based distortions to perform encoder-side mode decisions. Additionally, due to the non-linearity of several of the encoder blocks, using only linear activations may not be sufficient to provide accurate estimates. Finally, practical applications may need a low-complexity estimate of the distortion and the number of bits. As such, the approach proposed here differs from the base CNN in that it predicts MSE distortions (instead of SSIM values) and makes use of non-linear activation functions. Moreover, in addition to a methodology to obtain local distortion maps, a further CNN is proposed here which provides a low-complexity estimate of the average distortion for the whole frame (referred to as global distortion) and of the number of bits for a variety of QPs, in a single pass. The estimate of such global distortions was in fact found to be more accurate than that of local distortions, as shown in the rest of this paper.
III-A Local estimation of distortion maps
The estimation of distortion maps was performed using a CNN with two inputs. The first input is the original frame data . Only the luminance is considered, namely a matrix of dimension , which is then normalised as follows:
where and , is the bitdepth of the source samples. In addition, a second input is also considered, which consists of a normalised map of QP values (with respect to the maximum QP value , which in HEVC is set to 51), , of dimension , obtained as:
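The two network inputs described above can be sketched as follows. The normalisation formulas did not survive in this copy of the text, so the sketch assumes that luminance samples are divided by the peak value 2^b - 1 for a bitdepth b, and that the QP map is a constant map of the same dimensions as the frame holding QP divided by 51. The function names `normalise_luma` and `qp_map` are hypothetical.

```python
QP_MAX = 51  # maximum QP value in HEVC, as stated in the text


def normalise_luma(frame, bitdepth=8):
    """Scale integer luma samples to [0, 1].

    Assumption: division by the peak value 2**bitdepth - 1; the paper's
    exact normalisation formula is not available here.
    """
    peak = (1 << bitdepth) - 1
    return [[sample / peak for sample in row] for row in frame]


def qp_map(qp, height, width):
    """Constant map holding qp / QP_MAX, assumed to match the frame size."""
    value = qp / QP_MAX
    return [[value] * width for _ in range(height)]


# Toy 2x2 luma block (illustrative values only).
luma = [[0, 128], [255, 64]]
x = normalise_luma(luma, bitdepth=8)
q = qp_map(37, height=2, width=2)
```

Keeping both inputs in the same [0, 1] range means the frame content and the QP contribute at a comparable scale when they are fed to the first convolutional layer.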
For the training, a set of ground truth distortion maps was used, namely sample-wise maps of absolute differences between the original and reconstructed frames. The goal of the network is to estimate the distortion map. As shown in Fig. 1, the network is a CNN formed of residual connections, convolutions, non-linear mapping, down-sampling, up-sampling and skip connections. The network initially learns the differences between inputs and outputs, where such difference is modelled in the last layer as an element-wise summation between the output of the previous layer and the network input. Secondly, convolutional layers use a stride of, and filter sizes of , except the final layer which uses a filter. Thirdly, non-linear mapping is achieved by adding a Parametric Rectified Linear Unit (PReLU) after each convolutional layer, which increases the flexibility of the network. Max pooling layers adopt a filter size of, the stride is and the output represents one quarter of the input. Up-sampling layers balance the size reduction introduced by the max pooling layers. Finally, skip connections serve to aggregate multi-level features, and are modelled by concatenating the features learnt in the second and fourth convolutional layers with the features learnt in the ninth and seventh convolutional layers, respectively.
The loss function used for training is the MSE:
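As a minimal sketch of this loss, the MSE between a predicted and a ground truth distortion map can be computed as below. Maps are represented as plain nested lists, and the name `mse_loss` is hypothetical; the actual training used TensorFlow, whose implementation is not reproduced here.

```python
def mse_loss(predicted, target):
    """Mean of squared sample-wise differences between two equally sized maps."""
    total = 0.0
    count = 0
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy 2x2 maps (illustrative values only): a single sample differs by 2.
loss = mse_loss([[1, 2], [3, 4]], [[1, 2], [3, 6]])
```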
III-B Estimation of number of bits and global distortions
An additional CNN was modelled to estimate the number of bits produced by an HEVC encoder while intra coding a frame. The CNN takes as input the normalised luminance image data, and is given ground truths in the form of a vector of scalars, where each element is the number of bits necessary to encode the frame with a certain QP value. The length of the vector equals the number of QP values considered. The goal is to estimate this vector. As shown in Fig. 2, the mapping is a CNN similar to the one used for distortion maps. Nevertheless, it uses Fully Connected (FC) layers that extract meaningful data from the features. Moreover, convolutional layers are activated using the Rectified Linear Unit (ReLU), and the loss function is the Mean Absolute Error (MAE):
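A minimal sketch of the MAE over a vector with one entry per QP is given below. The name `mae_loss` and the numerical values are hypothetical and only illustrate the computation.

```python
def mae_loss(predicted, target):
    """Mean absolute error between two equally long vectors (one entry per QP)."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Illustrative bpp vectors for four QPs; these are not results from the paper.
predicted_bpp = [0.5, 0.3, 0.2, 0.1]
target_bpp = [0.4, 0.3, 0.1, 0.1]
loss = mae_loss(predicted_bpp, target_bpp)
```

Compared with a squared error, the absolute error weights all QPs linearly, so the larger values typically obtained at low QPs dominate the loss less strongly.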
In addition to being used for predicting the number of bits, the same CNN was also trained to predict global average distortions. In this case, each element of the ground truth vector is the mean of the distortion map between the original and reconstructed frames, as obtained when encoding with a given QP value.
The CNNs were trained using the parameters displayed in Table I. The stop condition was defined in terms of epochs, where an epoch is a complete pass in which all available samples in the training set are fed to the network. In particular, the training was stopped when the validation loss showed no improvement after 10 additional epochs of training. Furthermore, the loss functions were regularised by adding a norm penalty on the training variables, since this gave better results in earlier training and testing runs.
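The stop condition can be sketched as a patience rule over the per-epoch validation loss. This is an illustrative sketch rather than the authors' training code: the function name `early_stopping_epoch` is hypothetical, and the list of losses stands in for values that a real training loop would produce one epoch at a time.

```python
def early_stopping_epoch(validation_losses, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch that lies `patience` epochs after the
    last improvement; if that never happens, the final epoch is returned.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(validation_losses) - 1, best_epoch


# Illustrative losses: improvement until epoch 2, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.75] * 15
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=10)
```

In practice the weights saved at `best_epoch` would be the ones kept for testing.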
|Batch size||Optimiser||Learning rate||Weight decay|
|QP 22||QP 27||QP 32||QP 37|
IV Experimental Results
The CNNs were implemented in TensorFlow and trained on an NVIDIA GeForce GTX 1080 GPU. The MS COCO 2017 dataset was used for the experiments: frames are selected for training, for validation and for testing. The frames are cropped into patches and converted to YUV colour space. The HEVC reference software (HM 16.9) was used. Four different QPs were considered, namely 22, 27, 32 and 37.
The proposed methods are compared with the work of Xu et al. The base CNNs were implemented following the description provided in that work, which indicates the use of linear activations for convolutional layers, training with the Adam optimiser, learning rate of and no regularisation. Furthermore, the training was done using a batch size of and the same stop condition as in Section III was used. Additionally, the distortion is computed as the pixel-wise map of absolute differences, instead of SSIM, between the original and reconstructed frames. While training the base CNNs, it was noticed that the networks would fluctuate around local minima without stabilising. This behaviour may be due to several factors, including a training dataset that is not large enough or variable updates that use a learning rate that is too high. The proposed CNNs address this issue by including regularisation in the loss function.
Results obtained using the distortion-map CNN are presented here by measuring the local correlation between the predicted and ground truth distortion maps. Correlations were computed by squaring and averaging the distortion maps in blocks of different sizes. The values for each block were arranged in two vectors (one for the ground truth and one for the estimated values), which were then compared using the Pearson Correlation Coefficient (PCC).
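The block-wise correlation measure can be sketched as follows. The sketch assumes non-overlapping square blocks and drops partial blocks at the frame borders; neither detail is stated in the text. The function names are hypothetical.

```python
import math


def block_means_of_squares(dmap, block):
    """Square each sample, then average over non-overlapping block x block tiles."""
    height, width = len(dmap), len(dmap[0])
    means = []
    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            total = 0.0
            for r in range(top, top + block):
                for c in range(left, left + block):
                    total += dmap[r][c] ** 2
            means.append(total / (block * block))
    return means


def pearson(a, b):
    """Pearson Correlation Coefficient of two equally long vectors.

    Undefined (division by zero) if either vector is constant.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def blockwise_pcc(ground_truth, estimate, block):
    """PCC between the block-wise mean squared values of two distortion maps."""
    return pearson(block_means_of_squares(ground_truth, block),
                   block_means_of_squares(estimate, block))


# Toy 4x4 maps (illustrative only): the estimate is the ground truth scaled by 2.
truth = [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
estimate = [[2 * v for v in row] for row in truth]
pcc = blockwise_pcc(truth, estimate, block=2)
```

Because the PCC is invariant to scaling, an estimate that follows the spatial trend of the ground truth scores highly even when its absolute level is off, which is why the measure captures trends rather than exact values.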
Table II shows a summary of the obtained PCC values in terms of QP and block size. It can be noticed that the lower the QP, the lower the correlation between ground truths and estimates, indicating that the CNN predicts more easily in the case of generally higher distortions (obtained with high QPs). Moreover, higher correlations are obtained when considering larger block sizes, which is expected because, even in the case of local distortion estimates, the CNNs are more suitable for predicting global trends. This behaviour is confirmed through a visual comparison, as exhibited in Fig. 3. Although the estimated distortion map does not capture the finer details of the distortion present in the ground truth, trends in distortion variation are accurately estimated.
Results obtained using the second CNN are also presented, in terms of estimating both global distortions and bits. These were analysed using the Fréchet distance (Euclidean), which measures similarity by calculating the minimum length of leash required to connect two curves. In this case, the distance was computed between the interpolated curves of bpp or average PSNR values over QPs obtained from the ground truth and from the estimations. Tables III and IV show these results, respectively. Average PSNR values are also reported for the CNN.
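A minimal sketch of the discrete Fréchet distance with a Euclidean point metric is given below, using the standard dynamic-programming formulation. It operates directly on sampled (QP, value) points rather than on interpolated curves, which is a simplification, and the curve values are made up for illustration. The function name `discrete_frechet` is hypothetical.

```python
import math


def discrete_frechet(curve_p, curve_q):
    """Discrete Fréchet distance between two polylines given as lists of points."""
    n, m = len(curve_p), len(curve_q)
    # ca[i][j] is the coupling distance for the prefixes ending at points i and j.
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(curve_p[i], curve_q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]


# Illustrative (QP, bpp) curves; these are not values from the paper.
truth_curve = [(22, 1.0), (27, 0.6), (32, 0.3), (37, 0.1)]
predicted_curve = [(qp, bpp + 0.1) for qp, bpp in truth_curve]
distance = discrete_frechet(truth_curve, predicted_curve)
```

Note that the two axes have very different ranges in this toy example, so in practice the QP and value axes may need a common scaling before the Euclidean metric is meaningful.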
When considering the estimates of bpp values, results show that the proposed network outperforms the base model, since lower losses and lower Fréchet distances are obtained. Fig. 4 displays bpp predictions per QP for two frames. Although differences can be seen in Fig. 4, there is a strong correlation between ground truths and predictions. Better results are obtained for higher QP values. Similarly, for distortion estimations, a lower loss and a lower Fréchet distance are obtained using the proposed networks. The predictions for two different frames are displayed in Fig. 5. In general, the estimates obtained using the global CNN are better than those from the local distortion-map CNN, confirming that global estimations may be more suitable, unless the application requires local distortions to be available.
This paper presents a CNN-based methodology to estimate the distortion and the number of bits obtained when intra coding original frames at different quality levels. One CNN is used to estimate vectors of compressed frame sizes or global distortions, whilst another CNN is used to estimate local distortion maps. Using the proposed methodology, these data can be estimated prior to the actual encoding process. Results show that, in most cases, the estimates are close to and highly correlated with the real values. Future work includes the improvement of the CNNs, as well as the development of a complete bit allocation algorithm for rate-control applications.
The work leading to this paper was co-supported by the Engineering and Physical Sciences Research Council of the UK through an iCASE grant in cooperation with the British Broadcasting Corporation and by the project COGNITUS, which received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 687605.
- (2013) Enhanced image super-resolution technique using convolutional neural network. In Advances in Visual Informatics, pp. 157–164.
- (2013) Restoring an image taken through a window covered with dirt or rain. In 2013 IEEE International Conference on Computer Vision, pp. 633–640.
- (2014) Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pp. 2366–2374.
- (2016) Deep learning. MIT Press.
- (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
- (Website) https://hevc.hhi.fraunhofer.de
- (2015) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
- (2016) Deep learning-based intra prediction mode decision for HEVC. In 2016 Picture Coding Symposium (PCS).
- (2017) A deep convolutional neural network approach for complexity reduction on intra-mode HEVC. In 2017 IEEE International Conference on Multimedia and Expo (ICME).
- (2014) Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014, pp. 740–755.
- (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
- (2017) Neural network-based arithmetic coding of intra prediction modes in HEVC. In 2017 IEEE Visual Communications and Image Processing (VCIP).
- (2012) Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology 22 (12).
- (2015) An efficient frame-content based intra frame rate control for high efficiency video coding. IEEE Signal Processing Letters 22 (7), pp. 896–900.
- (2013) The discrete Fréchet distance with applications. Montana State University.
- (2017) CNN-based rate-distortion modeling for H.265/HEVC. In 2017 IEEE Visual Communications and Image Processing (VCIP).
- (2018) JVET-IO022-v3: convolutional neural network filter (CNNF) for intra frame. Technical report.