Image compression with deep learning systems is an active area of research that has recently become very compelling compared with modern natural-image codecs such as JPEG2000, BPG, and WebP, currently developed by Google. The new deep learning methods are based on an auto-encoder architecture in which the feature maps generated by a Convolutional Neural Network (CNN) encoder are passed through a quantizer to create a binary representation and are subsequently given as input to a CNN decoder for the final reconstruction. In this view, several encoder and decoder models have been suggested: a ResNet-style network with parametric rectified linear units (PReLU), a generative approach built on GANs, and an innovative hybrid network made of Gated Recurrent Units (GRUs) and ResNet. In contrast, this paper proposes a super-resolution approach, built on a modified version of SRGAN, in which downsampled JPEG images are converted to High Resolution (HR) images. In order to improve the final PSNR results, a Reinforcement Learning (RL) approach is used to indirectly maximize the PSNR function with an A3C policy joined end-to-end with SRGAN. The main contributions of this work are: (i) a compression pipeline based on JPEG image downsampling combined with a super-resolution deep network; (ii) a new way of maximizing non-differentiable metrics through RL. Although the PSNR metric is itself a fully differentiable function, the proposed method could be used in future applications with non-Euclidean distances, such as those in Dynamic Time Warping (DTW) algorithms.
This section gives a more formal description of the proposed system, covering the network architecture and the losses used for training.
2.1 Network Architecture
The architecture consists of three main blocks: an encoder, a decoder, and a discriminator (Figure 1). The input is the low-resolution (LR) image (i.e., compressed with a JPEG encoder), with C color channels and W, H the image width and height, on which a bicubic downsampling operation with a fixed factor is applied. The output is an HR image.
The encoder is essentially a ResNet. Its first convolution block, defined by a kernel size and a number of Feature Maps (FM), is followed by an activation function. Then, five Residual Blocks (RB) are stacked together. Each RB consists of two convolution layers followed by Batch Normalisation (BN) and an activation function. After that, a final convolution block is applied. The encoder is also joined with a fully connected layer and, at each training iteration, produces an action prediction of the current PSNR together with a value function (explained in Section 2.2.2).
The decoder is another deep network that increases the resolution of the encoder output with eight sub-pixel layers.
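The rearrangement performed by a sub-pixel layer can be illustrated with a minimal pure-Python sketch. It covers only the depth-to-space step (the convolution that precedes it in a real layer is omitted), and the function name `pixel_shuffle`, the nested-list layout, and the channel ordering are assumptions made for illustration, not details taken from the paper.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Input channel c*r*r + dy*r + dx supplies the output pixel at
    row i*r + dy, column j*r + dx of output channel c.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for dy in range(r):
            for dx in range(r):
                src = x[c * r * r + dy * r + dx]
                for i in range(h):
                    for j in range(w):
                        out[c][i * r + dy][j * r + dx] = src[i][j]
    return out


# Four 1x2 feature maps become a single 2x4 map for an upscaling factor of 2.
features = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
upscaled = pixel_shuffle(features, 2)
```

Each sub-pixel layer trades channel depth for spatial resolution, which is why stacking several of them raises the resolution of the encoder output step by step.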
The encoder, joined with the decoder, defines the generator, whose parameters are the weights and biases of its L layers. A third network, called the discriminator, is optimized concurrently with the generator to solve the following adversarial min-max problem:
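Assuming the objective keeps the standard SRGAN form on which the model is built, it can be sketched as follows. The symbols $G_{\theta_G}$ (generator), $D_{\theta_D}$ (discriminator), and $I^{LR}$, $I^{HR}$ (low- and high-resolution images) are borrowed from the SRGAN paper and are assumptions about the notation used here:

```latex
\min_{\theta_G}\,\max_{\theta_D}\;
\mathbb{E}_{I^{HR}\sim p_{\mathrm{train}}(I^{HR})}
  \big[\log D_{\theta_D}(I^{HR})\big]
+ \mathbb{E}_{I^{LR}\sim p_{G}(I^{LR})}
  \big[\log\big(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))\big)\big]
```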
The idea behind this loss is to train a generative model to fool the discriminator. The discriminator is trained to distinguish the super-resolution images produced by the generator from those of the training dataset. In this way, the discriminator finds it increasingly hard to tell generated images from real ones, which drives the generator to produce results closer to the HR training images. In the proposed model, the discriminator is parameterized as a VGG network with LeakyReLU activations.
2.2 Loss function
An accurate definition of the loss function is crucial for the performance of the generator. This subsection is divided into three parts: the SRGAN loss, the RL loss, and the proposed loss.
2.2.1 SRGAN loss
The SRGAN loss is determined as a combination of three separate losses: the MSE loss, the VGG loss, and the GAN loss. The MSE loss is the pixel-wise mean squared error between the generated image and the HR ground truth. The VGG loss is the same error computed on feature maps, based on the ReLU activations of a 19-layer VGG network. W and H are the dimensions of the image in the MSE loss, whereas for the VGG loss they are the dimensions of the output FM. The GAN loss is the one defined in Equation 1. Finally, the total SRGAN loss combines these three terms.
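Assuming the three terms follow their usual SRGAN definitions, they can be sketched as below. The symbol $\phi$ for the VGG19 feature extractor, the loss names, and the weights $\lambda_1$, $\lambda_2$ are hypothetical placeholders: the weighting actually used by the authors is not given in the text.

```latex
\begin{aligned}
\ell_{MSE} &= \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}
  \big(I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y}\big)^{2} \\
\ell_{VGG} &= \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}
  \big(\phi(I^{HR})_{x,y} - \phi(G_{\theta_G}(I^{LR}))_{x,y}\big)^{2} \\
\ell_{SRGAN} &= \ell_{MSE} + \lambda_{1}\,\ell_{VGG} + \lambda_{2}\,\ell_{GAN}
\end{aligned}
```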
2.2.2 RL loss
The aim of the RL loss is to indirectly maximize the PSNR through an actor-critic approach. The policy is a map between the low-resolution input and the current PSNR value prediction (see Figure 1). At each training iteration, the reward is computed by thresholding the PSNR at the previous iteration against the PSNR at the current iteration, as follows:
where the PSNR function is defined as:
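For reference, the usual definition of PSNR is sketched here. The symbols $MAX$ (maximum pixel value), $\hat{I}$ (output HR image), $I$ (HR ground truth), and $N$ (number of pixel values) are assumed names, not necessarily those of the authors:

```latex
\mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{MAX^{2}}{\mathrm{MSE}}\right),
\qquad
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\big(\hat{I}_{i} - I_{i}\big)^{2}
```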
In this definition, the peak term is the maximum pixel value of the HR image, and the error is computed for each pixel, at HR size, between the HR output image and the corresponding HR ground truth. The reward (Eq. 5) depends on the action taken by the policy for two main reasons: (i) during the training process, the action prediction becomes an optimal estimator of the decoder output (used in Eq. 6); (ii) the latent space between the encoder and the fully connected layer is the same and shares the same policy information. All the rewards are therefore accumulated over the training steps through the following return function:
where γ is a discount factor. It is therefore possible to define a function as the expectation of this return, conditioned on the input.
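The chain from PSNR to reward to accumulated return can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the binary rule in `threshold_reward` (1 if the PSNR improved, 0 otherwise) is a hypothetical reading of the thresholding step, and the function names, the flat pixel lists, and the value of `gamma` are chosen for the example only.

```python
import math


def psnr(output, target, max_value=255.0):
    """PSNR in dB between two equally sized flat pixel sequences."""
    if len(output) != len(target) or not output:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def threshold_reward(prev_psnr, curr_psnr):
    """Hypothetical reward rule: 1 if the PSNR improved, else 0."""
    return 1.0 if curr_psnr > prev_psnr else 0.0


def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**k * rewards[k], accumulated from the last reward back."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Toy PSNR history over four iterations against a fixed ground truth.
target = [10, 20, 30, 40]
psnr_history = [
    psnr([0, 0, 0, 0], target),
    psnr([8, 18, 28, 38], target),
    psnr([9, 19, 29, 39], target),
    psnr([9, 19, 29, 39], target),
]
rewards = [threshold_reward(a, b) for a, b in zip(psnr_history, psnr_history[1:])]
ret = discounted_return(rewards, gamma=0.5)
```

Because the reward only depends on whether the PSNR went up, it gives no gradient with respect to the image itself, which is why the signal has to reach the network through the policy gradient rather than through direct backpropagation.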
Note that the encoder, together with the fully connected layer, becomes the policy network. This policy network is parametrized by the encoder parameters, which are updated by the standard policy-gradient method in the following gradient direction:
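Assuming the update follows the REINFORCE rule as presented in the A3C paper cited in the text, this direction can be written as below, where $s_t$ is the LR input, $a_t$ the predicted action, $\theta$ the policy parameters, and $R_t$ the return (all symbol names are assumptions):

```latex
\nabla_{\theta}\,\log \pi(a_t \mid s_t;\theta)\,R_t
```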
It can be considered an unbiased estimate of the gradient of the expected return. To reduce the variance of this estimate while keeping it unbiased, it is desirable to subtract from the return function a baseline, called the value function. The total policy gradient is given by:
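Under the same assumed notation as the A3C paper, with $V(s_t)$ denoting the value-function baseline, the gradient with baseline subtraction takes the form:

```latex
\nabla_{\theta}\,\log \pi(a_t \mid s_t;\theta)\,\big(R_t - V(s_t)\big)
```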
The difference between the return and the baseline can be considered an estimate of the advantage of the prediction for a given input. Consequently, a learnable estimate of the value function is used as the baseline. This approach is further called generative actor-critic because the prediction is the actor while the baseline is its critic. The RL loss is then calculated as:
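One common way to turn the advantage into a scalar loss, in the style of A3C, is sketched below for a single sample. This is an assumption about the general shape of the loss, not the authors' exact formula: the function name, the `value_coef` weight, and the omission of an entropy term are all choices made for the illustration.

```python
def actor_critic_loss(log_prob, value, ret, value_coef=0.5):
    """A3C-style scalar loss for one sample.

    log_prob: log-probability of the taken action under the policy (actor).
    value:    value-function estimate for the input (critic).
    ret:      accumulated discounted return.
    """
    advantage = ret - value              # how much better than the baseline
    policy_loss = -log_prob * advantage  # actor term (advantage held constant)
    value_loss = advantage ** 2          # critic regression term
    return policy_loss + value_coef * value_loss


loss = actor_critic_loss(log_prob=-0.5, value=1.0, ret=1.5)
```

In an automatic-differentiation framework the advantage in the actor term would be detached from the graph, so that only the critic term trains the value estimate.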
2.2.3 Proposed loss
The Proposed Loss (PL) combines the SRGAN loss and the RL loss. After a fixed number of steps (owing to the reward accumulation process at each training iteration), the RL loss is added to the SRGAN loss.
2.3 Experiments and Results
This section evaluates the proposed method. The dataset used is the CLIC compression dataset, divided into training, validation, and test sets of HR images. The evaluation metrics are PSNR and MS-SSIM for both the validation and test sets. An ADAM optimizer is used, and the model is iterated until convergence. The Reinforcement Learning SRGAN (RL-SRGAN) is compared with the SRGAN model and with Lanczos resampling (i.e., a smooth interpolation through a convolution between the image and a stretched sinc function). Table 1 shows that RL-SRGAN improves the PSNR by 0.9 over Lanczos upsampling and by 0.19 over SRGAN, whereas the MS-SSIM remains constant between RL-SRGAN and SRGAN on the validation set. This indicates a better accuracy for the RL-SRGAN model. RL-SRGAN is also evaluated in terms of PSNR and MS-SSIM on the test set. Furthermore, the compressed validation set images occupy 3,812,623 bytes, compared with 362,236,068 bytes for the original HR dataset, while the compressed test set images occupy 5,228,411 bytes, in contrast with the 5,882,850,012 bytes of the original. This makes the method a good trade-off between compression capacity and acceptable PSNR.
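The byte counts quoted above can be turned into compression ratios with a short calculation. The helper name `compression_ratio` is hypothetical; the four byte counts are the ones reported in the text.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Return how many times smaller the compressed data is."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


# Sizes reported for the validation and test sets (in bytes).
validation_ratio = compression_ratio(362_236_068, 3_812_623)
test_ratio = compression_ratio(5_882_850_012, 5_228_411)

print(f"validation: {validation_ratio:.1f}x, test: {test_ratio:.1f}x")
```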
A modified version of SRGAN is proposed in which an A3C method is joined with GANs. The proposed method has strong limitations due to the drastic downsampling of the input JPEG image. This downsampling causes a loss of information that is difficult for the super-resolution network to recover, which leads to lower PSNR and MS-SSIM results on the test set. Nevertheless, the results (Table 1) show a slight improvement of RL-SRGAN over SRGAN and over a baseline Lanczos upsampling filter. Moreover, the proposed method compresses all test files parsimoniously compared with the challenge methods: the total size of the compressed test set is 5,236,870 bytes, compared with 15,748,677 bytes for the CLIC 2019 winner. Finally, a new method for maximizing non-differentiable functions through deep reinforcement learning is suggested.
-  David S. Taubman and Michael W. Marcellin. JPEG2000: Image Compression Fundamentals, Standards, and Practice. Kluwer Academic Publishers, Boston, 2002.
-  Fabrice Bellard. BPG image format. https://bellard.org/bpg.
-  WebP image format. https://developers.google.com/speed/webp.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
-  Haojie Liu, Tong Chen, Qiu Shen, Tao Yue, and Zhan Ma. Deep image compression via end-to-end learning. June 2018.
-  Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Extreme learned image compression with gans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
-  George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full resolution image compression with recurrent neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 105–114, 2017.
-  Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.
-  Eamonn J. Keogh and Michael J. Pazzani. Scaling up dynamic time warping for datamining applications. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 285–289. ACM, 2000.
-  Andrew P. Aitken, Christian Ledig, Lucas Theis, Jose Caballero, Zehan Wang, and Wenzhe Shi. Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize. CoRR, abs/1707.02937, 2017.
-  Workshop and challenge on learned image compression (clic). http://www.compression.cc/.
-  Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multi-scale structural similarity for image quality assessment. pages 1398–1402, 2003.