Super-resolution (SR), which estimates a high-resolution (HR) image from its low-resolution (LR) counterpart, is an important task in computer vision. It has attracted much attention from the research community and has a wide range of applications [4, 23, 29].
Convolutional neural networks have achieved excellent performance in super-resolution; however, two main challenges remain. The first is that the high-frequency information missing from the LR image cannot be reconstructed very well. Early neural networks can produce a good HR image from a given LR image at small scale factors by minimizing the mean squared error (MSE) between the reconstructed image and the ground truth [6, 7, 14, 15, 25]. However, these methods may fail to reconstruct high-quality images at large scale factors such as 4. Deep networks such as DRRN, EDSR and MDSR achieve a high PSNR on the reconstructed image, but the high-frequency details are missing. The second challenge is that reconstructed details can be fabricated. SRGAN is good at restoring the high-frequency information of HR images, but its PSNR is relatively low because some of that high-frequency information is fabricated and unfaithful to the ground truth.
For characters, details are important, since they often determine whether a character is recognisable. For example, the characters in Figure 1(a)-(c) are difficult to recognise, and some are mis-recognised. The lack of high-frequency information in general deep networks and the fake high-frequency details produced by GANs turn details into obstacles to recognition for both computers and humans. In other words, these networks are not appropriate for character super-resolution. Therefore, a new network suited to character super-resolution is needed.
In this paper, we propose a novel network based on SRGAN that removes the VGG loss and adds a classifier module to classify the images reconstructed by the generator. Figure 1 shows reconstructed images produced by various methods. We highlight some pixels with red circles to show the differences from the results of SRGAN and SRResNet. The proposed method, called Generative Adversarial Classifier (GAC), can reconstruct images that contain more high-frequency information than those of SRResNet and are more faithful to the ground truth than those of SRGAN.
We use the classification loss as additional information to constrain the generator and make the reconstructed images more recognisable. In this sense, our network is similar to Triple-GAN. Triple-GAN also has three parts, and its discriminator decides whether an image-label pair comes from the true distribution. This joint-distribution discrimination makes Triple-GAN unable to handle data with a large number of classes (e.g., CASIA-HWDB1.1, which contains 3,755 classes), since feeding an image together with a 3,755-class label into the discriminator occupies a large amount of memory. In contrast, in our network the discriminator and the classifier are each connected only to the generator, and the discriminator does not distinguish the joint distribution of images and labels. This makes our network (1) suitable for problems with a large number of classes, in particular Chinese character recognition, and (2) much easier to optimise, because our discriminator has far fewer parameters than Triple-GAN's even though it is significantly deeper.
The recognition rate on images reconstructed by our network is higher than that of SRGAN on CASIA-HWDB1.1 with an upscaling factor of 8. Our network achieves a top-1 accuracy of 63.95% and a top-3 accuracy of 80.69%, whereas SRGAN achieves 53.28% and 69.52%, respectively. Besides CASIA-HWDB1.1, we also evaluated the proposed method on the benchmark datasets MNIST and CIFAR-10 (the latter is not a handwriting dataset). The experimental results show that our method achieves significantly better results than current state-of-the-art approaches.
2 Related Work
In this section, we review related work on image super-resolution and Generative Adversarial Nets.
2.1 Image Super-resolution
Research on image super-resolution falls into two categories: single image super-resolution (SISR) and multiple image super-resolution (MISR). Our work belongs to the first category, so we focus on SISR and do not further discuss approaches that reconstruct HR images from multiple images.
Recently, convolutional neural network (CNN) based SR algorithms have shown excellent performance. Wang et al. encoded a sparse representation prior into a feed-forward network architecture based on the learned iterative shrinkage and thresholding algorithm (LISTA). Dong et al. [5, 6] used bicubic interpolation to upscale the input image and trained a three-layer convolutional network end-to-end. The deeply-recursive convolutional network (DRCN) is a highly effective architecture that allows long-range pixel dependencies while keeping the number of model parameters small. Johnson et al. and Bruna et al. proposed perceptual loss functions to reconstruct visually more convincing HR images.
2.2 Generative Adversarial Nets (GAN)
Generative Adversarial Nets (GAN) were proposed by Goodfellow et al. and contain two parts, a generator and a discriminator. The generator is responsible for generating images close to real pictures in order to fool the discriminator, and the discriminator is responsible for deciding whether a picture comes from the generator or is real. The problem of adversarial examples has also been raised, and many methods have been proposed to address it.
In 2016, Radford et al. proposed DCGAN, which is stable in most settings and shows vector arithmetic as an intrinsic property of the representations learned by the generator. Mirza et al. proposed the conditional GAN, which uses labels for some of the data to help the network build salient representations; by adding the label as another input to the generator, it can control the generator's outputs without changing the architecture. Ledig et al. proposed SRGAN, which reconstructs the HR image with a GAN based on ResNet and achieves remarkable perceptual quality but low PSNR. Triple-GAN, proposed by Li et al., contains three parts: a classifier that (approximately) characterizes the conditional distribution $p(y \mid x)$, a class-conditional generator that (approximately) characterizes the conditional distribution in the other direction, $p(x \mid y)$, and a discriminator that distinguishes whether a pair of data $(x, y)$ comes from the true distribution $p(x, y)$. The final goal of Triple-GAN is to predict the labels $y$ for unlabeled data as well as to generate new samples $x$ conditioned on $y$.
In single image super-resolution, the aim is to estimate a high-resolution, super-resolved image $I^{SR}$ from a low-resolution input image $I^{LR}$. Here $I^{LR}$ is the low-resolution version of its high-resolution counterpart $I^{HR}$. In our network, each high-resolution image $I^{HR}$ also has a label $y$. The proposed overall network is illustrated in Figure 2. The generator $G$ produces the reconstructed images $I^{SR}$ from the given low-resolution images $I^{LR}$, the discriminator $D$ distinguishes $I^{SR}$ from $I^{HR}$, and the classifier $C$ predicts labels for $I^{SR}$. The discriminator and the classifier are both linked to the generator and guide it toward more realistic yet recognisable reconstructions.
Our ultimate goal is to train a generating function $G$ that estimates, for a given LR input image, a reconstructed image that is as good as possible. To achieve this, we train a generator network $G_{\theta_G}$ as a CNN parametrized by $\theta_G$. For training images $I^{HR}_n$, $n = 1, \dots, N$, with corresponding $I^{LR}_n$ and labels $y_n$, the SR-specific problem is formulated as:
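A sketch of this objective in the notation of SRGAN, which this section follows closely. The symbols $l^{SR}$ (the super-resolution loss), $N$ (the number of training images) and $\theta_G$ (the generator parameters) are borrowed from SRGAN and are assumptions here; in GAC the loss may additionally depend on the label $y_n$ through the classification term.

```latex
\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N}
    l^{SR}\left( G_{\theta_G}(I^{LR}_n),\; I^{HR}_n \right)
```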
In this work we will specifically design a loss function as a weighted combination of several loss components.
3.1 Adversarial Network Architecture
Inspired by Goodfellow et al. and SRGAN, we define a discriminator network $D_{\theta_D}$, which we optimize alternately with the generator $G_{\theta_G}$; the objective is to solve the adversarial min-max problem:
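In the standard GAN form used by SRGAN, this min-max problem can be written as below. The distribution names $p_{\text{train}}$ and $p_G$ are taken from SRGAN and are assumptions here, not notation confirmed by this paper.

```latex
\min_{\theta_G} \max_{\theta_D} \;
    \mathbb{E}_{I^{HR} \sim p_{\text{train}}(I^{HR})}
        \left[ \log D_{\theta_D}(I^{HR}) \right]
  + \mathbb{E}_{I^{LR} \sim p_G(I^{LR})}
        \left[ \log \left( 1 - D_{\theta_D}\left( G_{\theta_G}(I^{LR}) \right) \right) \right]
```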
This formulation follows the basic working principle of GANs: a generator is trained to fool a discriminator, which is in turn trained to distinguish super-resolved images from real images. With this approach, the generator can learn to reconstruct images that are realistic and highly similar to real images, to the point that the discriminator has difficulty telling them apart. This encourages results that are perceptually superior to those obtained by minimizing pixel-level error measures such as the mean squared error (MSE).
For our generator network $G$ and discriminator network $D$, we adopt the SRGAN architecture. The generator network, illustrated in Figure 3(a), consists of 16 residual blocks with identical layout, where each block has two convolutional layers with small $3 \times 3$ kernels and 64 feature maps, followed by batch-normalization layers and Parametric ReLU as the activation function. Before the output layer, we increase the resolution of the image with two trained upsampling blocks, each containing one convolutional layer with small $3 \times 3$ kernels followed by one sub-pixel convolution layer and Parametric ReLU as the activation function.
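The sub-pixel convolution layer used in the upsampling blocks rearranges channels into spatial positions. The following is a minimal pure-Python sketch of that rearrangement for an upscaling factor `r`. The function name `pixel_shuffle`, the nested-list layout (channels, height, width) and the channel ordering are illustrative assumptions, not the authors' implementation.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Assumed channel ordering: input channel c*r*r + i*r + j supplies the
    pixel at offset (i, j) inside each r-by-r output cell of channel c.
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out
```

For example, four 1×1 channels with `r = 2` become a single 2×2 channel: `pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2)` returns `[[[0, 1], [2, 3]]]`.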
We avoid max-pooling throughout the network. The discriminator network is trained to solve the maximization problem in Equation 2. It contains eight convolutional layers with an increasing number of $3 \times 3$ filter kernels, doubling from 64 to 512 kernels. Strided convolutions are used to reduce the image resolution each time the number of features is doubled. The resulting 512 feature maps are followed by two dense layers and a final sigmoid activation function.
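The layer widths and strides of such a discriminator can be tabulated with a few lines of Python. This is only a sketch under the assumption of the SRGAN layout, in which each width is used for two layers and the second of the pair has stride 2; the 3×3 kernel with padding 1 and the helper names are likewise assumptions.

```python
def discriminator_plan(first=64, last=512, layers_per_width=2):
    """Return (out_channels, stride) for each convolutional layer."""
    plan = []
    channels = first
    while channels <= last:
        for i in range(layers_per_width):
            # Assumed: only the last layer of each width is strided.
            stride = 2 if i == layers_per_width - 1 else 1
            plan.append((channels, stride))
        channels *= 2
    return plan


def conv_out_size(size, stride, kernel=3, padding=1):
    """Spatial size after one convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_size(size, plan):
    """Spatial size after applying every layer in the plan."""
    for _, stride in plan:
        size = conv_out_size(size, stride)
    return size
```

With the defaults, the plan has eight layers with widths 64, 64, 128, 128, 256, 256, 512, 512. For a hypothetical 64×64 input, `feature_map_size(64, discriminator_plan())` gives 4, so the dense layers would receive 512 feature maps of size 4×4.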
For the classifier, we simply use 3 convolutional layers with an increasing number of filter kernels, doubling from 64 to 128 kernels, followed by two dense layers and a final softmax activation function that yields the class probabilities, as illustrated in Figure 3(c).
3.2 Loss Function
The definition of the loss function is critical for the performance of our generator network. While the super-resolution loss $l^{SR}$ is commonly modeled based on the MSE, we design a loss function that assesses a solution with respect to perceptually relevant characteristics. We formulate the loss function as the weighted sum of a content loss, an adversarial loss component and a classification loss component:
3.2.1 Content Loss
We use the pixel-wise MSE loss as our content loss calculated as:
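In the SRGAN notation that this section follows, the pixel-wise MSE loss takes the form below. The symbol $l^{SR}_{MSE}$ and the pixel indices $x$, $y$ are assumptions carried over from SRGAN.

```latex
l^{SR}_{MSE} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH}
    \left( I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y} \right)^2
```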
where $I^{LR}$ and $I^{HR}$ are the low-resolution and high-resolution images, and $W$, $H$, $C$ and $r$ are the width, height, number of channels and scale factor, respectively. We describe $I^{LR}$ by a real-valued tensor of size $W \times H \times C$, and $I^{HR}$ and $I^{SR}$ by tensors of size $rW \times rH \times C$. For character images, $C$ can generally be set to 1 or 3.
This is the most widely used optimization target for image super-resolution. However, while achieving particularly high PSNR, solutions of MSE optimization problems often lack high frequency content; this results in perceptually unsatisfying solutions with overly smooth textures.
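The link between MSE and PSNR can be made concrete with a small pure-Python sketch. PSNR is computed here with the usual definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$; the function names, the single-channel nested-list image layout and the default peak value of 255 are illustrative assumptions.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized 2-D images (lists of rows)."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / err)
```

Because PSNR is a decreasing function of MSE, a network trained only on MSE directly maximizes PSNR, which is consistent with the high PSNR but overly smooth results described above.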
3.2.2 Adversarial Loss
Following the GAN architecture, we add the adversarial loss to our loss function. This encourages our network to generate more natural and realistic-looking images by trying to fool the discriminator network. The adversarial loss is defined based on the probabilities of the discriminator over all training samples as:
Here, $D(G(I^{LR}))$ is the probability that the reconstructed image $G(I^{LR})$ is judged as a natural image by the discriminator. For better gradient behavior, we minimize this expression instead of the original GAN adversarial loss.
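The gradient argument can be checked numerically. The sketch below assumes the two generator losses known from the GAN literature, $-\log D(G(I^{LR}))$ and the original $\log(1 - D(G(I^{LR})))$, and assumes that the discriminator output is a sigmoid of a logit `a`; the function names are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), which simplifies to -(1 - sigmoid(a))."""
    return -(1.0 - sigmoid(a))


def grad_original(a):
    """d/da of log(1 - sigmoid(a)), which simplifies to -sigmoid(a)."""
    return -sigmoid(a)


# When the discriminator confidently rejects a reconstruction (logit far
# below zero), the original loss yields almost no gradient, while the
# -log D form still provides a gradient of magnitude close to 1.
for logit in (-6.0, 0.0, 6.0):
    print(logit, grad_non_saturating(logit), grad_original(logit))
```

Early in training, reconstructions are easy to reject, so the discriminator logit is strongly negative; this is the regime in which the $-\log D$ form keeps the generator learning.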
3.2.3 Classification Loss
We introduce a third player, the classifier, into our proposed GAC model; in a general network, the classifier characterizes the conditional distribution $p(y \mid x)$. In our network, the classifier should label a given reconstructed image correctly. We can achieve this simply by minimizing the cross-entropy loss:
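A cross-entropy loss of this kind can be sketched in pure Python as follows. The sketch assumes the classifier outputs one vector of unnormalized scores (logits) per reconstructed image and that labels are integer class indices; the function names are illustrative and not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one score vector."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the true labels."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        total -= log_softmax(logits)[label]
    return total / len(labels)
```

The loss is small when the classifier assigns high probability to the true class of a reconstructed image, so its gradient with respect to the generator pushes reconstructions toward recognisable characters.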
To make sure that the learned distribution is as close as possible to the true data distribution, we need one more loss:
Consequently, we define the overall loss function as:
The detailed algorithm to minimize the overall loss function is given in Algorithm 1.
Algorithm 1: Minibatch stochastic gradient descent training method of Classified-GAN in SSL
Sample a batch of pairs of size
Update by ascending along its stochastic gradient:
Update by descending along stochastic gradient:
Update by descending along its stochastic gradient:
Update by descending along its stochastic gradient:
4.1 Experimental Set-up
We perform experiments on the widely used handwritten Chinese character dataset CASIA-HWDB1.1 and the handwritten digit dataset MNIST. In addition, to check whether our method also works on non-text data, we evaluate it on CIFAR-10.
CASIA-HWDB1.1 consists of 897,758 training samples and 223,991 testing samples for 3,755 classes. On the CASIA-HWDB1.1 and CIFAR-10 datasets, experiments are performed with a fixed scale factor between low- and high-resolution images; on MNIST, the scale factor is set separately. We also implemented other super-resolution methods, including bicubic interpolation, SRResNet, SRGAN and Triple-GAN, and compared them on the three benchmark datasets. Our code will be uploaded to GitHub once this paper is published.
We obtained the LR images by downsampling the HR images with a bicubic kernel, using the respective downsampling factors for CASIA-HWDB1.1 and CIFAR-10 and for MNIST. For each mini-batch we pick 128 random HR images from distinct training images. Note that the generator can be applied to images of arbitrary size because it is fully convolutional. The MSE loss was thus calculated on images of intensity range . We employed the trained MSE-based SRResNet network as initialization for the generator when training the actual GAN to avoid undesired local optima.
As mentioned above, the performance of the network is not easily measured by the human eye. For character recognition, the simplest performance test is to recognise the reconstructed character images with a classifier and use the recognition accuracy as the evaluation standard. We train a simple classifier for this measurement, which contains 3 convolutional layers and 2 dense layers. It achieves 89.29% top-1 accuracy on CASIA-HWDB1.1.
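The top-1 and top-3 accuracies reported in this paper follow the usual top-k definition, which can be sketched as below. The function name, the per-class score lists and the integer label encoding are illustrative assumptions.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

Here `scores` would hold the measurement classifier's outputs on the reconstructed test images, and `labels` the ground-truth classes of the corresponding HR images.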
For SRResNet and SRGAN, we first train the two networks on the CASIA-HWDB1.1 training set, using the training images as HR images and their downsampled versions as input. We then obtain the reconstructed test images by feeding the downsampled CASIA-HWDB1.1 test set to each network. Finally, we use the measurement classifier to recognise the reconstructed test images.
For our proposed GAC model, we use two training strategies. The first is to initialize the classifier $C$ from a pre-trained classifier and then freeze it so that its parameters are not updated. In this way, $C$ plays the same role in the network as VGG does in SRGAN, restricting the learned distribution to approach the ground-truth distribution. The second strategy also initializes $C$ from the pre-trained classifier but lets the network update its parameters during training. In this strategy, the learned distribution may deviate from the ground-truth distribution, but $C$ becomes better suited to the generator. The hyper-parameter in Equation 8 was tuned empirically, and we chose the best value on a validation set.
4.2 Experimental Results
We report the experimental results on CASIA-HWDB1.1 in Table 1. As can be clearly observed, our proposed GAC model achieves significantly better performance than all the compared algorithms. In particular, the GAC model without a fixed classifier $C$ achieves a top-1 accuracy of 63.95%, more than 10 percentage points higher than SRGAN (53.28%), the best of the other competing algorithms. A simplified version of GAC with $C$ fixed also leads to a significant improvement over SRGAN. Note that we do not report the performance of Triple-GAN on CASIA-HWDB1.1, since training it on this large-category dataset with 3,755 classes is intractable due to its inherent nature.
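Table 1: Recognition accuracy (%) on CASIA-HWDB1.1.
| Method | Top-1 | Top-3 |
|---|---|---|
| GAC (fixed C, hyper-parameter = 0.001) | 58.24 | 74.03 |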
To further check the sensitivity of the proposed GAC to the hyper-parameter defined in Equation 8, we also report the recognition performance for different values on CASIA-HWDB1.1; the results are shown in Figure 4. We observe that, although the proposed GAC network is in general insensitive to the hyper-parameter, smaller values usually lead to better performance. In contrast, GAC (fixed C) is more sensitive to it than GAC, and the smaller its value, the closer the network is to SRGAN.
On MNIST and CIFAR-10 we also apply both of the above strategies, but we do not need a pre-trained initialization for the classifier, since the model can be trained well without it on 10-class datasets. For Triple-GAN, since its original purpose is not super-resolution, we remove the label from the input of the generator to adapt it to our task. The results are reported in Table 2. From Tables 1 and 2, we can see that jointly training three components, as in both our proposed GAC model and Triple-GAN, improves the recognition accuracy substantially. Furthermore, our proposed GAC model outperforms Triple-GAN on both MNIST and CIFAR-10. These results validate the effectiveness of the proposed GAC model.
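Table 2: Recognition accuracy (%) on MNIST and CIFAR-10.
| Method | MNIST | CIFAR-10 |
|---|---|---|
| GAC (fixed C, hyper-parameter = 0.001) | 93.50 | 42.68 |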
We propose a new three-player generative adversarial classifier (GAC) with three components, a generator, a discriminator and a classifier, designed particularly for character super-resolution. Specifically, by additionally involving a classifier in the training process of a normal GAN, GAC is calibrated to learn suitable structures and to restore character images that benefit classification. Our empirical results on the CASIA-HWDB1.1, MNIST and CIFAR-10 datasets demonstrate that GAC achieves state-of-the-art classification results for character super-resolution.
The work was partially supported by the following: National Natural Science Foundation of China under no. 61473236 and 61876155; The Natural Science Foundation of the Jiangsu Higher Education Institutions of China under no. 17KJD520010; Suzhou Science and Technology Program under no. SYG201712, SZS201613; Natural Science Foundation of Jiangsu Province BK20181189, 17KJB520041; Key Program Special Fund in XJTLU under no. KSF-A-01, KSF-A-10, KSF-P-02.
- [1] S. Borman and R. L. Stevenson, Super-resolution from image sequences-a review, in Circuits and Systems, 1998. Proceedings. 1998 Midwest Symposium on, IEEE, 1998, pp. 374–378.
- [2] J. Bruna, P. Sprechmann, and Y. LeCun, Super-resolution with deep convolutional sufficient statistics, arXiv preprint arXiv:1511.05666, (2015).
- [3] L. Chongxuan, T. Xu, J. Zhu, and B. Zhang, Triple generative adversarial nets, in Advances in Neural Information Processing Systems, 2017, pp. 4091–4101.
- [4] D. Chowdhuri, K. Sendhil Kumar, M. R. Babu, and C. P. Reddy, Very low resolution face recognition in parallel environment, (IJCSIT) International Journal of Computer Science and Information Technologies, 3 (2012), pp. 4408–4410.
- [5] C. Dong, C. C. Loy, K. He, and X. Tang, Learning a deep convolutional network for image super-resolution, in European Conference on Computer Vision, Springer, 2014, pp. 184–199.
- [6] C. Dong, C. C. Loy, K. He, and X. Tang, Image super-resolution using deep convolutional networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 38 (2016), pp. 295–307.
- [7] C. Dong, C. C. Loy, and X. Tang, Accelerating the super-resolution convolutional neural network, in European Conference on Computer Vision, Springer, 2016, pp. 391–407.
- [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
- [9] K. Gregor and Y. LeCun, Learning fast approximations of sparse coding, in Proceedings of the 27th International Conference on Machine Learning, Omnipress, 2010, pp. 399–406.
- [10] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
- [11] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [12] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167, (2015).
- [13] J. Johnson, A. Alahi, and L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in European Conference on Computer Vision, Springer, 2016, pp. 694–711.
- [14] J. Kim, J. Kwon Lee, and K. Mu Lee, Accurate image super-resolution using very deep convolutional networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.
- [15] J. Kim, J. Kwon Lee, and K. Mu Lee, Deeply-recursive convolutional network for image super-resolution, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1637–1645.
- [16] A. Krizhevsky, V. Nair, and G. Hinton, The cifar-10 dataset, online: http://www.cs.toronto.edu/~kriz/cifar.html, (2014).
- [17] Y. LeCun, C. Cortes, and C. Burges, Mnist handwritten digit database, AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2 (2010).
- [18] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, arXiv preprint, (2016).
- [19] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, Enhanced deep residual networks for single image super-resolution, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, vol. 1, 2017, p. 3.
- [20] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, Online and offline handwritten chinese character recognition: benchmarking on new databases, Pattern Recognition, 46 (2013), pp. 155–162.
- [21] C. Lyu, K. Huang, and H.-N. Liang, A unified gradient regularization family for adversarial examples, in Data Mining (ICDM), 2015 IEEE International Conference on, IEEE, 2015, pp. 301–309.
- [22] M. Mirza and S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784, (2014).
- [23] K. Nasrollahi and T. B. Moeslund, Super-resolution: a comprehensive survey, Machine Vision and Applications, 25 (2014), pp. 1423–1468.
- [24] A. Radford, L. Metz, and S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434, (2015).
- [25] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
- [26] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
- [27] Y. Tai, J. Yang, and X. Liu, Image super-resolution via deep recursive residual network, in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2017.
- [28] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, Deep networks for image super-resolution with sparse prior, in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 370–378.
- [29] Q. Yang, R. Yang, J. Davis, and D. Nistér, Spatial-depth super resolution for range images, in Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, IEEE, 2007, pp. 1–8.