Scene text recognition is a fundamental task in computer vision, since it is usually a key step towards many downstream text-related applications, including document retrieval, card recognition and license plate recognition [39, 38, 47, 3]. Scene text recognition has achieved remarkable success due to the development of Convolutional Neural Networks (CNNs).
Many accurate and efficient methods have been proposed for constrained scenarios (e.g., text in scanned copies or web images). Recent works focus on text in natural scenes [28, 29, 7, 31, 41, 48, 45, 46], which is much more challenging due to the high diversity of text in blur, orientation, shape and resolution. A thorough survey of recent advances in text recognition can be found in the literature. Modern text recognizers achieve impressive results on clear text images. However, their performance drops sharply when recognizing low-resolution (LR) text images. The main difficulty in recognizing LR text is that optical degradation blurs the shape of the characters. Therefore, it is promising to introduce super-resolution (SR) methods as a pre-processing step before recognition. To our surprise, no real dataset or corresponding method focuses on scene text SR.
Existing SISR methods generate LR counterparts of the high-resolution (HR) images by simply applying uniform degradation such as bicubic interpolation or blur kernels. Unfortunately, real blurred scene text images exhibit far more varied degradation. Scene texts come in arbitrary shapes, uneven illumination and different backgrounds, so super-resolution on scene text images is much more challenging. Therefore, the proposed TextZoom, which contains paired LR and HR text images of the same text content, is necessary. The TextZoom dataset is cropped from two newly proposed SISR datasets [5, 50]. Our dataset has three main advantages. (1) It is well annotated: we provide the direction, the text content and the original focal length of the text images. (2) It contains abundant text from different natural scenes, including street views, libraries, shops, vehicle interiors and so on. (3) It is carefully divided into three subsets by difficulty. Experiments on TextZoom demonstrate that our TSRN improves the recognition accuracy of CRNN by over 13% compared to training on synthetic SR data. The annotation and allocation strategy is briefly introduced in Section 3 and demonstrated in detail in the supplementary materials.
Moreover, to reconstruct low-resolution text images, we propose a text-oriented end-to-end method. Traditional SISR methods focus only on reconstructing texture details to satisfy human visual perception. However, scene text SR is a special task since the images carry high-level text content: the fore-and-aft characters are informationally related to each other. Obviously, a single blurred character does not prevent a human from recognizing the whole word if the other characters are clear. To solve this task, firstly, we present a Sequential Residual Block to model recurrent information in text lines, which enables us to build a correlation among the fore-and-aft characters. Secondly, we propose a boundary-aware loss, termed gradient profile loss, to reconstruct the sharp boundaries of the characters. This loss helps us distinguish characters from backgrounds and generate more explicit shapes. Thirdly, since misalignment of the paired images is inevitable due to camera inaccuracy, we propose a central alignment module to align the corresponding pixels better. We evaluate recognition accuracy in two steps: (1) apply different SR methods to the LR text images; (2) evaluate the SR text images with trained text recognizers, e.g. ASTER, MORAN and CRNN. Extensive experiments show that our TSRN clearly outperforms 7 state-of-the-art SR methods in boosting the recognition accuracy of LR images in TextZoom. For example, it outperforms LapSRN by over 5% and 8% in recognition accuracy with ASTER and CRNN, respectively. Our results suggest that low-resolution text recognition in the wild is far from being solved, and more research effort is needed.
The contributions of this work are therefore three-fold:
We introduce the first real paired scene text SR dataset, TextZoom, captured with different focal lengths. We annotate and allocate the dataset into three subsets: easy, medium and hard.
We prove the superiority of the proposed dataset TextZoom by comparing and analyzing models trained on synthetic LR images and on our proposed real LR images. We also prove the necessity of scene text SR from different aspects.
We propose a new text super-resolution network with three novel modules. It clearly surpasses 7 representative SR methods when all are trained and tested on TextZoom for fair comparison.
2 Related work
Super-resolution aims to output a plausible high-resolution image that is consistent with a given low-resolution image. Traditional approaches, such as bilinear, bicubic or designed filtering, leverage the insight that neighboring pixels usually exhibit similar colors and generate the output by interpolating between the colors of neighboring pixels according to a predefined formula. In the deep learning era, super-resolution is treated as a regression problem, where the input is the low-resolution image and the target output is the high-resolution image [8, 23, 26, 25, 27, 51, 24]. A deep neural network is trained on input-target pairs to minimize some distance metric between the prediction and the ground truth. These works are mainly trained and evaluated on popular datasets [2, 49, 33, 15, 34, 44], in which LR images are generated by down-sampling interpolation or a Gaussian blur filter. Recently, several works capture LR-HR image pairs by adjusting the focal length of the cameras [5, 50, 6]. In [5, 6], a pre-processing method is applied to reduce the misalignment between the captured LR and HR images, while elsewhere a contextual bilateral loss is proposed to cope with the misalignment.
In this work, a new dataset, TextZoom, is proposed, which fills the absence of a paired scene text SR dataset. It is well annotated and allocated by difficulty. We hope it can serve as a challenging benchmark.
Text Recognition. Early work adopts either a bottom-up fashion, which detects individual characters first and integrates them into a word, or a top-down manner, which treats the word image patch as a whole and recognizes it as a multi-class image classification problem. Considering that scene text generally appears as a character sequence, CRNN regards it as a sequence recognition problem and employs Recurrent Neural Networks (RNNs) to model the sequential features. CTC loss is often combined with the RNN outputs for calculating the conditional probability between the predicted sequences and the target [28, 29]. Recently, an increasing number of recognition approaches based on the attention mechanism have achieved significant improvements [7, 31]. ASTER rectifies oriented or curved text based on a Spatial Transformer Network (STN) and then performs recognition using an attentional sequence-to-sequence model.
Scene Text Image Super-Resolution.
Some previous works on scene text image super-resolution aim at improving recognition accuracy and image quality metrics. One compared the performance of several hand-crafted filters on down-sampled text images. Another proposed a convolution-transposed-convolution architecture for binary document SR. Others adapted SRCNN to text image SR in the ICDAR 2015 competition TextSR and achieved good performance, but no text-oriented method was proposed.
These works take a step towards low-resolution text recognition, but they train only on down-sampled images, learning to regress a simple inverse of bicubic (or bilinear) interpolation. Since all the LR images are generated by one simple down-sampling formulation, the resulting models do not generalize well to real text images.
3 TextZoom Dataset
Data Collection & Annotation. Our proposed dataset TextZoom is derived from two state-of-the-art SISR datasets: RealSR and SR-RAW. These two newly proposed datasets consist of paired LR-HR images captured by digital cameras.
RealSR is captured at four focal lengths with two digital cameras, a Canon 5D3 and a Nikon D810. In RealSR, the images from these four focal lengths are allocated as ground truth, 2X LR, 3X LR and 4X LR images, respectively.
SR-RAW is collected at seven different focal lengths with a Sony FE camera, ranging from 24 mm to 240 mm. The images captured at shorter focal lengths serve as LR images, while those captured at longer focal lengths serve as the corresponding ground truth. For SR-RAW, we annotate the bounding boxes of the words on the 240 mm focal-length images.
For RealSR, we annotate the bounding boxes of the words on the 105 mm focal-length images. We label the images with the largest focal length in each group and crop the text boxes from the rest using the same rectangle, so misalignment is unavoidable. Some annotated text boxes are top-down or vertical; in this task, we rotate all of these images to horizontal for better recognition. There are only a few curved text images in our dataset. For each pair of LR-HR images, we provide the annotation of the case-sensitive character string (including punctuation), the type of the bounding box, and the original focal lengths. We describe the detailed annotation principles for the text images cropped from SR-RAW and RealSR in the supplementary materials.
Selected by height. The sizes of the cropped text boxes are diverse, e.g. heights from 7 to 1700 pixels, so it is not suitable to treat the text images cropped from the same focal length as one domain. We define our selection principle following these considerations. (1) Patch or not. In SISR, data are usually generated by cropping patches from the original images [26, 25, 10, 5, 50]. Text images cannot be cut into patches, since the shapes of the characters should remain complete. (2) Accuracy distribution. We divide the text images by height and test the accuracy (refer to the tables in the supplementary materials). We found that the accuracy does not increase noticeably once the height exceeds 32 pixels; resizing images to a height of 32 pixels is also a customary rule in scene text recognition research [40, 7, 31]. The accuracy on images smaller than 8 pixels is too low to be of any value for super-resolution, so we discard images whose height is less than 8 pixels. (3) Number. Among the cropped text images, those with heights ranging from 8 to 32 pixels constitute the majority. (4) No down-sampling. Since interpolation degradation should not be introduced into real blurred images, we only up-sample the LR images to a relatively larger size.
Following these 4 considerations, we up-sample the images with heights in the 16-32 pixel range to 32 pixels, and those in the 8-16 pixel range to 16 pixels. We conclude that (16, 32) is a good pair to form a 2X training set for the scene text SR task. For example, a text image taken at a 150 mm focal length with a height of 16-32 pixels serves as the ground truth for its 70 mm counterpart. So we select all the images whose heights range from 16 to 32 pixels as our ground-truth images and up-sample them to 128×32 (width×height), and the corresponding 2X LR images to 64×16 (width×height). For this task, we only generate this 2X LR-HR pair dataset from the annotated text images, mainly due to the special characteristics of text recognition. Other scale factors of our annotated images could be used for different purposes.
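The resizing convention above can be sketched as follows. This is a minimal illustration, assuming PIL; `prepare_pair` is a hypothetical helper name, but the target sizes (64×16 LR, 128×32 HR) follow the text:

```python
from PIL import Image

def prepare_pair(lr_img, hr_img):
    """Resize a TextZoom pair to the fixed sizes used above:
    HR to 128x32 and LR to 64x16 (width x height). Only resizing with
    bicubic interpolation is applied; the real LR image is never
    synthetically down-sampled."""
    hr = hr_img.resize((128, 32), Image.BICUBIC)
    lr = lr_img.resize((64, 16), Image.BICUBIC)
    return lr, hr
```

In practice the LR image of a pair is roughly half the size of its HR counterpart, so this step mostly up-samples both to a common shape.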
Allocation of TextZoom. SR-RAW and RealSR are collected by different cameras with different focal lengths. The distance to the objects also affects the legibility of the images. So the dataset should be further divided following their distribution.
The training and test sets are cropped from the original training and test sets of SR-RAW and RealSR, respectively. The authors of SR-RAW used a larger distance from the camera to the subjects to minimize perspective shift. So at similar focal lengths, the accuracy on text images from SR-RAW is relatively lower than on those from RealSR: the accuracy on images cropped from the 100 mm focal length in SR-RAW is 52.1% tested by ASTER, while the accuracy on those from 105 mm in RealSR is 75.0% (refer to the tables in the supplementary materials). At the same height, images from smaller focal lengths are more blurred. With this in mind, we allocate our dataset into three subsets by difficulty. The LR images cropped from RealSR are labeled easy. The LR images from SR-RAW whose focal lengths are larger than 50 mm are labeled medium. The rest are hard.
In this task, our main purpose is to increase the recognition accuracy on the easy, medium and hard subsets. We also report peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) results in the supplementary materials.
Dataset Statistics. The scene text images in TextZoom contain abundant text, including common English words, mixed character strings and digits. The shapes and directions of the bounding boxes are also diverse. Detailed statistics of TextZoom are shown in the supplementary materials, where we list and analyze the types of bounding boxes, the distribution of characters and the types of text content.
4 Methodology
In this section, we present our proposed method, TSRN, in detail. Firstly, we briefly describe our pipeline in Section 4.1. Then we demonstrate the proposed Sequential Residual Block. Thirdly, we introduce our central alignment module. Finally, we introduce a new gradient profile loss to sharpen the text boundaries.
4.1 Pipeline
Our baseline is SRResNet. As shown in Fig. 3, we mainly make two modifications to the structure of SRResNet: 1) adding a central alignment module in front of the network; 2) replacing the original basic blocks with the proposed Sequential Residual Blocks (SRBs). In this work, we concatenate a binary mask with the RGB image as our input; the binary masks are simply generated from the mean gray scale of the image (details in the supplementary materials). During training, the input is first rectified by the central alignment module. Then CNN layers extract shallow features from the rectified image. Stacking five SRBs, we extract deeper, sequence-dependent features and apply a shortcut connection following ResNet. The SR images are finally generated by an up-sampling block and a CNN. We also design a gradient profile loss aimed at enhancing the shape boundaries of the characters. The output of the network is supervised by the MSE loss and our proposed gradient profile loss.
4.2 Sequential Residual Block
Previous state-of-the-art SR methods mainly pursue better PSNR and SSIM. Traditional SISR cares only about texture reconstruction and ignores context information, while text images have strong sequential characteristics. Our ultimate goal is to train an SR network that can reconstruct the context information of text images. In text recognition, the context information of scene text images is encoded by Recurrent Neural Networks (RNNs). Inspired by this, we modify the residual blocks by adding a Bi-directional LSTM (BLSTM) mechanism: we build sequence connections along horizontal text lines and fuse the features into deeper channels. Unlike prior detection work, we build the in-network recurrence architecture not for detection but for low-level reconstruction, so we only adopt the idea of building text-line sequence dependence. The SRB is briefly illustrated in Fig. 3. Firstly, we extract features by a CNN. Then we permute and reshape the feature map so that each horizontal text line can be encoded as a sequence. The BLSTM can propagate error differentials, convert the feature maps into feature sequences, and feed them back to the convolutional layers. To make the sequence dependence robust to tilted text images, we apply BLSTMs in two directions, horizontal and vertical. The BLSTM takes the horizontal and vertical convolutional features as sequential inputs and updates its internal state recurrently in the hidden layer.
$H_t = \phi\left(X_t,\; H_{t-1}^{h},\; H_{t-1}^{v}\right)$, where $H_t$ denotes the hidden layer, $X_t$ denotes the input features, and $H_{t-1}^{h}$ and $H_{t-1}^{v}$ separately denote the recurrent connections from the horizontal and vertical directions.
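As a concrete illustration, a minimal PyTorch sketch of an SRB might look like the following. This is our own reading of the description, not the released implementation: the channel width (64) and hidden-unit count (32) follow the ablation study, while the 1×1 fusion convolution is an assumption.

```python
import torch
import torch.nn as nn

class SequentialResidualBlock(nn.Module):
    """Sketch of an SRB: two conv layers plus BLSTMs that model sequence
    dependence along horizontal text lines, with a second vertical pass
    for robustness to tilted text (hypothetical implementation)."""

    def __init__(self, channels=64, hidden=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, 1, 1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.prelu = nn.PReLU()
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1)
        self.bn2 = nn.BatchNorm2d(channels)
        # bidirectional LSTMs over the horizontal and vertical directions
        self.blstm_h = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)
        self.blstm_v = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)
        # fuse the two 2*hidden-channel outputs back to `channels` maps
        self.fuse = nn.Conv2d(4 * hidden, channels, 1)

    def _run_blstm(self, x, lstm):
        # x: (B, C, H, W) -> each of the B*H rows becomes a W-step sequence
        b, c, h, w = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        out, _ = lstm(seq)                               # (B*H, W, 2*hidden)
        return out.reshape(b, h, w, -1).permute(0, 3, 1, 2)

    def forward(self, x):
        res = self.bn2(self.conv2(self.prelu(self.bn1(self.conv1(x)))))
        hor = self._run_blstm(res, self.blstm_h)
        ver = self._run_blstm(res.transpose(2, 3), self.blstm_v).transpose(2, 3)
        out = self.fuse(torch.cat([hor, ver], dim=1))
        return x + out                                   # residual shortcut
```

Stacking five such blocks with a final shortcut connection gives the trunk described in the pipeline.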
4.3 Central Alignment Module
The misalignment makes pixel-to-pixel losses, such as the L1 and MSE losses, generate significant artifacts and double shadows. This is mainly due to the misalignment of the pixels in the training data: since some of the text pixels in LR images spatially correspond to background pixels in the HR images, the network may learn wrong pixel-wise correspondences. As mentioned in Section 3, the text regions in HR images are more centrally aligned than those in the LR images. So we introduce an STN as our central alignment module. The STN is a spatial transformer network that can rectify images and be learned end-to-end. Since most of the misalignment of the text regions is merely a horizontal or vertical translation, we adopt an affine transformation as the transform manipulation. Once the text regions in LR images are aligned near the center, the pixel-wise losses perform better and the artifacts are relieved. We show more detailed information about the central alignment module in the supplementary materials.
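A minimal affine-STN sketch of such a module follows. This is a hypothetical implementation: the localization-network layout is an assumption, and the affine parameters are initialized to the identity so the module starts with no shift.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralAlignmentModule(nn.Module):
    """Sketch of a central alignment module: a small localization network
    predicts 6 affine parameters, and the input is warped accordingly.
    Identity initialization means the module is a no-op before training."""

    def __init__(self, in_channels=4):  # RGB + binary mask
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 6),
        )
        # start from the identity transform (no translation, no scaling)
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

Because the transform is learned end-to-end from the SR losses, the module only needs to discover small horizontal or vertical shifts, which affine parameters can express directly.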
4.4 Gradient Profile Loss
Gradient Profile Prior (GPP) was proposed to generate sharper edges in the SISR task via a transformation on the gradient field: the curve of each gradient profile is squeezed by a ratio, transforming the image into a sharper version. This method predates the deep learning era, so it merely sharpens the gradient-field curve without supervision.
Since we have a paired text super-resolution dataset, we can use the gradient field of HR images as the ground truth. In the Total Variation loss, the gradient field is used to remove noise by minimizing the magnitude of the image gradients. This makes the texture smoother in scene images, so it can be used in SISR. Our gradient profile loss serves the opposite function of the TV loss: it is not a smoothness constraint.
Generally, text images contain merely two colors: characters and backgrounds. This means there is no complex texture in text images; what we should care about is only the boundaries between characters and backgrounds. So better image quality means sharper, rather than smoother, character boundaries. Sometimes the gradient field does not exactly trace the boundary between background and characters when the background is not a pure color, but most cases satisfy our purpose and are useful for training.
We revisit the GPP and generate the ground truth from HR images; the loss function is defined as:

$\mathcal{L}_{GP} = \mathbb{E}_{x} \left\| \nabla I_{HR}(x) - \nabla I_{SR}(x) \right\|_{1},$

where $\nabla I_{HR}$ denotes the gradient field of HR images and $\nabla I_{SR}$ denotes that of SR images.
Our proposed gradient profile loss exhibits two advantageous properties. (1) The gradient field vividly shows the characteristic structure of text images: texts and backgrounds. (2) LR images always come with a wider gradient-field curve, while HR images have a thinner one, and the gradient field can be easily computed mathematically. This ensures a reliable supervision label.
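Under a simple finite-difference definition of the gradient field, the loss above can be sketched as follows. This is a hedged reading of the formulation; the released code may compute the gradient field differently.

```python
import torch

def gradient_field(img):
    """Finite-difference gradient field of a batch of images (B, C, H, W)."""
    dx = img[..., :, 1:] - img[..., :, :-1]   # horizontal differences
    dy = img[..., 1:, :] - img[..., :-1, :]   # vertical differences
    return dx, dy

def gradient_profile_loss(sr, hr):
    """L1 distance between the gradient fields of the SR output and the
    HR ground truth: sharper SR boundaries reduce this loss, while a
    smoothness term (like TV loss) would do the opposite."""
    sr_dx, sr_dy = gradient_field(sr)
    hr_dx, hr_dy = gradient_field(hr)
    return (sr_dx - hr_dx).abs().mean() + (sr_dy - hr_dy).abs().mean()
```

In training this term would be added to the MSE loss with a trade-off weight, as described in the implementation details.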
5 Experiments
5.1 Datasets
We train the SR methods on the training set of our proposed TextZoom (see Section 3) and evaluate the models on the three subsets easy, medium and hard. To avoid down-sampling degradation, all the LR images are up-sampled to 64×16 and the HR images to 128×32.
In this work, we do not evaluate our models on popular datasets like ICDAR 2015, ICDAR 2013, etc., mainly because of their different distribution: the small text images in those datasets are small due to interpolation down-sampling rather than capturing distance or focal length. We explain this reason in detail in the supplementary materials.
5.2 Implementation Details
During training, we set the trade-off weights of the two loss terms both to 1. We use the Adam optimizer with momentum term 0.9. When evaluating recognition accuracy, we use the official PyTorch code of ASTER and the released model from the GitHub link111https://github.com/ayumiymk/aster.pytorch. In the supplementary materials, we use the official PyTorch code and released models of CRNN222https://github.com/meijieru/crnn.pytorch and MORAN333https://github.com/Canjie-Luo/MORAN$_$v2. All the SR models are trained for 500 epochs on 4 NVIDIA GTX 1080Ti GPUs. The batch sizes follow the settings in the original papers.
5.3 Synthetic LR vs. TextZoom LR
To demonstrate the superiority of real paired scene text SR images, we compare the performance of models trained on a synthetic dataset and on our TextZoom dataset. Traditional SISR tasks simply down-sample the HR image by bicubic interpolation to generate the corresponding LR image, so we train our model both on bicubic down-sampled LR images and on real LR images and compare the results.
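For reference, the conventional synthetic-LR protocol we compare against is just bicubic down-sampling of the HR image. A minimal sketch, assuming PIL; `synthetic_lr` is a hypothetical helper name:

```python
from PIL import Image

def synthetic_lr(hr_img, scale=2):
    """Generate a synthetic LR image by bicubic down-sampling, the
    conventional SISR protocol that TextZoom's real LR images replace."""
    w, h = hr_img.size
    return hr_img.resize((w // scale, h // scale), Image.BICUBIC)
```

Every synthetic LR image produced this way shares one fixed degradation model, which is exactly why models trained on it transfer poorly to real blur.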
Table 2. Method | Training data | Accuracy of ASTER | Accuracy of MORAN | Accuracy of CRNN
We select SRResNet, LapSRN and our proposed TSRN, and train each on the synthetic LR and real LR datasets for a 2X model. We train 6 models in all and evaluate them on the TextZoom subsets. From Table 2, the three methods trained on real LR (TextZoom) data clearly outperform the models trained on synthetic LR in accuracy. For our TSRN, the model trained on real LR surpasses the synthetic LR model by nearly 9.0% with ASTER and MORAN, and by nearly 14.0% with CRNN.
5.4 Ablation Study on TSRN
In order to study the effect of each component in TSRN, we gradually modify the configuration of our network and compare the differences to build the best network. For brevity, we only report the accuracy of ASTER.
Table 3 (excerpt). Configuration | Accuracy of ASTER (easy / medium / hard / average)
2 | 5 SRBs + align | 74.8% | 55.7% | 39.6% | 57.8%
3 | 5 SRBs + align + gradient profile loss (Ours) | 75.1% | 56.3% | 40.1% | 58.3%
1) SRBs. We add the BLSTM mechanism to the basic residual block of SRResNet and obtain the proposed SRB, the essential component of TSRN. Comparing rows 0 and 1 in Table 3, stacking 5 SRBs boosts the average accuracy by 4.9% compared to SRResNet. There are many partially blurred text images: they look legible to us when combined with the fore-and-aft characters, but they are indeed unrecognizable taken apart. The SRBs can learn the sequential similarity in the text lines and construct better text shapes. In the supplementary materials, we further discuss the SRB and our 4-channel input. We stack different numbers of SRBs and vary the number of hidden units, and find that with 5 SRBs, 32 hidden units, and binary masks concatenated with the RGB channels, the network saturates and achieves the best performance.
2) Central Alignment Module. The central alignment module boosts the average accuracy by 1.5%, as shown in Table 3, method 2. From Fig. 5, we can see that without the central alignment module, the artifacts are strong and the characters are twisted, while with better alignment we generate higher-quality images, since the pixel-wise loss functions can supervise the training better. In the supplementary materials, we demonstrate the generality of the central alignment module by plugging it into SRResNet, LapSRN and TSRN, and show that it boosts the accuracy of all three SR methods.
3) Gradient Profile Loss. From Table 3, method 3, the proposed gradient profile loss boosts the average accuracy by 0.5%. Although the increase is slight, the visual results are better (Fig. 5, method 3). With this loss, some twisted characters, such as 'e', 's' and 'f', become more explicit, and the boundaries between characters can be distinguished (see the words 'naturelles', 'supervisor' and 'While' in Fig. 5).
5.5 Comparison with State-of-the-Art
To prove the effectiveness of TSRN, we compare it with 7 SISR methods on our TextZoom dataset, including SRCNN , VDSR , SRResNet , RRDB , EDSR , RDN  and LapSRN . All of the networks are trained on our TextZoom training set and evaluated on our three testing subsets.
In Table 4, we list the recognition accuracy tested by ASTER, MORAN and CRNN for all 7 mentioned methods, along with BICUBIC and the proposed TSRN. It can be observed that TSRN outperforms all 7 SISR methods in recognition accuracy by a large margin. Although these 7 SISR methods achieve relatively good accuracy, what we should pay attention to is the gap between the SR results and BICUBIC: these methods improve the average accuracy by 2.3%-5.8%, while ours improves it by 10.7%-14.6%. Our TSRN also improves the accuracy on all three state-of-the-art recognizers. In the supplementary materials, we show that our TSRN also surpasses most of the state-of-the-art methods in PSNR and SSIM.
Table 4 (excerpt). Method | Loss Function | Accuracy of ASTER | Accuracy of MORAN | Accuracy of CRNN
Improvement of TSRN (easy / medium / hard / average): ASTER 10.4% / 13.9% / 8.9% / 11.1%; MORAN 9.5% / 15.4% / 7.1% / 10.7%; CRNN 16.1% / 17.1% / 10.3% / 14.6%
6 Conclusion and Discussion
In this work, we verify the importance of the scene text image super-resolution task. We proposed the TextZoom dataset, which is, to the best of our knowledge, the first real paired scene text image super-resolution dataset. TextZoom is well annotated and divided into three subsets: easy, medium and hard. Through extensive experiments, we demonstrated the superiority of real data over synthetic data. To tackle the text image super-resolution task, we built a new text-oriented SR method with three novel modules. Our TSRN clearly outperforms 7 SR methods by a large margin. The results also show that low-resolution text SR and recognition are far from solved, and more research effort is needed.
In the future, we will capture text images with a more appropriate distribution, avoiding extremely large and small images. The images should also cover more languages, such as Chinese, Korean and Japanese. We will also explore new methods, such as introducing recognition attention into the text super-resolution task.
Appendix 0.A Is Scene Text Image Super-Resolution Necessary?
0.a.1 Training Recognizer on Low-Resolution Images.
One might assume that better performance on recognizing low-resolution (LR) text images could be achieved by directly training the recognition networks on small images, so that the super-resolution step could be removed. This assumption is reasonable because deep neural networks are strongly robust on their training domains. To refute it and prove the necessity of super-resolution for text images, we compare the recognition accuracy of three methods:
TSRN (ours) + ASTER (released) | 75.1% | 56.3% | 40.1% | 58.3%
From Table 5, we can see that the re-implemented model does increase the accuracy on LR images sharply: the average accuracy on BICUBIC inputs rises by 5.4%, from 47.2% to 52.6%. (Note that our re-implemented model also boosts the performance on other scene text recognition datasets, so part of the 5.4% increase is the result of a better-converged model.) But it is still much lower than the accuracy of our SR results (TSRN (ours) + ASTER (released)). So SR methods can serve as an effective and convenient pre-processing procedure for scene text recognition.
0.a.2 Speed & Accuracy.
In this task, we take recognition accuracy as the most important evaluation metric. To figure out whether it is wise to increase accuracy at the cost of the extra computation of TSRN, we compare the number of parameters, FLOPs and inference FPS with and without super-resolution, where inference FPS means the FPS of recognizing the text images with or without SR. From Table 6, the proposed method is relatively tiny compared to the recognition networks. The FPS with TSRN is nearly equal to direct recognition for the attention-based recognizers ASTER and MORAN. The FPS of the CTC-based recognizer CRNN decreases when adding TSRN, but the accuracy improvement is considerable. So taking super-resolution as a pre-processing step before recognition is a suitable choice. (All FPS values were tested on a single GTX 1080Ti GPU with the same batch size of 50.)
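A rough sketch of how such an FPS number can be measured is shown below. This is a hypothetical helper, not the authors' benchmarking code; note that CUDA synchronization matters when timing on a GPU.

```python
import time
import torch

@torch.no_grad()
def inference_fps(model, batch, n_iters=20):
    """Images per second for `model` on a fixed batch: warm up,
    then time n_iters forward passes, synchronizing CUDA if needed."""
    model.eval()
    for _ in range(3):          # warm-up iterations
        model(batch)
    if batch.is_cuda:
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        model(batch)
    if batch.is_cuda:
        torch.cuda.synchronize()
    return n_iters * batch.size(0) / (time.time() - start)
```

Timing the recognizer alone versus the SR network plus recognizer on the same batch gives the "with / without TSRN" comparison reported in Table 6.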
Table 6. Computation cost analysis (accuracy with TSRN (ours); FLOPs and parameters given as recognizer + TSRN).
Recognizer | Average Accuracy | FLOPs | Parameters | Inference FPS
ASTER | 58.3% (+10.1%) | 4.72G + 0.72G | 20.99M + 2.8M | 21.67
MORAN | 54.8% (+10.7%) | 0.73G + 0.72G | 20.3M + 2.8M | 59.6
CRNN | 41.4% (+14.6%) | 0.64G + 0.72G | 8.3M + 2.8M | 340.6
Appendix 0.B Extensive Experiments on Our Method
0.b.1 Binary Mask.
In text images, the characters are usually of a uniform color, and the only texture information is the character color and the background color. We therefore concatenate a binary mask with the text image as the input (Fig. 7): character regions are set to 1 and background regions to 0. This input can be viewed as a prior semantic segmentation label for text images, since most text images contain only two colors, the text color and the background color. The masks are simply generated from the average gray scale of the RGB image.
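The mask generation described above amounts to thresholding at the mean intensity. A NumPy sketch follows; note that which side of the threshold is "character" depends on text polarity (dark-on-light vs. light-on-dark), so the labeling here is an assumption:

```python
import numpy as np

def binary_mask(rgb):
    """Binary mask from the mean gray scale of an (H, W, 3) RGB array:
    pixels brighter than the image's mean intensity form one region,
    the rest the other. Returns a float32 (H, W) array of 0s and 1s."""
    gray = rgb.mean(axis=-1)                 # average over RGB channels
    return (gray > gray.mean()).astype(np.float32)
```

The mask is then stacked with the RGB channels to form the 4-channel input mentioned in the ablation.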
Ablation Study of Masks
0.b.2 Discussion about SRB.
To build the best architecture of the SRB, we gradually modify two essential configurations: the number of hidden units and the number of blocks. Our method uses 5 SRBs with 32 hidden units each. In this section, we ablate these two components separately.
1) Hidden Units. The BLSTMs are used to build sequence dependence along text lines, so we hypothesized that more hidden units would give better performance. In the experiments, we compare 0, 16, 32, 64 and 128 hidden units, where 0 hidden units corresponds to SRResNet. The results demonstrate that the network achieves the best accuracy with 32 hidden units (Table 9); more hidden units yield lower performance, since the sequence dependence is already well established.
Ablation Study of Hidden Units
Ablation Study of SRBs
2) Block Number. To figure out whether we can achieve better performance with a deeper network, we stack different numbers of SRBs and compare the performance. In Table 9, we compare our method with 4, 5, 6 and 7 SRBs. More SRBs do not necessarily boost the performance; the accuracy with 7 SRBs even decreases noticeably. With 5 SRBs, the network saturates and achieves the best performance.
The configuration of our Sequential Residual Block is shown in Table 10.
Convolution | #maps: 64, k: 3×3, s: 1, p: 1
Convolution | #maps: 64, k: 3×3, s: 1, p: 1
Convolution | #maps: 64, k: 1×1, s: 1, p: 0
Shortcut Connection
0.b.3 PSNR and SSIM.
To calculate PSNR [dB] and SSIM, we borrow the code from 444https://github.com/open-mmlab/mmsr. From Table 11, our PSNR on the medium and hard subsets is not as good, because PSNR is computed pixel-to-pixel, while SSIM is computed with an 11×11 sliding kernel. The central alignment module introduces a slight pixel shift, so our PSNR is somewhat lower than that of other SR methods. PSNR and SSIM often fail to represent the visual quality of images; in this task, they are also far less important than accuracy.
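For completeness, PSNR itself is a simple function of the mean squared error, which explains its sensitivity to the pixel shift mentioned above. A NumPy sketch consistent with the standard definition:

```python
import numpy as np

def psnr(img1, img2, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two image arrays.
    A one-pixel spatial shift changes every per-pixel difference,
    which is why alignment errors depress PSNR so strongly."""
    mse = np.mean((img1.astype(np.float64) - img2.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

SSIM, by contrast, pools statistics over a local window (11×11 here), so it tolerates small shifts better.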
Appendix 0.C Central Alignment Module
Our central alignment module is based on the Spatial Transformer Network. The network predicts a set of control points, and the image is then rectified by a Thin-Plate-Spline (TPS) transformation. Our central alignment module mainly applies horizontal or vertical shifts, but sometimes the background regions need different transformation scales to place the character region more centrally. So we use the TPS transformation to keep the transformation flexible. As shown in Fig. 8, the transformation differs between different points.
0.c.1 Performance on Manually Enlarged Misalignment.
The ablation study shows that the central alignment module improves the average accuracy by less than 2.0%. It can, however, perform better on more severely misaligned image pairs. To verify this, we augment the data to generate more misaligned pairs: a box with 90% of the width and 90% of the height of the original image slides randomly over the LR image, cropping a 90%×90% region, while the HR images are left uncropped. We train on the cropped dataset and evaluate on TextZoom. Table 12 shows the performance of the central alignment module on this manually misaligned data: the accuracy improves sharply.
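The sliding-crop augmentation described above can be sketched as follows (function and parameter names are ours, for illustration only):

```python
# Sketch of the misalignment augmentation: a box of 90% width/height is
# placed at a random position on the LR image; the HR image is untouched.
import random

def random_crop_box(w, h, ratio=0.9, rng=random):
    """Return (left, top, right, bottom) of a randomly placed ratio-sized box."""
    cw, ch = int(w * ratio), int(h * ratio)
    left = rng.randint(0, w - cw)
    top = rng.randint(0, h - ch)
    return left, top, left + cw, top + ch

rng = random.Random(0)
box = random_crop_box(64, 16, rng=rng)  # e.g. a 64x16 LR text image
l, t, r, b = box
assert r - l == 57 and b - t == 14      # 90% of 64x16, rounded down
print(box)
```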
We show visualization results in Fig. 9. The third row shows SR images trained without alignment; the double shadows and artifacts are severe when training without the central alignment module. Even so, many words are still correctly recognized despite the strong double shadows.
0.c.2 Plugged into Other SR Methods.
In this study, we compare performance with and without the central alignment module on our dataset (Table 13). We report six models: SRResNet, LapSRN and ours, each with and without the central alignment module. The improvements on all of these methods illustrate that it is a conveniently pluggable module for SR networks.
0.c.3 Comparison with CoBi Loss.
The CoBi loss modifies the nearest-neighbor search and considers local contextual similarities with weighted spatial awareness. It uses pre-trained VGG-19 features, selecting several conv layers as deep features. Its formulation is shown in Eqns. 3, 4 and 5, and the results are shown in Table 14. It is less practical for this task because the pre-trained model is trained on a classification dataset.
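The core idea of CoBi, matching each source feature to its nearest target under a combined feature-plus-spatial cost, can be sketched with toy values (this is a simplified illustration of the formulation referenced above, not the VGG-19-based implementation; the distance and weight choices here are ours):

```python
# Simplified sketch of the CoBi matching idea: the per-pair cost combines
# a feature distance with a weighted spatial distance, and each source
# feature takes its cheapest match in the target set.

def cobi_loss(src, tgt, ws=0.5):
    """src/tgt: lists of (feature_value, x, y); returns mean nearest cost."""
    total = 0.0
    for f, x, y in src:
        costs = [abs(f - g) + ws * ((x - u) ** 2 + (y - v) ** 2) ** 0.5
                 for g, u, v in tgt]
        total += min(costs)
    return total / len(src)

src = [(1.0, 0, 0), (2.0, 1, 0)]
tgt = [(1.0, 0, 0), (2.0, 2, 0)]  # second feature is spatially shifted
print(cobi_loss(src, tgt))  # -> 0.25
```

The spatial term lets a slightly shifted but visually matching feature still be the preferred match, which is why CoBi tolerates misaligned training pairs.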
Appendix 0.D Detailed Information of TextZoom.
0.d.1 Annotation of SR-RAW and RealSR.
SR-RAW  is collected with a SONY FE camera at seven different focal lengths ranging from 24mm to 240mm; we illustrate this in Fig. 10. There are 500 images in total in the SR-RAW dataset, with 450 in the train set and 50 in the test set. The images are aligned via field-of-view (FOV) matching and a geometric transformation. Images captured at shorter focal lengths serve as LR images, while those captured at longer focal lengths serve as the corresponding ground truth. The authors of SR-RAW  applied a down-sampling operation to compensate when the zoom ratio does not match precisely: for example, when using (35mm, 150mm) pairs to train a 4X model, the 150mm images are first down-sampled to a 140mm equivalent. We follow this strategy in our dataset pre-processing. We annotate all the images taken at the 240mm focal length that contain recognizable text in the SR-RAW dataset. As shown in Fig. 10, the focal length decreases from left to right, from 240mm to 24mm; the smaller the focal length, the larger the field of view, and hence the smaller the text appears. The annotated text images have the same text content but different resolutions. We display three groups in Fig. 10: ‘STAR’, ‘QUEST’ and ‘510-401-4657’. In this figure, the text images cropped from the 35mm and 24mm captures are hardly recognizable. How many clear images a group of 7 contains mainly depends on the height of the original box in the 240mm image.
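The down-sampling offset described above is simple arithmetic; a hedged helper (ours, not the SR-RAW code) makes the (35mm, 150mm) example explicit:

```python
# Sketch of the zoom-ratio offset: to train a scale-X model on an
# (lr_focal, hr_focal) pair, the HR image is first down-sampled so that
# it behaves like a capture at lr_focal * X.

def hr_resize_factor(lr_focal, hr_focal, scale):
    """Factor by which the HR image is down-sampled before pairing."""
    target = lr_focal * scale  # effective focal length needed
    assert target <= hr_focal, "HR focal length too short for this scale"
    return target / hr_focal

# 4X model on a (35mm, 150mm) pair: the 150mm image is shrunk to a
# 140mm equivalent before being used as ground truth.
f = hr_resize_factor(35, 150, 4)
print(round(f, 4))  # -> 0.9333
```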
In Table 15, we show the information of the cropped text images in SR-RAW. Some groups of original images do not have a 7th image, so the count for 24mm is smaller than the others. The table shows that the recognition accuracy decreases markedly as the resolution degrades. We use the released ASTER  model to test the accuracy.
|Text Images in Train Set of SR-RAW (columns ordered by decreasing focal length, 240mm to 24mm)|
|Original Image Number||393||393||393||393||393||393||365|
|Text Box Number||9160||9160||9160||9160||9160||9160||8119|
|Text Images in Test Set of SR-RAW (columns ordered by decreasing focal length, 240mm to 24mm)|
|Original Image Number||50||50||50||50||50||50||48|
|Text Box Number||1734||1734||1734||1734||1734||1734||1630|
RealSR  is captured by two Digital Single-Lens Reflex (DSLR) cameras, a Canon 5D3 and a Nikon D810, at four focal lengths: 105mm, 50mm, 35mm and 28mm. In RealSR , images taken at the 105mm focal length are used to generate HR images, while images taken at 50mm, 35mm and 28mm are used to generate the 2X, 3X and 4X LR images, respectively. For convenience, we only crop the 105mm, 50mm and 28mm images. The non-horizontal text images are rotated to the most suitable angle for recognition (see Fig. 11).
In Table 11, we briefly show the statistics of text images in RealSR.
|Text Images in RealSR (105mm / 50mm / 28mm)|
|Text Box Number||6048||6048||6048|
Align. In RealSR, the authors align the image pairs with a pixel-wise registration algorithm that takes luminance difference into consideration. In SR-RAW , the Euclidean motion model is used as the pre-processing procedure, and during training a contextual bilateral loss is proposed to handle the misalignment; however, it needs a pre-trained model and brings high computation cost. We adopted their proposed pre-processing method to align the original images and cropped our dataset following our annotation principle. During training, we used the central alignment module instead.
0.d.2 Accuracy by Height.
The sizes of the cropped text boxes are diverse. We observe that, at similar focal lengths, the accuracy on text images in RealSR is much higher than on those in SR-RAW (Tables 11 and 15), mainly because the SR-RAW images are taken from a longer distance. It is therefore suitable to allocate the images cropped from RealSR to the easy subset.
We divided the cropped images by height and found that accuracy is relatively good once the height reaches 16-32 pixels, as shown in Table 17. Images sized 16-32 and 8-16 pixels account for the majority of all groups. The accuracy on images shorter than 8 pixels is too low; they are hardly recognizable and of little value for restoration, so we discard images whose height is less than 8 pixels. The (8-16, 16-32) pair forms a good 2X train set for the STR super-resolution task: for example, a text image taken at the 150mm focal length with a height of 16-32 pixels serves as the ground truth for its 70mm counterpart. We therefore select all images whose height ranges from 16 to 32 pixels as ground-truth images and resize them to 128×32 (width×height), and the corresponding 2X LR images to 64×16 (width×height).
|Recognition Accuracy of Images at Different Heights|
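The allocation rule above can be sketched as a small helper (the bucket names and function are ours, for illustration only):

```python
# Sketch of the height-based allocation: boxes under 8 px are discarded,
# heights in [8, 16) become LR candidates (resized to 64x16), and heights
# in [16, 32) become 2X ground truth (resized to 128x32).

def allocate(height):
    if height < 8:
        return None                  # hardly recognizable: discarded
    if height < 16:
        return ("LR", (64, 16))      # resized to 64x16 (width x height)
    if height < 32:
        return ("HR", (128, 32))     # ground truth, resized to 128x32
    return None                      # outside the 2X pairing range in this sketch

print(allocate(6), allocate(12), allocate(20))
# -> None ('LR', (64, 16)) ('HR', (128, 32))
```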
0.d.3 Statistical information.
We display some useful statistical information in Fig. 12. (a) Our dataset contains abundant letters and digits, including some punctuation. (b) Most words are 1-8 characters long. (c) There are many randomly placed boxes and books in the original images, so we count the direction types of the bounding boxes we annotated. ‘Horizontal’ means the text image is horizontally placed and easy to read. ‘Vertical(+)’ denotes a vertical text image that should be rotated 90 degrees clockwise, while ‘Vertical(-)’ denotes one that should be rotated 90 degrees anti-clockwise. ‘Top-down’ denotes a text image that should be rotated 180 degrees for best recognition. ‘Curve’ denotes a curved text image. ‘Ignored’ means the text is invalid (not digits, English letters or punctuation). (d) Using the generic lexicon of 90k common words from ICDAR2015, we find that 57.5% of the text contents are common English words. ‘Plate’ includes car license plates, door number plates and street signs, which are combinations of digits, punctuation and letters; this kind of text accounts for 12% because the original images contain many street views. Uncommon words account for 18.2% of all texts, mainly rare words, phrases or compound words. Other meaningless strings such as punctuation, single letters and digits account for the rest.
0.d.4 Task Analysis.
Our dataset is challenging mainly for two reasons: misalignment and ambiguity. Misalignment is unavoidable during data capture as the lens zooms in and out; any slight camera movement can cause a shift of tens of pixels, especially at short focal lengths, and the pre-processing procedure cannot totally eliminate it. We display some example images in Fig. 13.
From Fig. 13, we can see that the misalignment varies, and no specific pattern can be found since we do not have pixel-level annotation of word locations. The three subsets are allocated appropriately by difficulty; the misalignment and ambiguity become more severe as the difficulty increases. Note that the characters in HR images tend to sit closer to the center than those in LR images, mainly because when annotating the HR images we deliberately kept the text boxes at the centre of the images.
-  (2019) What is wrong with scene text recognition model comparisons? dataset and model analysis. arXiv preprint arXiv:1904.01906. Cited by: §1.
-  (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, Cited by: §2.
-  (2019) Robust license plate recognition using neural networks trained on synthetic images. Pattern Recognition. Cited by: §1.
-  (1989) Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Appendix 0.C.
-  (2019) Toward real-world single image super-resolution: a new benchmark and a new model. In ICCV, Cited by: §0.D.1, §1, §2, §3, §3, §3.
-  (2019) Camera lens super-resolution. In CVPR, Cited by: §2.
-  (2017) Focusing attention: towards accurate text recognition in natural images. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: §1, §2, §3.
-  (2015) Image super-resolution using deep convolutional networks. TPAMI. Cited by: Table 11, §1, §2, §2, §5.5, Table 4.
-  (2015) Boosting optical character recognition: a super-resolution approach. arXiv preprint arXiv:1506.02211. Cited by: §2.
-  V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.) (2018) ECCV. Cited by: §3.
-  (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. Int. Conf. Mach. Learn., Cited by: §2.
-  (2016) Synthetic data for text localisation in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: 3rd item, Table 5.
-  (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
-  (2016) Reading scene text in deep convolutional sequences. In AAAI, Cited by: §4.2.
-  (2015) Single image super-resolution from transformed self-exemplars. In CVPR, Cited by: §2.
-  (2014) Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227. Cited by: 3rd item, Table 5.
-  (2016) Reading text in the wild with convolutional neural networks. IJCV. Cited by: §2.
-  (2015) Spatial transformer networks. In Proc. Adv. Neural Inf. Process. Syst., Cited by: Appendix 0.C, §2, §4.3.
-  (2014) Deep features for text spotting. In Proc. Eur. Conf. Comp. Vis., Cited by: §2.
-  (2016) Perceptual losses for real-time style transfer and super-resolution. In Proc. Eur. Conf. Comp. Vis., Cited by: Table 4.
-  (2015) ICDAR 2015 competition on robust reading. In Proc. IEEE Int. Conf. Doc. Anal. and Recogn., Cited by: §0.D.3, §5.1.
-  (2013) ICDAR 2013 robust reading competition. In Proc. IEEE Int. Conf. Doc. Anal. and Recogn., Cited by: §5.1.
-  (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, Cited by: Table 11, §1, §2, §5.5, Table 4.
-  (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, Cited by: Table 11, §1, §2, §5.3, §5.4, §5.5, Table 4.
-  L. Leal-Taixé and S. Roth (Eds.) (2019) Computer Vision - ECCV 2018 Workshops, Munich, Germany, September 8-14, 2018, Proceedings, Part V. Lecture Notes in Computer Science. Cited by: Table 11, §1, §2, §3, §5.5, Table 4.
-  (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §0.B.3, Table 11, §1, §2, §3, §4.1, §4.2, §4.4, §5.3, §5.4, §5.4, §5.5, Table 4.
-  (2017) Enhanced deep residual networks for single image super-resolution. In CVPR, Cited by: Table 11, §1, §2, §5.5, Table 4.
-  (2016) STAR-net: a spatial attention residue network for scene text recognition.. In Proc. Brit. Mach. Vis. Conf., Cited by: §1, §2.
-  (2018) Squeezedtext: a real-time scene text recognition by binary convolutional encoder-decoder network. In Proc. AAAI Conf. on Arti. Intel., Cited by: §1, §2.
-  (2018) Scene text detection and recognition: the deep learning era. arXiv preprint arXiv:1811.04256. Cited by: §1.
-  (2019) Moran: a multi-object rectified attention network for scene text recognition. Pattern Recognition. Cited by: §0.A.2, Table 6, Figure 2, §1, §2, §2, §3, §5.5, Table 2, Table 4.
-  (2007) An introduction to super-resolution text. In Digital document processing, Cited by: §2.
-  (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, Cited by: §2.
-  (2017) Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications. Cited by: §2.
-  (2018) The contextual loss for image transformation with non-aligned data. In ECCV, Cited by: §0.C.3.
-  (2018) Binary document image super resolution for improved readability and ocr performance. arXiv preprint arXiv:1812.02475. Cited by: §2.
-  (2015) ICDAR2015 competition on text image super-resolution. In ICDAR, Cited by: §2.
-  (2019) An end-to-end trainable framework for joint optimization of document enhancement and recognition. In ICDAR,, Cited by: §1.
-  (2019) A set of benchmarks for handwritten text recognition on historical documents. Pattern Recognition. Cited by: §1.
-  (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Table 6, Figure 2, §2, §2, §3, §4.2, §5.5, Table 2, Table 4.
-  (2018) Aster: an attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: 1st item, 2nd item, 3rd item, §0.A.2, Table 6, Figure 9, §0.D.1, Figure 2, Table 1, §1, §2, §2, §3, Figure 6, §5.4, §5.5, Table 2, Table 3, Table 4.
-  (2011) Gradient profile prior and its applications in image super-resolution and enhancement. TIP. Cited by: §4.4.
-  (2016) Detecting text in natural image with connectionist text proposal network. In ECCV, Cited by: §4.2.
-  (2017) Ntire 2017 challenge on single image super-resolution: methods and results. In CVPRW, Cited by: §2.
-  (2019) Shape robust text detection with progressive scale expansion network. In CVPR, Cited by: §1.
-  (2019) Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. Cited by: §1.
-  (2017) Improving handwritten chinese text recognition using neural network language models and convolutional neural network shape models. Pattern Recognition. Cited by: §1.
-  (2019) Scene text detection with supervised pyramid context network. In AAAI, Cited by: §1.
-  (2010) On single image scale-up using sparse-representations. In International conference on curves and surfaces, Cited by: §2.
-  (2019) Zoom to learn, learn to zoom. In CVPR, Cited by: §0.C.3, §0.D.1, §0.D.1, §1, §2, §3, §3, §3.
-  (2018) Residual dense network for image super-resolution. In CVPR, Cited by: Table 11, §1, §2, §5.5, Table 4.