Scene Text Image Super-Resolution in the Wild

05/07/2020, by Wenjia Wang, et al.

Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones. Recognizing low-resolution text images is challenging because they lose detailed content information, leading to poor recognition accuracy. An intuitive solution is to introduce super-resolution (SR) techniques as pre-processing. However, previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images (e.g., bicubic down-sampling), which is simple and not suitable for real low-resolution text recognition. To this end, we propose a real scene text SR dataset, termed TextZoom. It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild. It is more authentic and challenging than synthetic data, as shown in Fig. 1. We argue that improving the recognition accuracy is the ultimate goal for scene text SR. For this purpose, a new Text Super-Resolution Network, termed TSRN, with three novel modules is developed. (1) A sequential residual block is proposed to extract the sequential information of the text images. (2) A boundary-aware loss is designed to sharpen the character boundaries. (3) A central alignment module is proposed to relieve the misalignment problem in TextZoom. Extensive experiments on TextZoom demonstrate that our TSRN largely improves the recognition accuracy of CRNN by over 13% and of ASTER and MORAN by nearly 9.0% compared to synthetic SR data. Moreover, our TSRN clearly outperforms 7 state-of-the-art SR methods in boosting the recognition accuracy of LR images in TextZoom. For example, it outperforms LapSRN by over 5% and 8% on the recognition accuracy of ASTER and CRNN. Our results suggest that low-resolution text recognition in the wild is far from being solved, thus more research effort is needed.


1 Introduction

Scene text recognition is a fundamental and important task in computer vision, since it is usually a key step towards many downstream text-related applications, including document retrieval, card recognition, license plate recognition, etc. [39, 38, 47, 3]. Scene text recognition has achieved remarkable success due to the development of Convolutional Neural Networks (CNNs).

Many accurate and efficient methods have been proposed for constrained scenarios (e.g., text in scanned copies or network images). Recent works focus on text in natural scenes [28, 29, 7, 31, 41, 48, 45, 46], which is much more challenging due to the high diversity of text in blur, orientation, shape, and resolution. A thorough survey of recent advances in text recognition can be found in [30]. Modern text recognizers have achieved impressive results on clear text images. However, their performance drops sharply when recognizing low-resolution text images [1]. The main difficulty in recognizing LR text is that optical degradation blurs the shapes of the characters. Therefore, it would be promising to introduce SR methods as a pre-processing step before recognition. To our surprise, no real dataset or corresponding method focuses on scene text SR.

In this paper, we propose a paired scene text SR dataset, termed TextZoom, which is the first dataset focusing on real text SR. Previous scene text SR methods [8, 23, 26, 27, 25, 51, 24] generate LR counterparts of the high-resolution (HR) images by simply applying uniform degradation such as bicubic interpolation or blur kernels. Unfortunately, real blurred scene text images exhibit far more varied degradation: scene texts appear in arbitrary shapes, under uneven illumination, and against diverse backgrounds, which makes super-resolution on scene text images much more challenging. Therefore, the proposed TextZoom, which contains paired LR and HR text images of the same text content, is necessary. The TextZoom dataset is cropped from two newly proposed SISR datasets [5, 50]. Our dataset has three main advantages. (1) It is well annotated: we provide the direction, the text content, and the original focal length of each text image. (2) It contains abundant text from different natural scenes, including street views, libraries, shops, vehicle interiors, and so on. (3) It is carefully divided into three subsets by difficulty. Experiments on TextZoom demonstrate that our TSRN largely improves the recognition accuracy of CRNN by over 13% compared to synthetic SR data. The annotation and allocation strategy is briefly introduced in Section 3 and described in detail in the supplementary materials.

Moreover, to reconstruct low-resolution text images, we propose a text-oriented end-to-end method. Traditional SISR methods focus only on reconstructing texture details and satisfying human visual perception. However, scene text SR is a special task because the images carry high-level text content: the fore-and-aft characters are related to one another, and a single blurred character does not prevent a human from recognizing the whole word as long as the other characters are clear. To exploit this, we first present a Sequential Residual Block to model recurrent information in text lines, which enables us to build correlations among the fore-and-aft characters. Secondly, we propose a boundary-aware loss, termed gradient profile loss, to reconstruct sharp character boundaries. This loss helps distinguish characters from backgrounds and generates more explicit shapes. Thirdly, misalignment between the paired images is inevitable due to the inaccuracy of the cameras, so we propose a central alignment module to make the corresponding pixels better aligned. We evaluate recognition accuracy in two steps: (1) super-resolve the LR text images with different methods; (2) evaluate the SR text images with trained text recognizers, e.g., ASTER, MORAN, and CRNN. Extensive experiments show that our TSRN clearly outperforms 7 state-of-the-art SR methods in boosting the recognition accuracy of LR images in TextZoom; for example, it outperforms LapSRN by over 5% and 8% in the recognition accuracy of ASTER and CRNN. Our results suggest that low-resolution text recognition in the wild is far from being solved, and more research effort is needed.
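For concreteness, a minimal sketch of this two-step evaluation protocol is given below. It is not the authors' released code: `sr_model` and `recognizer` are placeholders for a trained SR network and a frozen pre-trained recognizer such as ASTER, MORAN, or CRNN.

```python
import torch

@torch.no_grad()
def evaluate_sr_recognition(sr_model, recognizer, loader):
    """Two-step protocol: (1) super-resolve each LR crop, (2) recognize the SR output
    with a frozen recognizer and count exact (case-insensitive) string matches."""
    correct = total = 0
    for lr_img, gt_text in loader:          # lr_img: 1 x C x H x W tensor, gt_text: str
        sr_img = sr_model(lr_img)           # step 1: super-resolution
        pred = recognizer(sr_img)           # step 2: recognition, returns a string
        correct += int(pred.lower() == gt_text.lower())
        total += 1
    return correct / max(total, 1)
```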

The contributions of this work are therefore three-fold:

  1. We introduce the first real paired scene text SR dataset, TextZoom, captured with different focal lengths. We annotate the dataset and allocate it into three subsets: easy, medium, and hard.

  2. We prove the superiority of the proposed dataset TextZoom by comparing and analyzing the models trained on synthetic LR and proposed LR images. We also prove the necessity of scene text SR from different aspects.

  3. We propose a new text super-resolution network with three novel modules. It clearly surpasses 7 representative SR methods when all are trained and tested on TextZoom for fair comparison.

2 Related work

Super-Resolution.

Super-resolution aims to output a plausible high-resolution image that is consistent with a given low-resolution image. Traditional approaches, such as bilinear, bicubic, or designed filtering, leverage the insight that neighboring pixels usually exhibit similar colors and generate the output by interpolating between the colors of neighboring pixels according to a predefined formula. In the deep learning era, super-resolution is treated as a regression problem, where the input is the low-resolution image and the target output is the high-resolution image [8, 23, 26, 25, 27, 51, 24]. A deep neural network is trained on the input and target output pairs to minimize some distance metric between the prediction and the ground truth. These works are mainly trained and evaluated on popular datasets [2, 49, 33, 15, 34, 44] in which LR images are generated by down-sampling interpolation or a Gaussian blur filter. Recently, several works capture LR-HR image pairs by adjusting the focal length of the cameras [5, 50, 6]. In [5, 6], a pre-processing method is applied to reduce the misalignment between the captured LR and HR images, while in [50], a contextual bilateral loss is proposed to handle the misalignment.

In this work, a new dataset, TextZoom, is proposed, which fills the absence of a paired scene text SR dataset. It is well annotated and allocated by difficulty. We hope it can serve as a challenging benchmark.

Text Recognition. Early work adopts a bottom-up fashion [19], which first detects individual characters and then integrates them into a word, or a top-down manner [17], which treats the word image patch as a whole and recognizes it as a multi-class image classification problem. Considering that scene text generally appears as a character sequence, CRNN [40] regards it as a sequence recognition problem and employs Recurrent Neural Networks (RNNs) to model the sequential features. CTC loss [11] is often combined with the RNN outputs to calculate the conditional probability between the predicted sequences and the target [28, 29]. Recently, an increasing number of recognition approaches based on the attention mechanism have achieved significant improvements [7, 31]. ASTER [41] rectifies oriented or curved text with a Spatial Transformer Network (STN) [18] and then performs recognition using an attentional sequence-to-sequence model.

In this work, we choose the state-of-the-art recognizers ASTER [41], MORAN [31], and CRNN [40] as baselines to evaluate the recognition accuracy of the SR images.

Scene Text Image Super-Resolution.

Some previous works on scene text image super-resolution aim to improve recognition accuracy and image quality metrics. [32] compared the performance of several artificial filters on down-sampled text images. [36] proposed a convolution-transposed-convolution architecture for binary document SR. [9] adapted SRCNN [8] to text image SR in the ICDAR 2015 TextSR competition [37] and achieved good performance, but no text-oriented method was proposed.

These works take a step toward low-resolution text recognition, but they train only on down-sampled images, learning to regress a simple inverse of bicubic (or bilinear) interpolation. Since all the LR images are generated by the same simple down-sampling formulation, such models do not generalize well to real text images.

3 TextZoom Dataset

Data Collection & Annotation. Our proposed dataset TextZoom comes from two state-of-the-art SISR datasets: RealSR [5] and SR-RAW [50]. These two newly proposed datasets consist of paired LR-HR images captured by digital cameras.

RealSR [5] is captured at four focal lengths with two digital cameras: a Canon 5D3 and a Nikon D810. In RealSR [5], the images at these four focal lengths serve as the ground truth, the 2X LR images, the 3X LR images, and the 4X LR images, respectively.

SR-RAW is collected at seven different focal lengths with a Sony FE camera, ranging from 24mm to 240mm. The images captured at shorter focal lengths can be used as LR images, while those captured at longer focal lengths serve as the corresponding ground truth. For SR-RAW, we annotate the bounding boxes of the words on the 240mm focal-length images.

For RealSR, we annotate the bounding boxes of the words on the 105mm focal-length images. In other words, we label the image with the largest focal length in each group and crop the text boxes from the rest using the same rectangle, so misalignment is unavoidable. Some annotated text boxes are top-down or vertical; in this task, we rotate all of them to horizontal for better recognition. There are only a few curved text images in our dataset. For each pair of LR-HR images, we provide the case-sensitive character string (including punctuation), the type of the bounding box, and the original focal lengths. The detailed annotation principles for the text images cropped from SR-RAW and RealSR are given in the supplementary materials.

Selected by height. The sizes of the cropped text boxes are diverse (e.g., heights from 7 to 1700 pixels), so it is not suitable to treat the text images cropped from the same focal length as a single domain. We define our selection principle following these considerations. (1) Patch or not. In SISR, data are usually generated by cropping patches from the original images [26, 25, 10, 5, 50]. Text images cannot be cut into patches, since the shapes of the characters should remain complete. (2) Accuracy distribution. We divide the text images by height and test the accuracy (refer to the tables in the supplementary materials). We found that the accuracy does not increase obviously once the height exceeds 32 pixels; setting images to a height of 32 pixels is also a customary rule in scene text recognition research [40, 7, 31]. The accuracy on images smaller than 8 pixels is too low to have any value for super-resolution, so we discard images whose height is less than 8 pixels. (3) Number. Among the cropped text images, heights from 8 to 32 pixels account for the majority. (4) No down-sampling. Since interpolation degradation should not be introduced into real blurred images, we only up-sample the LR images to a relatively larger size.

Following these four considerations, we up-sample the images with heights of 16-32 pixels to a height of 32 pixels, and the images with heights of 8-16 pixels to a height of 16 pixels. We conclude that (16, 32) is a good height pair for a 2X training set for the scene text SR task. For example, a text image taken at a 150mm focal length with a height of 16-32 pixels is taken as the ground truth for its 70mm counterpart. We therefore select all the images whose heights range from 16 to 32 pixels as ground-truth images and up-sample them to 128×32 (width×height), and the corresponding 2X LR images to 64×16 (width×height). For this task, we only generate this 2X LR-HR pair dataset from the annotated text images, mainly due to the special characteristics of text recognition. Other scale factors of our annotated images could be used for different purposes.
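A minimal sketch of this selection and resizing rule is given below, assuming PIL images for the cropped boxes; the authors' exact filtering code is not shown here, so this is only illustrative.

```python
from PIL import Image

def make_2x_pair(lr_crop, hr_crop):
    """Keep pairs whose HR crop is 16-32 px tall, discard crops under 8 px,
    and only up-sample (never down-sample) to the fixed training sizes."""
    if not (16 <= hr_crop.height <= 32) or lr_crop.height < 8:
        return None
    hr = hr_crop.resize((128, 32), Image.BICUBIC)   # ground truth: 128 x 32 (width x height)
    lr = lr_crop.resize((64, 16), Image.BICUBIC)    # 2X LR input:  64 x 16
    return lr, hr
```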

Allocation of TextZoom. SR-RAW and RealSR are collected by different cameras with different focal lengths. The distance to the objects also affects the legibility of the images. The dataset should therefore be further divided following this distribution.

The train and test sets are cropped from the original train and test sets of SR-RAW and RealSR separately. The authors of SR-RAW used a larger distance from the camera to the subjects to minimize perspective shift [50], so the accuracy on text images from SR-RAW is relatively lower than on RealSR at similar focal lengths. The accuracy on images cropped from the 100mm focal length in SR-RAW is 52.1% tested by ASTER [41], while the accuracy on those from 105mm in RealSR is 75.0% (refer to the tables in the supplementary materials). At the same height, images from smaller focal lengths are more blurred. With this in mind, we allocate our dataset into three subsets by difficulty: the LR images cropped from RealSR form the easy subset; the LR images from SR-RAW whose focal lengths are larger than 50mm form the medium subset; the rest form the hard subset.

In this task, our main purpose is to increase the recognition accuracy on the easy, medium, and hard subsets. We also report peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) results in the supplementary materials.

Dataset Statistics. The scene text images in TextZoom contain abundant text, including common English words, mixed character strings, and digits. The shapes and directions of the bounding boxes are also diverse. Detailed statistics of TextZoom are shown in the supplementary materials, where we list and analyze the types of bounding boxes, the distribution of characters, and the types of text content.

4 Method

In this section, we present our proposed method TSRN in detail. First, we briefly describe our pipeline in Section 4.1. Then we describe the proposed Sequential Residual Block. Third, we introduce our central alignment module. Finally, we introduce a new gradient profile loss to sharpen the text boundaries.

4.1 Pipeline

Figure 3: Illustration of our proposed TSRN. We concatenate a binary mask with the RGB channels as an RGBM 4-channel input. The input is rectified by the central alignment module and then fed into our pipeline. The output is the super-resolved RGB image, supervised by the MSE loss $\mathcal{L}_2$ and the proposed gradient profile loss $\mathcal{L}_{GP}$.

Our baseline is SRResNet [26]. As shown in Fig. 3, we make two main modifications to the structure of SRResNet: 1) we add a central alignment module in front of the network; 2) we replace the original basic blocks with the proposed Sequential Residual Blocks (SRBs). In this work, we concatenate a binary mask with the RGB image as our input. The binary masks are simply generated by thresholding at the mean gray level of the image; detailed information about the masks is given in the supplementary materials. During training, the input is first rectified by the central alignment module. Then CNN layers extract shallow features from the rectified image. Stacking five SRBs, we extract deeper, sequence-dependent features and apply a shortcut connection following ResNet [13]. The SR images are finally generated by an up-sampling block and a CNN. We also design a gradient prior loss ($\mathcal{L}_{GP}$) to enhance the shape boundaries of the characters. The output of the network is supervised by the MSE loss ($\mathcal{L}_2$) and our proposed gradient profile loss ($\mathcal{L}_{GP}$).
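The following is a minimal PyTorch sketch of this pipeline, not the authors' implementation; the layer widths, kernel sizes, and the identity stand-in for the alignment module are assumptions for illustration, and the SRB itself is sketched in Section 4.2.

```python
import torch
import torch.nn as nn

class ResidualBlockSketch(nn.Module):
    """Plain residual block placeholder; TSRN replaces these with SRBs (Sec. 4.2)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
    def forward(self, x):
        return x + self.body(x)

class TSRNSketch(nn.Module):
    """Minimal TSRN-style pipeline: align -> shallow CNN -> 5 blocks -> up-sample -> RGB."""
    def __init__(self, n_blocks=5, channels=64, scale=2):
        super().__init__()
        self.align = nn.Identity()   # stands in for the central alignment module (STN)
        self.head = nn.Sequential(nn.Conv2d(4, channels, 9, padding=4), nn.PReLU())
        self.blocks = nn.Sequential(*[ResidualBlockSketch(channels) for _ in range(n_blocks)])
        self.mid = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.BatchNorm2d(channels))
        self.up = nn.Sequential(nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
                                nn.PixelShuffle(scale), nn.PReLU())
        self.tail = nn.Conv2d(channels, 3, 9, padding=4)

    def forward(self, rgbm):                       # rgbm: B x 4 x 16 x 64 (RGB + mask)
        x = self.align(rgbm)                       # rectify the misaligned input first
        shallow = self.head(x)
        deep = self.mid(self.blocks(shallow))
        return self.tail(self.up(shallow + deep))  # global shortcut, then 2x up-sample

sr = TSRNSketch()(torch.randn(2, 4, 16, 64))       # -> torch.Size([2, 3, 32, 128])
```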

4.2 Sequential Residual Block

Previous state-of-the-art SR methods mainly pursue better PSNR and SSIM. Traditional SISR cares only about texture reconstruction and ignores context information, whereas text images have strong sequential characteristics. Our ultimate goal is to train an SR network that can reconstruct the context information of text images. In text recognition, the context information of scene text images is encoded with a Recurrent Neural Network (RNN) [14]. Inspired by this, we modify the residual block [26] by adding a Bi-directional LSTM (BLSTM) mechanism. Following [43], we build sequential connections along horizontal text lines and fuse the features into deeper channels. Different from [43], we build the in-network recurrence not for detection but for low-level reconstruction, so we only adopt the idea of building sequence dependence along the text line. The SRB is briefly illustrated in Fig. 3. First, we extract features with a CNN. Then we permute and reshape the feature map so that a horizontal text line can be encoded as a sequence. The BLSTM can propagate error differentials [40]; it converts the feature maps into feature sequences and feeds the result back to the convolutional layers. To make the sequence dependence robust to tilted text images, we apply BLSTMs in two directions, horizontal and vertical. The BLSTM takes the horizontal and vertical convolutional features as sequential inputs and updates its internal state recurrently in the hidden layer:

$H_t = \mathrm{BLSTM}\big(x_t,\; H^{h}_{t-1},\; H^{v}_{t-1}\big) \qquad (1)$

Here $H_t$ denotes the hidden layer, $x_t$ denotes the input features, and $H^{h}_{t-1}$ and $H^{v}_{t-1}$ denote the recurrent connections from the horizontal and vertical directions, respectively.
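A minimal sketch of an SRB along these lines is given below; the channel width (64) and hidden size (32) follow the paper, while the exact ordering of the convolutions and the row/column reshaping are assumptions.

```python
import torch
import torch.nn as nn

class SRBSketch(nn.Module):
    """Sketch of a Sequential Residual Block: conv layers plus horizontal/vertical BLSTMs."""
    def __init__(self, channels=64, hidden=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.blstm_h = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)
        self.blstm_v = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)

    def _run_blstm(self, x, lstm):
        # x: B x C x H x W -> one sequence per row, stepping along the width
        b, c, h, w = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)   # (B*H, W, C)
        out, _ = lstm(seq)                                 # (B*H, W, 2*hidden) == (B*H, W, C)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)

    def forward(self, x):
        y = self.conv(x)
        y = self._run_blstm(y, self.blstm_h)                                  # horizontal lines
        y = self._run_blstm(y.transpose(2, 3), self.blstm_v).transpose(2, 3)  # vertical direction
        return x + y                                                          # residual shortcut

feat = SRBSketch()(torch.randn(2, 64, 16, 64))   # -> torch.Size([2, 64, 16, 64])
```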

4.3 Central Alignment Module

The misalignment makes pixel-to-pixel losses, such as $\mathcal{L}_1$ and $\mathcal{L}_2$, generate significant artifacts and double shadows. This is mainly due to the misalignment of pixels in the training data: since some text pixels in the LR images spatially correspond to background pixels in the HR images, the network may learn wrong pixel-wise correspondences. As mentioned in Section 3, the text regions in HR images are more centrally aligned than those in LR images, so we introduce an STN [18] as our central alignment module. The STN is a spatial transformer network that can rectify images and be learned end-to-end. Since most of the misalignment of the text regions is merely a horizontal or vertical translation, we adopt an affine transformation as the transform. Once the text regions in the LR images are aligned near the center, the pixel-wise losses perform better and the artifacts are relieved. More detailed information about the central alignment module is given in the supplementary materials.
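Below is a minimal sketch of an STN-style affine alignment module of the kind described here; the localization network layout is an illustrative assumption rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralAlignSketch(nn.Module):
    """STN-style affine alignment: predict per-image affine parameters, then resample."""
    def __init__(self, in_channels=4):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 6))
        # initialise to the identity transform so training starts from "no shift"
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                    # per-image affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)    # re-centre the text region

aligned = CentralAlignSketch()(torch.randn(2, 4, 16, 64))     # same size as the input
```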

4.4 Gradient Profile Loss

The Gradient Profile Prior (GPP) was proposed in [42] to generate sharper edges in the SISR task. [42] proposed a transformation on the gradient field that squeezes the curve of the gradient profile by a ratio, transforming the image into a sharper version. This method predates the deep learning era, so it merely sharpens the curve of the gradient field without supervision.

Since we have a paired text super-resolution dataset, we can use the gradient field of the HR images as the ground truth. In the Total Variation loss $\mathcal{L}_{TV}$, the gradient field is used to remove noise by minimizing the magnitude of the gradients. This makes textures smoother in scene images, so it can be used in SISR [26]. Our $\mathcal{L}_{GP}$ serves the opposite function of $\mathcal{L}_{TV}$: it is not a smoothness constraint.

Generally, text images contain only two colors: characters and background. This means text images do not contain complex textures; what we should care about is only the boundary between characters and background. Better image quality therefore means sharper, not smoother, character boundaries. Sometimes the gradient field is not exactly the boundary between background and characters, e.g., when the background is not a pure color, but most cases satisfy our purpose and are useful for training.

(a) Gradient field of HR image.
(b) Gradient field of LR image.
(c) Illustration of $\mathcal{L}_{GP}$.
Figure 4: The illustration of gradient field and gradient profile loss.

We revisit the GPP, generate the ground truth from the HR images, and define the loss function as

$\mathcal{L}_{GP} = \mathbb{E}\big[\, \| \nabla I_{HR} - \nabla I_{SR} \|_1 \,\big], \qquad (2)$

where $\nabla I_{HR}$ denotes the gradient field of the HR images and $\nabla I_{SR}$ denotes that of the SR images.

Our proposed $\mathcal{L}_{GP}$ exhibits two advantageous properties. (1) The gradient field vividly shows the characteristics of text images: the texts and the backgrounds. (2) LR images always come with wider gradient-field curves, while HR images have thinner ones, and the gradient-field curve can be easily computed. This ensures a reliable supervision label.

A visualized demonstration of $\mathcal{L}_{GP}$ is shown in Fig. 4. With the gradient field of the HR images, we can squeeze the curve of the gradient profile into a thinner one without a complex mathematical formulation (Fig. 4(c)).
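A minimal sketch of this loss is given below, using simple finite differences as the gradient operator; the particular gradient operator is an assumption, not prescribed by the text above.

```python
import torch
import torch.nn.functional as F

def gradient_field(img):
    """Horizontal/vertical finite differences of an image batch (B x C x H x W)."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx, dy

def gradient_profile_loss(sr, hr):
    """Sketch of L_GP (Eq. 2): L1 distance between the gradient fields of SR and HR images."""
    sr_dx, sr_dy = gradient_field(sr)
    hr_dx, hr_dy = gradient_field(hr)
    return F.l1_loss(sr_dx, hr_dx) + F.l1_loss(sr_dy, hr_dy)

# Combined with the pixel-wise MSE term (both weights set to 1, as in Sec. 5.2):
# loss = F.mse_loss(sr, hr) + gradient_profile_loss(sr, hr)
```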

5 Experiments

5.1 Datasets

We train the SR methods on the training set of our proposed TextZoom (see Section 3). We evaluate our models on the three subsets: easy, medium, and hard. To avoid down-sampling degradation, all the LR images are up-sampled to 64×16 and the HR images to 128×32.

In this work, we do not evaluate our models on popular datasets such as ICDAR 2015 [21] and ICDAR 2013 [22], mainly due to their different distribution: the text images in these datasets are small because of interpolation down-sampling rather than capture distance or focal length. We describe this reason in detail in the supplementary materials.

5.2 Implementation Details

During training, we set the trade-off weights of $\mathcal{L}_2$ and $\mathcal{L}_{GP}$ both to 1. We use the Adam optimizer with momentum term 0.9. When evaluating recognition accuracy, we use the official PyTorch code of ASTER and the released model from https://github.com/ayumiymk/aster.pytorch. In the supplementary materials, we use the official PyTorch code and released models of CRNN (https://github.com/meijieru/crnn.pytorch) and MORAN (https://github.com/Canjie-Luo/MORAN_v2). All the SR models are trained for 500 epochs on 4 NVIDIA GTX 1080 Ti GPUs. The batch size follows the setting in the original papers.
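A minimal sketch of the training step implied by these settings is shown below; `model`, `loader`, and `gradient_profile_loss` are placeholders (the latter following the sketch in Section 4.4), and the learning rate is an assumption since it is not stated here.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=500, device='cuda'):
    """Sketch of one training run: equal L2 / L_GP weights, Adam with momentum 0.9."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))  # lr is an assumption
    model.to(device).train()
    for _ in range(epochs):
        for lr_img, hr_img in loader:          # RGBM LR input and RGB HR target
            sr = model(lr_img.to(device))
            hr = hr_img.to(device)
            loss = F.mse_loss(sr, hr) + gradient_profile_loss(sr, hr)   # weights 1 : 1
            opt.zero_grad()
            loss.backward()
            opt.step()
```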

5.3 Synthetic LR vs. TextZoom LR

To demonstrate the superiority of paired scene text SR images, we compare models trained on a synthetic dataset and on our TextZoom dataset. Traditional SISR tasks simply down-sample the HR image with bicubic interpolation to generate the corresponding LR image. To illustrate the superiority of real LR over synthetic LR, we train our models on bicubic down-sampled LR images and on real LR images and compare their performance.
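For reference, the 'Syn' setting can be produced with a one-line bicubic down-sampling of the HR crop; a sketch, assuming PIL images:

```python
from PIL import Image

def synthetic_lr(hr_img, scale=2):
    """Synthetic LR: simply bicubic-down-sample the HR image by the scale factor,
    in contrast to the real zoomed-out captures in TextZoom."""
    w, h = hr_img.size
    return hr_img.resize((w // scale, h // scale), Image.BICUBIC)
```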

Method | Train data | Accuracy of ASTER [41] (easy / medium / hard / average) | Accuracy of MORAN [31] (easy / medium / hard / average) | Accuracy of CRNN [40] (easy / medium / hard / average)
BICUBIC | - | 64.7% / 42.4% / 31.2% / 47.2% | 60.6% / 37.9% / 30.8% / 44.1% | 36.4% / 21.1% / 21.1% / 26.8%
SRResNet | Syn | 66.4% / 44.4% / 32.4% / 48.9% | 61.8% / 39.6% / 31.0% / 45.2% | 37.4% / 21.6% / 21.2% / 27.3%
SRResNet | Real | 69.4% / 47.3% / 34.3% / 51.3% | 60.7% / 42.9% / 32.6% / 46.3% | 39.7% / 27.6% / 22.7% / 30.6%
LapSRN | Syn | 66.5% / 43.9% / 32.2% / 48.7% | 61.8% / 39.0% / 30.7% / 44.9% | 37.5% / 21.8% / 20.9% / 27.3%
LapSRN | Real | 71.5% / 48.6% / 35.2% / 53.0% | 64.6% / 44.9% / 32.2% / 48.3% | 46.1% / 27.9% / 23.6% / 33.3%
TSRN (ours) | Syn | 67.5% / 45.3% / 33.0% / 49.7% | 61.7% / 40.4% / 30.6% / 45.3% | 37.8% / 22.0% / 21.0% / 27.6%
TSRN (ours) | Real | 75.1% / 56.3% / 40.1% / 58.3% | 70.1% / 53.3% / 37.9% / 54.8% | 52.5% / 38.2% / 31.4% / 41.4%
Table 2: Comparison of models trained on synthetic LR and real LR images. The listed results are evaluated on the proposed TextZoom LR images. For better display, we also report the average accuracy. The recognition accuracies are tested with the officially released models of ASTER [41], MORAN [31], and CRNN [40]. 'Syn' denotes down-sampled LR images and 'Real' denotes the proposed LR images.

We selected SRResNet [26], LapSRN [24], and our proposed TSRN, and trained each on the synthetic LR and the real LR datasets for a 2X model, giving 6 models in total, which we evaluated on the proposed TextZoom subsets. From Table 2, we can see that the three methods trained on the real LR (TextZoom) data clearly outperform the models trained on synthetic LR in accuracy. For our TSRN, the model trained on real LR surpasses the one trained on synthetic LR by nearly 9.0% on ASTER and MORAN, and by nearly 14.0% on CRNN.

5.4 Ablation Study on TSRN

To study the effect of each component in TSRN, we gradually modify the configuration of our network and compare the differences to build the best network. For brevity, we only compare the accuracy of ASTER [41].

# | Method | Loss function | Accuracy of ASTER [41] (easy / medium / hard / average)
0 | SRResNet | $\mathcal{L}_2 + \mathcal{L}_{TV} + \mathcal{L}_{p}$ | 69.6% / 47.6% / 34.3% / 51.3%
1 | 5×SRBs | $\mathcal{L}_2$ | 74.5% / 53.3% / 37.3% / 56.2%
2 | 5×SRBs + align | $\mathcal{L}_2$ | 74.8% / 55.7% / 39.6% / 57.8%
3 | 5×SRBs + align (ours) | $\mathcal{L}_2 + \mathcal{L}_{GP}$ | 75.1% / 56.3% / 40.1% / 58.3%
Table 3: Ablation study of different settings of our method TSRN. The recognition accuracies are tested with the officially released model of ASTER [41].

1) SRBs. We add a BLSTM mechanism to the basic residual block of SRResNet [26] to obtain the proposed SRB, the essential component of TSRN. Comparing rows 0 and 1 in Table 3, stacking 5 SRBs boosts the average accuracy by 4.9% over SRResNet [26]. Many text images are partially blurred: the blurred characters are legible to humans when viewed together with the fore-and-aft characters, but are unrecognizable when taken apart. The SRBs can learn this sequential similarity along the text line and reconstruct a better text shape. In the supplementary materials, we further discuss the SRB and our 4-channel input: we stack different numbers of SRBs and vary the number of hidden units, and find that with 5 SRBs, 32 hidden units, and binary masks concatenated with the RGB channels, the network saturates and achieves the best performance.

2) Central Alignment Module. The central alignment module boosts the average accuracy by 1.5%, as shown by method 2 in Table 3. From Fig. 5, we can see that without the central alignment module the artifacts are strong and the characters are twisted, whereas with better alignment we generate higher-quality images, since the pixel-wise loss can supervise the training better. In the supplementary materials, we demonstrate the generality of the central alignment module by plugging it into SRResNet [26], LapSRN [24], and TSRN, and show that it boosts the accuracy of all three SR methods.

3) Gradient Profile Loss. From method 3 in Table 3, the proposed gradient profile loss boosts the average accuracy by 0.5%. Although the increase is slight, the visual results are better (Fig. 5, method 3): some twisted characters become more explicit, such as 'e', 's', and 'f', and the boundaries between characters can be distinguished (see the words 'naturelles', 'supervisor', and 'While' in Fig. 5).

Figure 5: Visual comparison showing the effect of each component in our proposed TSRN. The recognition results of ASTER are displayed under each image; characters in red denote wrong recognition.

5.5 Comparison with State-of-the-Art

To prove the effectiveness of TSRN, we compare it with 7 SISR methods on our TextZoom dataset, including SRCNN [8], VDSR [23], SRResNet [26], RRDB [25], EDSR [27], RDN [51] and LapSRN [24]. All of the networks are trained on our TextZoom training set and evaluated on our three testing subsets.

In Table 4, we list the recognition accuracy of all 7 mentioned methods, along with BICUBIC and the proposed TSRN, tested by ASTER [41], MORAN [31], and CRNN [40]. It can be observed that TSRN outperforms all 7 SISR methods in recognition accuracy by a large margin. Although these 7 SISR methods achieve relatively good accuracy, what matters is the gap between the SR results and BICUBIC: these methods improve the average accuracy by 2.3%-5.8%, while ours improves it by 10.7%-14.6%. Our TSRN also improves the accuracy of all three state-of-the-art recognizers. In the supplementary materials, we report PSNR and SSIM and show that TSRN also surpasses most of the state-of-the-art methods on these metrics.

Method | Loss function | Accuracy of ASTER [41] (easy / medium / hard / average) | Accuracy of MORAN [31] (easy / medium / hard / average) | Accuracy of CRNN [40] (easy / medium / hard / average)
BICUBIC | - | 64.7% / 42.4% / 31.2% / 47.2% | 60.6% / 37.9% / 30.8% / 44.1% | 36.4% / 21.1% / 21.1% / 26.8%
SRCNN [8] | $\mathcal{L}_2$ | 69.4% / 43.4% / 32.2% / 49.5% | 63.2% / 39.0% / 30.2% / 45.3% | 38.7% / 21.6% / 20.9% / 27.7%
VDSR [23] | $\mathcal{L}_2$ | 71.7% / 43.5% / 34.0% / 51.0% | 62.3% / 42.5% / 30.5% / 46.1% | 41.2% / 25.6% / 23.3% / 30.7%
SRResNet [26] | $\mathcal{L}_2 + \mathcal{L}_{TV} + \mathcal{L}_{p}$ | 69.6% / 47.6% / 34.3% / 51.3% | 60.7% / 42.9% / 32.6% / 46.3% | 39.7% / 27.6% / 22.7% / 30.6%
RRDB [25] | $\mathcal{L}_1$ | 70.9% / 44.4% / 32.5% / 50.6% | 63.9% / 41.0% / 30.8% / 46.3% | 40.6% / 22.1% / 21.9% / 28.9%
EDSR [27] | $\mathcal{L}_1$ | 72.3% / 48.6% / 34.3% / 53.0% | 63.6% / 45.4% / 32.2% / 48.1% | 42.7% / 29.3% / 24.1% / 32.7%
RDN [51] | $\mathcal{L}_1$ | 70.0% / 47.0% / 34.0% / 51.5% | 61.7% / 42.0% / 31.6% / 46.1% | 41.6% / 24.4% / 23.5% / 30.5%
LapSRN [24] | $\mathcal{L}_{c}$ | 71.5% / 48.6% / 35.2% / 53.0% | 64.6% / 44.9% / 32.2% / 48.3% | 46.1% / 27.9% / 23.6% / 33.3%
TSRN (ours) | $\mathcal{L}_2 + \mathcal{L}_{GP}$ | 75.1% / 56.3% / 40.1% / 58.3% | 70.1% / 53.3% / 37.9% / 54.8% | 52.5% / 38.2% / 31.4% / 41.4%
Improvement of TSRN | | 10.4% / 13.9% / 8.9% / 11.1% | 9.5% / 15.4% / 7.1% / 10.7% | 16.1% / 17.1% / 10.3% / 14.6%
Table 4: Performance of state-of-the-art SR methods on the three subsets of TextZoom. For better display, we also report the average accuracy. $\mathcal{L}_1$ denotes the Mean Absolute Error (MAE) loss, $\mathcal{L}_2$ the Mean Squared Error (MSE) loss, $\mathcal{L}_{TV}$ the Total Variation loss, $\mathcal{L}_{p}$ the perceptual loss proposed in [20], $\mathcal{L}_{c}$ the Charbonnier loss proposed in LapSRN [24], and $\mathcal{L}_{GP}$ our proposed gradient profile loss. The recognition accuracies are tested with the officially released models of ASTER [41], MORAN [31], and CRNN [40]. The 'Improvement of TSRN' row in the last line reports the accuracy increase of our SR results over the LR inputs.
Figure 6: Visualization results of state-of-the-art SR methods on our proposed TextZoom dataset. The BICUBIC images are bicubic up-sampled LR images; they show how difficult our task is, because the LR inputs are hardly recognizable. The character strings under the images are the recognition results of ASTER [41]; those in red denote wrong recognition.

6 Conclusion and Discussion

In this work, we verify the importance of the scene text image super-resolution task. We propose the TextZoom dataset, which is, to the best of our knowledge, the first real paired scene text image super-resolution dataset. TextZoom is well annotated and is allocated into three subsets: easy, medium, and hard. Through extensive experiments, we demonstrate the superiority of real data over synthetic data. To tackle the text image super-resolution task, we build a new text-oriented SR method with three novel modules. Our TSRN clearly outperforms 7 SR methods by a large margin. The results also show that low-resolution text SR and recognition are far from being solved, and more research effort is needed.

In the future, we will capture more appropriately distributed text images: extremely large and small images will be avoided, and the images will cover more languages, such as Chinese, Korean, and Japanese. We will also explore new methods, such as introducing recognition attention into the text super-resolution task.

Appendix

Appendix 0.A Is Scene Text Image Super-Resolution Necessary?

0.a.1 Training Recognizer on Low-Resolution Images.

One might assume that we could achieve better performance on recognizing low-resolution (LR) text images by directly training the recognition networks on small images, so that the super-resolution step could be removed. This assumption is reasonable because deep neural networks are robust to their training domains. To address it and prove the necessity of super-resolution for text images, we compare the recognition accuracy of three settings:

  • Recognize with the ASTER [41] model trained at the customary size (no less than 32 pixels in height; we use the officially released model here).

  • Use our proposed TSRN to generate the SR images and then recognize them with ASTER [41] official released model.

  • Recognize with a model trained on low-resolution images (in this work, we re-implemented ASTER [41] on Syn90K [16] and SynthText [12] at a size of 64×16; all training details are the same as in the original paper except the input size).

Method | Accuracy (easy / medium / hard / average)
ASTER (Released) | 64.7% / 42.4% / 31.4% / 47.2%
TSRN (ours) + ASTER (Released) | 75.1% / 56.3% / 40.1% / 58.3%
ASTER (ReIm) | 70.1% / 48.3% / 35.9% / 52.6%
Table 5: Comparison between the different settings. 'Released' means the officially released model from GitHub. 'ReIm' means our re-implemented model trained on Syn90K [16] and SynthText [12] at a size of 64×16.

From Table 5, we can see that the re-implemented model does increase the accuracy sharply on the LR images: the average accuracy on BICUBIC inputs increases by 5.4%, from 47.2% to 52.6%. (Note that our re-implemented model also boosts performance on other scene text recognition datasets, so part of the 5.4% increase comes from a better-converged model.) However, it is still much lower than the accuracy on our SR results (TSRN (ours) + ASTER (Released)). SR methods can therefore serve as an effective and convenient pre-processing step for scene text recognition.

0.a.2 Speed and Accuracy.

In this task, we take recognition accuracy as the most important evaluation metric. To determine whether it is worthwhile to increase accuracy at the cost of the extra computation of TSRN, we compare the number of parameters, FLOPs, and inference FPS with and without super-resolution, where inference FPS means the FPS of recognizing the text images with or without SR. Table 6 shows that the proposed method is relatively tiny compared to the recognition networks. The FPS with TSRN is nearly equal to direct recognition for the attention-based recognizers ASTER [41] and MORAN [31]. The FPS of the CTC-based recognizer CRNN decreases when adding TSRN, but the accuracy improvement is considerable. Taking super-resolution as a pre-processing step before recognition is therefore a suitable choice. (All FPS numbers were measured on a single GTX 1080 Ti GPU with the same batch size of 50.)

Computation Cost Analysis
Recognizer | TSRN (ours) | Average Accuracy | FLOPs | Parameters | Inference FPS
ASTER [41] | ✗ | 47.2% | 4.72G | 20.99M | 21.97
ASTER [41] | ✓ | 58.3% (+10.1%) | 4.72G + 0.72G | 20.99M + 2.8M | 21.67
MORAN [31] | ✗ | 44.1% | 0.73G | 20.3M | 63.2
MORAN [31] | ✓ | 54.8% (+10.7%) | 0.73G + 0.72G | 20.3M + 2.8M | 59.6
CRNN [40] | ✗ | 26.8% | 0.64G | 8.3M | 514.7
CRNN [40] | ✓ | 41.4% (+14.6%) | 0.64G + 0.72G | 8.3M + 2.8M | 340.6
Table 6: Computation and speed comparison with and without super-resolution when recognizing TextZoom. '✗' means directly recognizing the BICUBIC up-sampled LR images; '✓' means recognizing after super-resolving the images with our TSRN. Inference FPS is the FPS of recognition with or without SR.
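Numbers of this kind can be reproduced in spirit with a simple measurement loop such as the sketch below (parameter count and images-per-second only; FLOPs counting needs an external tool). The batch size of 50 follows the appendix; the input shape and everything else are assumptions, and a CUDA GPU is assumed.

```python
import time
import torch

def params_and_fps(model, input_size=(50, 4, 16, 64), n_batches=20, device='cuda'):
    """Count parameters and measure images-per-second over repeated forward passes."""
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*input_size, device=device)
    model.to(device).eval()
    with torch.no_grad():
        torch.cuda.synchronize()                 # assumes a CUDA device is available
        start = time.time()
        for _ in range(n_batches):
            model(x)
        torch.cuda.synchronize()
    fps = n_batches * input_size[0] / (time.time() - start)
    return n_params, fps
```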

Appendix 0.B Extensive Experiments on Our Method

0.b.1 Binary Mask.

In text images, the characters are usually of a uniform color; the only texture information is the character color and the background color. We therefore concatenate a binary mask with the text image as input (Fig. 7), where the character regions are set to 1 and the background regions to 0. This input can be viewed as a prior semantic segmentation label for the text image, since most text images contain only two colors: the text color and the background color. The masks are simply generated by thresholding at the average gray level of the RGB image.
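A minimal sketch of this mask construction is shown below; the polarity handling (which side of the mean counts as "character") is an assumption, and dark text on a light background would need the opposite comparison.

```python
import numpy as np
from PIL import Image

def make_rgbm(path):
    """Threshold the gray image at its mean to get a binary mask, then stack it as a 4th channel."""
    rgb = np.asarray(Image.open(path).convert('RGB'), dtype=np.float32) / 255.0
    gray = rgb.mean(axis=2)
    mask = (gray > gray.mean()).astype(np.float32)           # characters vs. background
    return np.concatenate([rgb, mask[..., None]], axis=2)    # H x W x 4 (RGBM)
```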

Figure 7: Demonstration of the binary mask concatenated with the RGB input.
Table 7: Ablation study of the binary mask.
Ablation Study of Masks
Configuration | Mask | Accuracy (easy / medium / hard)
5 SRBs | ✗ | 73.9% / 51.6% / 36.0%
5 SRBs | ✓ | 74.5% / 53.3% / 37.3%

0.b.2 Discussion about SRB.

To find the best architecture for the SRB, we gradually modify its two essential configurations: the number of hidden units and the number of blocks. Our method uses 5 SRBs with 32 hidden units each. In this section, we ablate these two components separately.

1) Hidden Units. The BLSTMs are used to build sequence dependence along the text lines, so we hypothesized that more hidden units would yield better performance. We compare 0, 16, 32, 64, and 128 hidden units, where 0 hidden units corresponds to SRResNet. The results show that the network achieves the best accuracy with 32 hidden units (Table 8); too many hidden units lower the performance, since the sequence dependence is already well built.

Ablation Study of Hidden Units
SRBs | Hidden Units | Accuracy (easy / medium / hard)
5 | 0 | 69.6% / 48.3% / 34.3%
5 | 16 | 71.6% / 52.1% / 36.3%
5 | 32 | 74.5% / 53.3% / 37.3%
5 | 64 | 71.9% / 50.8% / 35.8%
5 | 128 | 71.4% / 47.3% / 33.1%
Table 8: Comparison between different numbers of hidden units of our proposed method on TextZoom.

Ablation Study of SRBs
SRBs | Hidden Units | Accuracy (easy / medium / hard)
4 | 32 | 73.3% / 52.1% / 35.8%
5 | 32 | 74.5% / 53.3% / 37.3%
6 | 32 | 74.1% / 52.7% / 37.0%
7 | 32 | 72.3% / 50.9% / 35.6%
Table 9: Comparison between different numbers of SRBs of our proposed method on TextZoom.

2) Block Number. To see whether a deeper network performs better, we stack different numbers of SRBs and compare the performance. In Table 9, we compare our method with 4, 5, 6, and 7 SRBs. More SRBs do not necessarily boost performance; the accuracy with 7 SRBs even decreases noticeably. With 5 SRBs, the network saturates and achieves the best performance.

The configuration of our Sequential Residual Block is shown in Table 10.

Type | Configurations
Feature map | B × 64 × Height × Width
Convolution | #maps: 64, k: 3×3, s: 1, p: 1
BatchNormalization |
PReLU |
Convolution | #maps: 64, k: 3×3, s: 1, p: 1
BatchNormalization |
Convolution | #maps: 64, k: 1×1, s: 1, p: 0
Permutation |
Bi-LSTM | #hidden_units: 32
Map-to-Sequence |
Permutation |
Bi-LSTM | #hidden_units: 32
Map-to-Sequence |
Permutation |
Shortcut connection |
Feature map | B × 64 × Height × Width
Table 10: Network configuration summary. The first row is the top layer. 'k', 's', and 'p' stand for kernel size, stride, and padding size, respectively.

0.b.3 PSNR and SSIM.

To calculate PSNR [dB] and SSIM, we borrow the code from https://github.com/open-mmlab/mmsr. From Table 11, our PSNR on the medium and hard subsets is not as good, because PSNR is calculated pixel-to-pixel while SSIM uses an 11×11 sliding kernel; the central alignment module introduces a slight pixel shift, so our PSNR is somewhat lower than that of other SR methods. PSNR and SSIM usually cannot fully represent the visual quality of images [26], and in this task they are also less important than accuracy.
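The metrics themselves come from the mmsr code; for reference, a minimal PSNR sketch over float images in [0, 1] looks like this (SSIM, with its 11×11 sliding window, is omitted for brevity):

```python
import numpy as np

def psnr(sr, hr, max_val=1.0):
    """Pixel-wise PSNR between two float arrays in [0, max_val]."""
    mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```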

Method Loss Function PSNR SSIM
easy medium hard easy medium hard
BICUBIC 22.35 18.98 19.39 0.7884 0.6254 0.6592
SRCNN [8] 23.48 19.06 19.34 0.8379 0.6323 0.6791
VDSR [23] 24.62 18.96 19.79 0.8631 0.6166 0.6989
SRResNet [26] 24.36 18.88 19.29 0.8681 0.6406 0.6911
RRDB [25] 22.12 18.35 19.15 0.8351 0.6194 0.6856
EDSR [27] 24.26 18.63 19.14 0.8633 0.6440 0.7108
RDN [51] 22.27 18.95 19.70 0.8249 0.6427 0.7113
LapSRN [24] 24.58 18.85 19.77 0.8556 0.6480 0.7087
TSRN(ours) 25.07 18.86 19.71 0.8897 0.6676 0.7302
Table 11: PSNR and SSIM results of different SR methods on TextZoom.

Appendix 0.C Central Alignment Module

Our central alignment module is based on the Spatial Transformer Network [18]. The network predicts a set of control points, and the image is then rectified by a Thin-Plate-Spline (TPS) [4] transformation. The central alignment module mainly applies horizontal or vertical shifts, but sometimes the background regions need different transformation scales to place the character region more centrally, so we use the TPS transformation to keep the transformation flexible. As shown in Fig. 8, the transformation differs between control points.

Figure 8: Demonstration of central alignment module.

0.c.1 Performance on Manual Enlarged Misalignment.

The ablation study shows that the central alignment module improves the average accuracy by less than 2.0%. It can perform better on more misaligned text image pairs. To prove this, we apply a data augmentation that generates more misaligned pairs: we crop TextZoom with a box of 90% of the original width and height, randomly slid over the LR image, to obtain a 90%×90% region, while the HR images are left uncropped. We train on the cropped dataset and evaluate on TextZoom. Table 12 shows the performance of the central alignment module on this manually enlarged misalignment; the accuracy is sharply improved.
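A minimal sketch of this augmentation, assuming PIL images (the 90% ratio follows the text; the uniformly random placement is an assumption):

```python
import random
from PIL import Image

def enlarge_misalignment(lr_img, ratio=0.9):
    """Randomly crop a 90%-width x 90%-height window from the LR image only,
    which enlarges the LR/HR misalignment; the HR image is left untouched."""
    w, h = lr_img.size
    cw, ch = int(w * ratio), int(h * ratio)
    x0 = random.randint(0, w - cw)
    y0 = random.randint(0, h - ch)
    return lr_img.crop((x0, y0, x0 + cw, y0 + ch))
```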

Method | Accuracy (easy / medium / hard / average)
5 SRBs | 66.8% / 50.0% / 35.0% / 51.6%
5 SRBs + Align | 74.4% / 55.6% / 38.8% / 57.4%
Improvement | +7.6% / +5.6% / +3.8% / +5.8%
Table 12: Performance with and without the central alignment module on TextZoom when trained on the manually enlarged misaligned data.
Figure 9: Comparison with and without central alignment on the enlarged misaligned data. The character strings under the images are the recognition results of ASTER [41]; those in red denote wrong recognition. The difference between with and without alignment is obvious.

Visualization results are shown in Fig. 9. The third row shows SR images trained without alignment: the double shadows and artifacts are severe when training without the central alignment module. We can also see that many words are still correctly recognized even with strong double shadows.

0.c.2 Plugged into Other SR Methods.

In this study, we compare performance with and without the central alignment module on our dataset (Table 13). We report six models: with and without the central alignment module for SRResNet, LapSRN, and our method, respectively. The improvement brought by central alignment on all these methods shows that it is a conveniently pluggable module for SR networks that improves performance in every case.

Method | Alignment | Accuracy (easy / medium / hard / average)
SRResNet | ✗ | 69.6% / 47.6% / 34.3% / 51.7%
SRResNet | ✓ | 70.0% / 49.6% / 36.0% / 53.0%
LapSRN | ✗ | 71.5% / 48.6% / 35.2% / 53.0%
LapSRN | ✓ | 71.7% / 50.3% / 35.7% / 53.7%
5 SRBs | ✗ | 74.5% / 53.3% / 37.3% / 56.2%
5 SRBs | ✓ | 74.8% / 55.7% / 39.6% / 57.8%
Table 13: Comparison with and without the Central Alignment Module on TextZoom.

0.c.3 Comparison with CoBi Loss.

CoBi loss was proposed in [50] to tackle misalignment. It is based on the Contextual loss [35]; it modifies the nearest-neighbor search and considers local contextual similarities with weighted spatial awareness. CoBi loss uses pre-trained VGG-19 features, selecting several conv layers as deep features; its formulation is given in [50]. The results are shown in Table 14. CoBi loss is less practical in this task because the pre-trained model was trained on a classification dataset.
Method | Accuracy (easy / medium / hard)
CoBi loss | 74.0% / 51.6% / 36.0%
$\mathcal{L}_2$ + alignment (ours) | 74.8% / 55.7% / 39.6%
Table 14: Comparison between CoBi loss and the central alignment module.

Appendix 0.D Detailed Information of TextZoom.

0.d.1 Annotation of SR-RAW and RealSR.

SR-RAW [50] is collected at seven different focal lengths with a Sony FE camera, ranging from 24mm to 240mm; we demonstrate it in Fig. 10. There are 500 images in total in the SR-RAW dataset, 450 in the train set and 50 in the test set. The images are aligned via field-of-view (FOV) matching and a geometric transformation. The images captured at shorter focal lengths can be used as LR images, while those captured at longer focal lengths serve as the corresponding ground truth. The authors of SR-RAW [50] applied a down-sampling operation to compensate when the focal-length ratio does not match the scale factor precisely; for example, when using (35mm, 150mm) pairs to train a 4X model, the 150mm images are first down-sampled to an equivalent 140mm. We follow this strategy in our dataset pre-processing. We annotate all the images taken at the 240mm focal length that contain recognizable text in SR-RAW. As shown in Fig. 10, the focal length decreases from left to right, from 240mm to 24mm; the smaller the focal length, the wider the field of view and the lower the resolution of the text region. The annotated text images have the same text content but different resolutions. We display three groups in Fig. 10: 'STAR', 'QUEST', and '510-401-4657'; in this image, the text crops from 35mm and 24mm are hardly recognizable. How many clear images a group of seven contains mainly depends on the height of the original box in the 240mm image.
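The ratio-matching rule can be written down in one line; a tiny sketch of the arithmetic (the helper name is ours, not from [50]):

```python
def matched_long_focal(short_focal_mm, scale):
    """The long-focal image is down-sampled until the effective focal-length ratio equals
    the training scale, e.g. a 4X model from 35mm needs 35 * 4 = 140mm, so 150mm -> 140mm."""
    return short_focal_mm * scale

assert matched_long_focal(35, 4) == 140
```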

Figure 10: The demonstration of the SR-RAW paired images and how we cropped text images.

In Table 15, we show the information of the cropped text images in SR-RAW. Some groups in the original images do not have the 7th image, so the number of 24mm images is smaller than the others. From the table we can see that the recognition accuracy decreases obviously as the resolution degrades. We use the released ASTER [41] model to test the accuracy.

Text Images in the Train Set of SR-RAW
Focal Length | 240mm | 150mm | 100mm | 70mm | 50mm | 35mm | 24mm
Original Image Number | 393 | 393 | 393 | 393 | 393 | 393 | 365
Text Box Number | 9160 | 9160 | 9160 | 9160 | 9160 | 9160 | 8119
Recognition Accuracy (%) | 81.4 | 69.0 | 52.1 | 38.6 | 25.7 | 15.0 | 7.9
Text Images in the Test Set of SR-RAW
Focal Length | 240mm | 150mm | 100mm | 70mm | 50mm | 35mm | 24mm
Original Image Number | 50 | 50 | 50 | 50 | 50 | 50 | 48
Text Box Number | 1734 | 1734 | 1734 | 1734 | 1734 | 1734 | 1630
Recognition Accuracy (%) | 72.4 | 65.3 | 54.4 | 35.6 | 23.2 | 13.6 | 6.3
Table 15: Detailed information of the text images cropped from the SR-RAW dataset. The 2nd to 7th groups of text images are cropped following the annotated bounding boxes in the 1st group.

RealSR [5] is captured by two Digital Single Lens Reflex (DSLR) cameras, a Canon 5D3 and a Nikon D810, at four focal lengths: 105mm, 50mm, 35mm, and 28mm. In RealSR [5], the images taken at the 105mm focal length are used to generate the HR images, while the images taken at 50mm, 35mm, and 28mm are used to generate the 2X, 3X, and 4X LR images, respectively. For convenience, we only crop the 105mm, 50mm, and 28mm images. The non-horizontal text images are rotated to the most suitable angle for recognition (see Fig. 11).

In Table 16, we briefly show the statistics of the text images in RealSR.

Figure 11: Demonstration of the strategy we used to annotate RealSR.
Text Images in RealSR
Focal Length | 105mm | 50mm | 28mm
Original Image Number | 115 | 115 | 115
Text Box Number | 6048 | 6048 | 6048
Recognition Accuracy (%) | 75.0 | 46.1 | 16.7
Table 16: Detailed information of the text images cropped from the RealSR dataset.

Align. In RealSR, the authors aligned the image pairs with a pixel-wise registration algorithm that takes luminance differences into consideration. In SR-RAW [50], a Euclidean motion model is used in the pre-processing step; during training, a contextual bilateral loss is proposed to handle the misalignment, but it needs a pre-trained model and brings high computational cost. We adopted their pre-processing methods to align the original images and cropped our dataset following our annotation principles, while in training we used the central alignment module instead.

0.d.2 Accuracy by Height.

The sizes of the cropped text boxes are diverse. We can see that, at similar focal lengths, the accuracy on text images from RealSR is much higher than on those from SR-RAW (Tables 15 and 16), mainly because the SR-RAW images are taken from a longer distance. It is therefore suitable to allocate the images cropped from RealSR to the easy subset.

We divided the cropped images by height and found that the accuracy is relatively good once the height reaches 16-32 pixels, as shown in Table 17. Images with heights of 16-32 and 8-16 pixels account for the majority of all the groups. The accuracy on images smaller than 8 pixels is too low to have any value for restoration; these images are hardly recognizable, so we discard images whose height is less than 8 pixels. Heights of (8-16, 16-32) therefore form a good pair for a 2X training set for the STR super-resolution task. For example, a text image taken at a 150mm focal length with a height of 16-32 pixels is taken as the ground truth for its 70mm counterpart. We thus selected all the images whose heights range from 16 to 32 pixels as ground-truth images and up-sampled them to 128×32 (width×height), and the corresponding 2X LR images to 64×16 (width×height).

Recognition Accuracy of Images at Different Heights
Height (pixels) | >128 | 64-128 | 32-64 | 16-32 | 8-16 | 4-8 | 0-4
Number | 1586 | 3957 | 9663 | 14862 | 15434 | 11866 | 5711
Recognition Accuracy (%) | 75.2 | 84.2 | 84.6 | 79.5 | 39.1 | 2.8 | 0.3
Table 17: The recognition accuracy of the text images divided by height.

0.d.3 Statistical information.

(a) Character distribution.
(b) Character number in each image.
(c) Direction of the bounding boxes. For better display, the ordinate shows the logarithm of the counts.
(d) Distribution of text contents in TextZoom.
Figure 12: Statistical information of TextZoom.

We display some useful statistical information in Fig. 12. (a) Our dataset contains abundant characters and digits, including some punctuation. (b) Most words are 1-8 characters long. (c) There are many randomly placed boxes and books in the original images, so we count the direction types of the bounding boxes we annotated. 'Horizontal' means the text image is horizontally placed and easy to read. 'Vertical(+)' denotes a vertical text image that should be rotated clockwise by 90 degrees, while 'Vertical(-)' denotes one that should be rotated anti-clockwise by 90 degrees. 'Top-down' denotes a text image that should be rotated by 180 degrees for the best recognition. 'Curve' denotes a curved text image. 'Ignored' means the text is invalid (not digits, English letters, or punctuation). (d) Using the generic 90k-word lexicon of common words from ICDAR 2015 [21], we find that 57.5% of the text contents are common English words. 'Plate' includes car license plates, door number plates, and street signs; these are combinations of digits, punctuation, and letters, and account for 12% of the text because there are many street views in the original images. Uncommon words account for 18.2% of all the texts; these are mainly rare words, phrases, or compound words. Other meaningless strings, such as punctuation, single letters, and digits, account for the rest.

0.d.4 Task Analysis.

Our dataset is challenging mainly for two reasons: misalignment and ambiguity. Misalignment is unavoidable during data capture when the lens zooms in and out; any slight camera movement can cause a shift of tens of pixels, especially at short focal lengths, and the pre-processing procedure cannot totally eliminate it. We display some example images in Fig. 13.

From Fig. 13, we can see that the misalignment varies and no specific pattern can be found, since we do not have pixel-level annotation of the word locations. The three subsets are allocated appropriately by difficulty: the misalignment and ambiguity become more severe as the difficulty increases. Note that the characters in the HR images tend to be located nearer the center than those in the LR images; this is mainly because, when annotating the HR images, we deliberately kept the text boxes at the centre of the images.

(a) Example images of easy subset.
(b) Example images of medium subset.
(c) Example images of hard subset.
Figure 13: Demonstration of the images in TextZoom. The misalignment and ambiguity become more severe as the difficulty increases.

References

  • [1] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee (2019) What is wrong with scene text recognition model comparisons? dataset and model analysis. arXiv preprint arXiv:1904.01906. Cited by: §1.
  • [2] M. Bevilacqua, A. Roumy, C. Guillemot, and M. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, Cited by: §2.
  • [3] T. Björklund, A. Fiandrotti, M. Annarumma, G. Francini, and E. Magli (2019) Robust license plate recognition using neural networks trained on synthetic images. Pattern Recognition. Cited by: §1.
  • [4] F. L. Bookstein (1989) Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Appendix 0.C.
  • [5] J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang (2019) Toward real-world single image super-resolution: a new benchmark and a new model. In ICCV, Cited by: §0.D.1, §1, §2, §3, §3, §3.
  • [6] C. Chen, Z. Xiong, X. Tian, Z. Zha, and F. Wu (2019) Camera lens super-resolution. In CVPR, Cited by: §2.
  • [7] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou (2017) Focusing attention: towards accurate text recognition in natural images. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: §1, §2, §3.
  • [8] C. Dong, C. C. Loy, K. He, and X. Tang (2015) Image super-resolution using deep convolutional networks. TPAMI. Cited by: Table 11, §1, §2, §2, §5.5, Table 4.
  • [9] C. Dong, X. Zhu, Y. Deng, C. C. Loy, and Y. Qiao (2015) Boosting optical character recognition: a super-resolution approach. arXiv preprint arXiv:1506.02211. Cited by: §2.
  • [10] V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.) (2018) ECCV. Cited by: §3.
  • [11] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. Int. Conf. Mach. Learn., Cited by: §2.
  • [12] A. Gupta, A. Vedaldi, and A. Zisserman (2016) Synthetic data for text localisation in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: 3rd item, Table 5.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.1.
  • [14] P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang (2016) Reading scene text in deep convolutional sequences. In AAAI, Cited by: §4.2.
  • [15] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In CVPR, Cited by: §2.
  • [16] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227. Cited by: 3rd item, Table 5.
  • [17] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman (2016) Reading text in the wild with convolutional neural networks. IJCV. Cited by: §2.
  • [18] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Proc. Adv. Neural Inf. Process. Syst., Cited by: Appendix 0.C, §2, §4.3.
  • [19] M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Deep features for text spotting. In Proc. Eur. Conf. Comp. Vis., Cited by: §2.
  • [20] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In Proc. Eur. Conf. Comp. Vis., Cited by: Table 4.
  • [21] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In Proc. IEEE Int. Conf. Doc. Anal. and Recogn., Cited by: §0.D.3, §5.1.
  • [22] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras (2013) ICDAR 2013 robust reading competition. In Proc. IEEE Int. Conf. Doc. Anal. and Recogn., Cited by: §5.1.
  • [23] J. Kim, J. K. Lee, and K. M. Lee (2016) Accurate image super-resolution using very deep convolutional networks. In CVPR, Cited by: Table 11, §1, §2, §5.5, Table 4.
  • [24] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, Cited by: Table 11, §1, §2, §5.3, §5.4, §5.5, Table 4.
  • [25] L. Leal-Taixé and S. Roth (Eds.) (2019) Computer vision - ECCV 2018 workshops - munich, germany, september 8-14, 2018, proceedings, part V. Lecture Notes in Computer Science. Cited by: Table 11, §1, §2, §3, §5.5, Table 4.
  • [26] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §0.B.3, Table 11, §1, §2, §3, §4.1, §4.2, §4.4, §5.3, §5.4, §5.4, §5.5, Table 4.
  • [27] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In CVPR, Cited by: Table 11, §1, §2, §5.5, Table 4.
  • [28] W. Liu, C. Chen, K. K. Wong, Z. Su, and J. Han (2016) STAR-net: a spatial attention residue network for scene text recognition.. In Proc. Brit. Mach. Vis. Conf., Cited by: §1, §2.
  • [29] Z. Liu, Y. Li, F. Ren, W. L. Goh, and H. Yu (2018) Squeezedtext: a real-time scene text recognition by binary convolutional encoder-decoder network. In Proc. AAAI Conf. on Arti. Intel., Cited by: §1, §2.
  • [30] S. Long, X. He, and C. Yao (2018) Scene text detection and recognition: the deep learning era. arXiv preprint arXiv:1811.04256. Cited by: §1.
  • [31] C. Luo, L. Jin, and Z. Sun (2019) Moran: a multi-object rectified attention network for scene text recognition. Pattern Recognition. Cited by: §0.A.2, Table 6, Figure 2, §1, §2, §2, §3, §5.5, Table 2, Table 4.
  • [32] C. Mancas-Thillou and M. Mirmehdi (2007) An introduction to super-resolution text. In Digital document processing, Cited by: §2.
  • [33] D. R. Martin, C. C. Fowlkes, D. Tal, and J. Malik (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, Cited by: §2.
  • [34] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa (2017) Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications. Cited by: §2.
  • [35] R. Mechrez, I. Talmi, and L. Zelnik-Manor (2018) The contextual loss for image transformation with non-aligned data. In ECCV, Cited by: §0.C.3.
  • [36] R. K. Pandey, K. Vignesh, A. Ramakrishnan, et al. (2018) Binary document image super resolution for improved readability and ocr performance. arXiv preprint arXiv:1812.02475. Cited by: §2.
  • [37] C. Peyrard, M. Baccouche, F. Mamalet, and C. Garcia (2015) ICDAR2015 competition on text image super-resolution. In ICDAR, Cited by: §2.
  • [38] A. Ray, M. Sharma, A. Upadhyay, M. Makwana, S. Chaudhury, A. Trivedi, A. P. Singh, and A. K. Saini (2019) An end-to-end trainable framework for joint optimization of document enhancement and recognition. In ICDAR,, Cited by: §1.
  • [39] J. Sánchez, V. Romero, A. H. Toselli, M. Villegas, and E. Vidal (2019) A set of benchmarks for handwritten text recognition on historical documents. Pattern Recognition. Cited by: §1.
  • [40] B. Shi, X. Bai, and C. Yao (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: Table 6, Figure 2, §2, §2, §3, §4.2, §5.5, Table 2, Table 4.
  • [41] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai (2018) Aster: an attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: 1st item, 2nd item, 3rd item, §0.A.2, Table 6, Figure 9, §0.D.1, Figure 2, Table 1, §1, §2, §2, §3, Figure 6, §5.4, §5.5, Table 2, Table 3, Table 4.
  • [42] J. Sun, J. Sun, Z. Xu, and H. Shum (2011) Gradient profile prior and its applications in image super-resolution and enhancement. TIP. Cited by: §4.4.
  • [43] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao (2016) Detecting text in natural image with connectionist text proposal network. In ECCV, Cited by: §4.2.
  • [44] R. Timofte, E. Agustsson, L. Van Gool, M. Yang, and L. Zhang (2017) Ntire 2017 challenge on single image super-resolution: methods and results. In CVPRW, Cited by: §2.
  • [45] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao (2019) Shape robust text detection with progressive scale expansion network. In CVPR, Cited by: §1.
  • [46] W. Wang, E. Xie, X. Song, Y. Zang, W. Wang, T. Lu, G. Yu, and C. Shen (2019) Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. Cited by: §1.
  • [47] Y. Wu, F. Yin, and C. Liu (2017) Improving handwritten chinese text recognition using neural network language models and convolutional neural network shape models. Pattern Recognition. Cited by: §1.
  • [48] E. Xie, Y. Zang, S. Shao, G. Yu, C. Yao, and G. Li (2019) Scene text detection with supervised pyramid context network. In AAAI, Cited by: §1.
  • [49] R. Zeyde, M. Elad, and M. Protter (2010) On single image scale-up using sparse-representations. In International conference on curves and surfaces, Cited by: §2.
  • [50] X. Zhang, Q. Chen, R. Ng, and V. Koltun (2019) Zoom to learn, learn to zoom. In CVPR, Cited by: §0.C.3, §0.D.1, §0.D.1, §1, §2, §3, §3, §3.
  • [51] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In CVPR, Cited by: Table 11, §1, §2, §5.5, Table 4.