Script identification aims to predict the script of a given text image, and it plays an increasingly important role in multilingual systems. In many cases, it acts as a prerequisite that decides which language model to use for subsequent text detection or recognition.
Script identification on documents or videos, where texts have regular layouts and simple backgrounds, has achieved great performance. However, scene text script identification, which extends the application to more fields such as scene understanding, brings additional challenges, such as complex backgrounds, various text styles and diverse noise. Our work focuses on scene text and takes on the following challenges:
Some scripts have relatively subtle differences, e.g., Russian and English, which share a large set of characters. Distinguishing them is exactly a fine-grained classification problem requiring discriminative features.
Cropped text images have arbitrary aspect ratios, making it necessary to find an effective way to feed them into the model in the batch-based training phase.
The first challenge is crucial in script identification, where the bottleneck mainly comes from scripts of the same family sharing common characters. Hence, local discriminative features always receive much attention. Almost all existing works focus on collecting critical features without suppressing the redundant features that act as noise. Some works [27, 8, 30] adopted clustering on deep convolutional features to obtain critical descriptors, which requires a multi-stage training process and heavy computation due to clustering. Inspired by the Siamese network [3], Gomez et al. [9] proposed an improved patch-based method containing an ensemble of identical nets to learn discriminative stroke-part representations. Mei et al. [16] adopted Convolutional Recurrent Neural Networks to extract the image representation and spatial dependency, which are discriminative in spite of shared characters. Fujii et al. [6] used an Encoder and a Summarizer to get local features and fuse them into a single summary by an attention mechanism [1] that reflects the importance of different patches. Ankan et al. [2] proposed an attention-based Convolutional-LSTM network that analyzes features both globally and locally, an approach that is popular in fine-grained classification [29, 33].
Despite great progress, the works above take the features of all patches into account and therefore suffer from a fatal issue: the dominance of the discriminative features can be reduced by other weakly discriminative features. In particular, a text line of a specific script can consist of many characters that belong to the intersection of several scripts, making a model prone to suffer from redundant features. As shown in Figure 1, the text line on the left, consisting only of shared characters, can be either Chinese or Japanese. However, the one on the right, with only one character added, is definitely Japanese, which shows the great power of the discriminative features. Existing works cannot make good use of such features. For example, if we average all the character patches with weights, the power of the discriminative patches will be reduced by the much larger number of shared-character patches. A similar case is in Figure 1, where the few Russian-specific characters on the right are critical.
The discriminative part is expected to be dominant even when it is small in quantity. Here we propose the Patch Aggregator (PA) to learn and aggregate the local features. PA makes patch-level predictions as an explicit representation, from which we can know which scripts the patches of a given image could belong to. After that, by simply max-pooling the predicted probability distributions, the relation between the input image and every script becomes obvious. This is a low-dimensional but important discriminative feature representation. Based on it, a simple linear classifier makes a local-level prediction for the whole image. For example, the right image in Figure 1 contains patches belonging to two scripts, i.e., Chinese and Japanese, which form the low-dimensional discriminative features. PA will predict that the image is more likely to be Japanese if both Chinese and Japanese-specific characters occur in it. But when no Japanese-specific character occurs, as on the left of Figure 1, PA will infer it as Chinese. This process can be learned well in the training stage.
As for the problem of arbitrary aspect ratios, recent methods with good performance take densely cropped, fixed-size image patches as input [8, 9, 30, 2]. They also employ some data augmentation, but they suffer from the following three issues. Firstly, a cropped image patch may bring noise caused by abrupt truncation, and the feature extractor cannot see the surroundings that fall into other patches, which limits the feature representation because the holistic context is lost. Secondly, the heavy redundancy of overlapping patches leads to much repeated computation, lowering efficiency in the test phase. Thirdly, samples with larger aspect ratios in some scripts produce more cropped patches, which may cause great data imbalance and disturb the training to some degree. Hence, we prefer full-size images to cropped patches as input. Shi et al. [25] designed a spatially-sensitive pooling layer that pools horizontally on the intermediate feature map so that the width of the input image can be flexible. We also adopt a pooling strategy to solve the problem, but our pooling process intends to keep more useful information and to be more interpretable.
In this work, we employ an end-to-end CNN-based method consisting of a standard CNN classifier called Global Squeezer (GS) and a PA module, as shown in Figure 2. In the training phase, we design a novel loss called softermax loss to put the patch-level predictions in PA under the weak supervision of the ground truth label, since the label of the whole text image sometimes cannot imply the exact classes of the patches because some scripts share characters. All other predictions are supervised by softmax loss. In brief, the main contributions of this paper are as follows:
We propose PA to aggregate patch-level predictions to learn a discriminative representation, which has high interpretability. PA along with GS can process images with arbitrary aspect ratios in a simple but effective way.
We design softermax loss to accomplish patch-level weak supervision on local predictions with image-level label.
Our proposed method can perform script identification simply and effectively. Convolutional operation is directly imposed on a full-size image instead of cropped patches.
The patch mentioned here is a single pixel of a specific deep feature map with a properly sized receptive field. A shared convolutional structure acts as a basic feature extractor in the framework, followed by two modules called Global Squeezer (GS) and Patch Aggregator (PA) respectively. GS aims to squeeze the holistic representation, while PA makes predictions over local features and aggregates them by inference, which makes full use of the discriminative features. Finally, we fuse the two outputs dynamically in a learnable way. The entire network can be trained end-to-end in one stage.
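Since a patch is defined here as one pixel of a deep feature map, its extent in the input image is the receptive field of that pixel. The following is a minimal sketch of the standard receptive-field recurrence for a stack of convolution and pooling layers. The layer list is purely hypothetical and is not the configuration used in this paper; padding is ignored because it does not change the receptive-field size.

```python
from typing import List, Tuple


def receptive_field(layers: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (receptive field, jump) of one output pixel.

    `layers` lists (kernel_size, stride) pairs from input to output.
    `jump` is the distance in input pixels between neighbouring output pixels.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) jumps
        jump *= stride             # strides compound multiplicatively
    return rf, jump


# Hypothetical stack: two 3x3 convs, one 2x2 pooling with stride 2, one 3x3 conv.
example_layers = [(3, 1), (3, 1), (2, 2), (3, 1)]
rf, jump = receptive_field(example_layers)
print(rf, jump)
```

With this hypothetical stack, each feature-map pixel covers a 10x10 input patch, and neighbouring patches are 2 input pixels apart, so the patches overlap heavily without any explicit cropping.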
II-A Global Squeezer
Once we obtain the basic features from the shared convolutional structure, Global Squeezer (GS), a common classifier, makes a global prediction. First, a tiny convolutional structure produces the feature map $X_g$, which meets the demands of global squeezing in terms of receptive field and dimension. Subsequently, we squeeze $X_g$ (of size $H \times W \times C$) across the spatial dimensions by Global Average Pooling (GAP) to get a channel-wise global descriptor $z \in \mathbb{R}^{C}$, where $C$ is the number of channels. This can be described as Eq.1.

$$z_c = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_g^{c}(i, j), \quad c = 1, \dots, C \tag{1}$$

GAP squeezes the holistic feature representation across each channel, since a convolutional feature channel often corresponds to a certain type of visual pattern [31]. Then the holistic representation is fed into a linear classifier to get the global prediction scores $y_g$ over the $N$ classes.
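The squeeze-then-classify step of GS can be illustrated with a small, dependency-free sketch. The feature map layout (`[channel][row][col]`), the function names, and the toy weights below are assumptions made for illustration only; in the actual network these would be tensors and a learned linear layer.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def global_average_pool(feature_map: FeatureMap) -> List[float]:
    """Average every channel over all spatial positions (one value per channel)."""
    descriptor = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        descriptor.append(sum(values) / len(values))
    return descriptor


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Plain linear layer: one output score per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


# Toy feature map with 2 channels of size 2x2 (hypothetical values).
feature_map = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0
    [[0.0, 0.0], [2.0, 2.0]],  # channel 1
]
descriptor = global_average_pool(feature_map)

# Toy classifier mapping the 2-channel descriptor to 3 class scores.
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0]
global_scores = linear(descriptor, weight, bias)
print(descriptor, global_scores)
```

Because the pooling averages over whatever spatial size it receives, the same code handles feature maps of any width, which is what lets GS accept images with arbitrary aspect ratios.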
II-B Patch Aggregator
It is far from sufficient to learn discriminative features by making a prediction from a single global perspective, as GS does. The attention mechanism [1] has been widely applied to discriminative learning. However, it appears less effective for script identification because of the effect of redundant features. Here we specify the novel Patch Aggregator (PA), which learns and employs the discriminative features better.
PA starts with the same tiny convolutional structure as in GS, so that each pixel of the deep feature map has a proper receptive field, which results in precise patch-level scores implemented by convolution. Then the softmax function converts the scores at each position into a probability distribution across the $N$ classes. This step is under a special intermediate supervision during training, discussed in II-D1. The patch-level scores actually act as high-level semantic features from which the discriminative representation can be extracted.
Considering the impairment caused by redundant features, we adopt Global Max Pooling (GMP) to aggregate the prediction scores of patches, picking the most remarkable response per class, which is highly interpretable. The process can be described as Eq.2.

$$v_k = \max_{i,j} \; p_{i,j}^{k}, \quad k = 1, \dots, N \tag{2}$$

where $p_{i,j}^{k}$ is the score of the patch at position $(i, j)$ corresponding to the $k$-th class. After picking out the maximum of $p_{i,j}^{k}$ per class, $v$ reflects the likelihood of the given image belonging to each class, so we can know which scripts the components of the input image could belong to. Then a two-layer linear classifier gets the scores $y_l$ over the $N$ classes from a local perspective.
Visualization of the behaviour in the module is available in III-D.
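To make the aggregation step concrete, the sketch below applies a per-patch softmax followed by per-class max pooling, and contrasts it with plain averaging. The two-class setup (Chinese, Japanese), the raw scores, and the function names are hypothetical; the two-layer classifier that follows GMP in PA is omitted.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one patch's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_max_pool(patch_probs: List[List[float]]) -> List[float]:
    """Per class, keep the largest probability found in any patch."""
    n_classes = len(patch_probs[0])
    return [max(p[k] for p in patch_probs) for k in range(n_classes)]


def mean_pool(patch_probs: List[List[float]]) -> List[float]:
    """Per class, average the probability over all patches (for comparison)."""
    n_classes = len(patch_probs[0])
    return [sum(p[k] for p in patch_probs) / len(patch_probs) for k in range(n_classes)]


# Class 0 = Chinese, class 1 = Japanese (hypothetical raw scores).
shared_patch = [1.0, 0.0]  # shared character, leans slightly towards Chinese
kana_patch = [0.0, 4.0]    # Japanese-specific character
patch_scores = [shared_patch] * 5 + [kana_patch]

patch_probs = [softmax(s) for s in patch_scores]
pooled_max = global_max_pool(patch_probs)
pooled_mean = mean_pool(patch_probs)
print(pooled_max, pooled_mean)
```

In this toy case the averaged vector still favours Chinese because five shared patches outvote the single kana patch, whereas the max-pooled vector keeps the strong Japanese response intact for the subsequent classifier.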
To combine the outputs of the above two modules adaptively, we adopt dynamic weighted fusion. The weight of the global output $y_g$ depends only on $y_g$ itself and is denoted as $w$. The weight of the local output $y_l$ is then the complement $1 - w$. The fusion process is given in Eq.3 and Eq.4.

$$w = \sigma\big(f(y_g;\, W_1, W_2)\big) \tag{3}$$

$$y = w \, y_g + (1 - w) \, y_l \tag{4}$$

Eq.3 shows the mapping process, where $\sigma$ is the sigmoid function and $f$ denotes the linear layers with trainable parameters $W_1$ and $W_2$.
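The fusion can be sketched as follows. This is only an illustration under stated assumptions: the weight is taken to be a scalar, and the mapping from the global output to the weight is modelled as two stacked bias-free linear layers followed by a sigmoid. The text only says that linear layers and a sigmoid are involved, so the exact layer arrangement, the names `w1` and `w2`, and all numbers are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fusion_weight(global_scores: List[float],
                  w1: List[List[float]],
                  w2: List[float]) -> float:
    """Map the global output to a scalar weight in (0, 1).

    Assumed form: hidden = w1 @ global_scores, logit = w2 . hidden.
    """
    hidden = [sum(w * g for w, g in zip(row, global_scores)) for row in w1]
    logit = sum(w * h for w, h in zip(w2, hidden))
    return sigmoid(logit)


def fuse(global_scores: List[float], local_scores: List[float], weight: float) -> List[float]:
    """Convex combination: weight on the global output, complement on the local one."""
    return [weight * g + (1.0 - weight) * l for g, l in zip(global_scores, local_scores)]


global_scores = [2.0, 0.0]
local_scores = [0.0, 2.0]
w1 = [[1.0, -1.0]]  # hypothetical parameters
w2 = [1.0]

weight = fusion_weight(global_scores, w1, w2)
fused = fuse(global_scores, local_scores, weight)
print(weight, fused)
```

Because the two weights sum to one, the fused scores always lie between the global and local scores for each class; the network only learns how far to lean towards either branch for a given input.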
II-D Loss Functions
In the training stage, the proposed network is optimized by four losses, $L_g$, $L_l$, $L_f$ and $L_p$, as shown in Figure 3, to make sure the network works as expected.
GS and PA are supervised by $L_g$ and $L_l$ respectively to make sure that both modules learn well. $L_f$ is devised for the final decisive output, which determines the performance of the model, and therefore holds a relatively higher weight. These three losses all use the softmax loss based on the ground truth labels.
The loss $L_p$ is designed for the intermediate supervision mentioned in Section II-B. Because of the character-sharing issue, the categories of some patches cannot simply be taken from the image-level label, so the image-level label is not sufficient to supervise the patch-level scores if we directly use the softmax loss. Thus we propose the novel softermax loss to deal with the problem.
II-D1 Softermax Loss
Classical softmax loss pushes the model to output a much greater probability on the ground truth (GT) class than on the others. This makes the model excessively confident in the GT class, which is inappropriate for patch-level prediction because some characters are confusable among scripts. To relieve this extreme and fully learn discriminative features at patch level, we make the loss softer for a single patch, which can be formulated as in Eq.5.

$$L_{softer} = -\log \frac{\sum_{j=1}^{k} e^{\hat{x}_j}}{\sum_{i=1}^{N} e^{x_i}} \tag{5}$$

where $x_i$ is the score of the $i$-th category at a specific location obtained by convolution, $N$ is the number of classes, and $\hat{x}_1, \dots, \hat{x}_k$ are the top-$k$ elements of $x$ ($k$ is a hyperparameter). $L_{softer}$ prompts the top-$k$ probabilities to be as great as possible, alleviating the extreme of softmax loss to some extent. However, adopting the softermax loss alone amounts to unsupervised learning, which leaves the model prone to falling into a local optimum.
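Hence we couple the softmax and softermax losses to get a trade-off. The loss for an image is averaged over its patches, as shown in Eq.6,

$$L_p = \frac{1}{M} \sum_{m=1}^{M} \left( L_{softmax}^{(m)} + \lambda \, L_{softer}^{(m)} \right) \tag{6}$$

where $M$ is the number of patches, $\lambda$ determines how soft Eq.6 is, $L_{softer}^{(m)}$ is the softermax loss of Eq.5 on the $m$-th patch, and $L_{softmax}^{(m)}$ is the softmax loss on that patch supervised simply by the label of the input image.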
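A small sketch may help to see how the softermax idea behaves. It assumes that the softermax loss is the negative log of the summed top-k softmax probabilities of a patch, and that the patch-level loss adds it to the ordinary softmax loss with a weight `lam` before averaging over patches. Both forms are read off the description above and are assumptions, as are all function names and numbers.

```python
import math
from typing import List


def log_sum_exp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax_loss(scores: List[float], label: int) -> float:
    """Standard cross-entropy on one patch, using the image-level label."""
    return log_sum_exp(scores) - scores[label]


def softermax_loss(scores: List[float], k: int) -> float:
    """Negative log of the total probability mass of the top-k classes (assumed form)."""
    top_k = sorted(scores, reverse=True)[:k]
    return log_sum_exp(scores) - log_sum_exp(top_k)


def patch_loss(all_patch_scores: List[List[float]], label: int, k: int, lam: float) -> float:
    """Average over patches of softmax loss plus lam times softermax loss (assumed form)."""
    per_patch = [softmax_loss(s, label) + lam * softermax_loss(s, k)
                 for s in all_patch_scores]
    return sum(per_patch) / len(per_patch)


# A patch whose character is shared by classes 0 and 1 (hypothetical scores).
ambiguous = [2.0, 2.0, -5.0]
hard = softmax_loss(ambiguous, label=0)
soft = softermax_loss(ambiguous, k=2)
print(hard, soft)
```

For the ambiguous patch, the softmax loss stays high because half of the probability sits on the other plausible script, while the softermax loss is close to zero because the two plausible scripts together hold almost all of the mass. With `lam = 0` the patch loss reduces to the plain softmax loss.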
TABLE I: Grouping and resizing by aspect ratio on SIW-13.
| Original Aspect Ratio Range | (0, 3) | [3, 6) | [6, 12] | (12, ∞) |
| New Aspect Ratio | 2 | 4 | 8 | 16 |
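During training, the above losses contribute to the total loss $L$ with weights $\alpha_g$, $\alpha_l$, $\alpha_f$ and $\alpha_p$, as shown in Eq.7,

$$L = \alpha_g L_g + \alpha_l L_l + \alpha_f L_f + \alpha_p L_p \tag{7}$$

where $L_g$, $L_l$, $L_f$ and $L_p$ are the losses on the global, local, final and patch-level predictions respectively.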
We conduct experiments on three public datasets for script identification. SIW-13 [23] is officially split into 9,791 training and 6,500 test images of 13 scripts. CVSI2015 [22] was released for the ICDAR 2015 Competition on Video Script Identification and contains text line images of 10 Indian scripts. RRC-MLT2017 [17] was released for the ICDAR 2017 Competition on MLT-Task2 and comprises 68,613 training, 16,255 validation and 97,619 test cropped images. This dataset has an extremely imbalanced distribution among its 7 scripts and is heavily tilted towards Latin. It also contains some multi-oriented and curved texts, which make it more challenging.
III-A Implementation Details
To handle the diverse aspect ratios of the dataset images, we group every image by its aspect ratio and resize it to a fixed size determined by the group it belongs to. The short side of every image is set to 32. We can then train on batches efficiently. The number of groups is determined by the dataset. To be clear, Table I shows the grouping and resizing for SIW-13. For example, an image with an aspect ratio of 3.5 is resized to 32x128, where 32 is the fixed height. The same trick is used on CVSI2015 and RRC-MLT2017.
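The grouping rule of Table I can be written down directly. The sketch below assumes that the aspect ratio is width divided by height and that the height is the short side fixed to 32; the function names and the bucketing helper are illustrative, not part of the original implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def target_size(width: int, height: int, short_side: int = 32) -> Tuple[int, int]:
    """Return (new_height, new_width) following the SIW-13 groups of Table I."""
    ratio = width / height
    if ratio < 3:
        new_ratio = 2    # group (0, 3)
    elif ratio < 6:
        new_ratio = 4    # group [3, 6)
    elif ratio <= 12:
        new_ratio = 8    # group [6, 12]
    else:
        new_ratio = 16   # group (12, inf)
    return short_side, short_side * new_ratio


def bucket_by_size(sizes: List[Tuple[int, int]]) -> Dict[Tuple[int, int], List[int]]:
    """Group image indices by target size so each batch holds same-sized images."""
    buckets: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for index, (width, height) in enumerate(sizes):
        buckets[target_size(width, height)].append(index)
    return dict(buckets)


# (width, height) of four hypothetical cropped text images.
sizes = [(350, 100), (120, 100), (400, 100), (1300, 100)]
buckets = bucket_by_size(sizes)
print(buckets)
```

Images 0 and 2 fall into the same group and can share a batch at 32x128, which is the case used as the example in the text above.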
We also apply data augmentation, including contrast changes, random noise, slight cropping and perspective transforms, to make full use of the training data. Image data is normalized uniformly to a fixed range.
Our basic architecture uses VGG-style convolutional modules with batch normalization [14] and ReLU. More details are shown in Table II for SIW-13 and CVSI2015, where Modules 1-6 are the shared convolutional part, with GS on the left and PA on the right. Note that "1-6" means that the first 6 modules have the same structure shown on the right but with different numbers of filters and parameters. We use Kaiming initialization [11] to initialize the network. As for RRC-MLT2017, we take the convolutional part of VGG16 [26] pre-trained on ImageNet as the backbone because the images are much more complex. The design guarantees a sufficient receptive field for the patch-level prediction scores.
| No. of Module | Configuration |
| 10 | Linear: | Conv kernel: , stride: 1, padding: 0 |

| Script | Zdenek [30] | Mei [16] | Gomez [9] | Bhunia [2] | ours |
In the experiments we have used PyTorch
for deep learning acceleration. During training, hyper parameters for Eq.5, Eq.6 and Eq.7 are: , , [
] = [0.1, 0.1, 1.0, 0.1], which lead to the best accuracy. The batch size is 16. Stochastic gradient descent (SGD) is used for optimization, with momentum and weight decay set to 0.9 and 1e-4 respectively. The learning rate starts at 0.1 and decays by 0.3 if the training loss stops falling for a while. Every time it drops below 8e-5, we reset it to 0.01 and continue training until the preset number of epochs (500 for SIW-13 and CVSI2015, 100 for RRC-MLT2017) is reached. We conduct our experiments on an Nvidia GeForce GTX GPU with 10.9GB memory, one Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz and 64GB RAM. The training time is around 5 hours.
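The learning-rate schedule described above can be sketched as a small state update. The sketch assumes that "decay by 0.3" means multiplying the rate by 0.3, and it uses a simple, hypothetical plateau test with a `patience` window, since the text does not specify how a stalled training loss is detected.

```python
from typing import List


def plateaued(loss_history: List[float], patience: int = 3) -> bool:
    """Hypothetical plateau test: no new minimum within the last `patience` epochs."""
    if len(loss_history) <= patience:
        return False
    return min(loss_history[-patience:]) >= min(loss_history[:-patience])


def next_learning_rate(lr: float, loss_plateaued: bool,
                       decay: float = 0.3, floor: float = 8e-5,
                       restart: float = 0.01) -> float:
    """Decay on plateau; restart at `restart` once the rate falls below `floor`."""
    if loss_plateaued:
        lr *= decay
    if lr < floor:
        lr = restart
    return lr


# Follow the rate through six consecutive plateaus, starting from 0.1.
lr = 0.1
rates = []
for _ in range(6):
    lr = next_learning_rate(lr, loss_plateaued=True)
    rates.append(lr)
print(rates)
```

Starting from 0.1, five decays keep the rate above the 8e-5 floor, and the sixth pushes it below, which triggers the restart at 0.01.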
For the scene text line images in SIW-13, we achieve a great improvement with a good balance among all scripts. The sequence-to-label problem demands a more comprehensive feature representation than sequential dependency alone, which is supported by the comparison between our method and CRNN, which is popular in scene text recognition. Besides, Mei [16] takes much time to predict a text line, which may be caused by the sequential computation in the RNN. Zdenek [30] used BLCT to enhance the discriminative features, but they imposed IDF (inverse document frequency) on codeword occurrences and suffered considerably from the impairment caused by less critical features. The image in Figure 4, which has many more Chinese patches than kana patches, was misclassified as Chinese by their model. Bhunia [2] coupled local and global features, but it suffers from the same impairment. Moreover, the use of many cropped patches causes considerable redundant computation and memory usage, which hurts efficiency, especially in its LSTM module, which precludes parallelization. Our model, built with 26.7 MB of parameters, takes about 2.5 ms per image when tested one image at a time, owing to the efficient matrix computation on a full-size image and the simple pipeline.
For CVSI2015, whose images are video captions with simple backgrounds, our method reaches the best accuracy among the published works. Former works, such as Shi [25] and Nicolaou [18], are usually not able to strike a proper balance between scene text and video captions.
III-C Ablation Study
We conduct ablation studies on SIW-13 to show the power of our proposed PA along with GS and softermax loss.
III-C1 The contribution of the proposed module
Here we examine the contribution of PA by replacing the two-module part (GS and PA) with other modules in turn while keeping the shared feature extractor.
Table V shows the results in detail, where GS means that a single GS module is used without PA, and PA means the converse. GS+GS is an ensemble model in which another GS takes the place of PA in Figure 3. GS+GMP only changes the GAP operation into GMP in one of the modules of GS+GS. GS+PA is exactly the proposed method.
A single module is not enough for a fine-grained classifier to exploit information both globally and locally, as reflected by the results of GS and PA. GS cannot capture fine-grained details well, and PA is prone to being limited to a sub-area. The ensemble model GS+GS only gets a slight improvement over GS, indicating that the strong performance of our proposed method should not be attributed to ensembling. GS+GMP uses GMP to extract the most remarkable responses in 512 dimensions, which can be regarded as a kind of local feature obtained in another way, but each dimension has no explicit meaning and cannot be supervised by a label. It therefore only improves accuracy by 0.3% while holding many more parameters. All of these results highlight our proposed method and show the power of integrating GS and PA.
III-C2 Effect of softermax loss
The proposed softermax loss described in II-D1 is vital for PA in the training stage. We investigate whether the intermediate supervision works and how important the softermax loss is.
Ablation results are shown in Table VI. The intermediate supervision has a great effect on the final accuracy, since it guides the intermediate prediction to carry the explicit meaning we expect. The weight $\lambda$ in Eq.6 determines the influence of the softermax loss. "$\lambda = 0$" in Table VI means that only the softmax loss conducts the supervision. The result shows the significance of the softness brought by the softermax loss.
III-D Visualization Analysis
Insights into the behaviour of our proposed PA can be obtained by visualizing the vectors with $N$ dimensions, which are probability distributions over the classes. Specifically, we observe the patch-level predictions, the vector after GMP and the local prediction from the linear classifier (fc).
As shown in Figure 4, predictions for patches are not forced to an extreme, and the probabilities are spread relatively high over several (here, 3) scripts, which agrees with the fact that a patch alone, regarded as an independent subsample of the input, can actually correspond to several scripts. After GMP, the vector consisting of the most remarkable response per class is a kind of high-level semantic feature that shows which scripts the components of the input could belong to. The local-level prediction is then obtained by further inference, which is simply a linear classifier. Through this procedure we can make full use of fine-grained discriminative features.
We present a simple but effective approach for scene text script identification. Patch Aggregator learns discriminative features whose discriminatory power is not reduced by redundant features. It significantly improves the baseline model, Global Squeezer. The novel softermax loss is designed to provide intermediate supervision on patch-level predictions. Our method achieves the best results on three benchmark datasets, demonstrating its great power.
This research was supported by the National Natural Science Foundation of China (NSFC) grants 61733007 and 61773176. Dr. Xiang Bai was supported by the National Program for Support of Top-notch Young Professionals and the Program for HUST Academic Frontier Youth Team.
[1] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[2] (2019) Script identification in natural scene image and video frames using an attention based convolutional-lstm network. Pattern Recognition 85, pp. 172–184.
[3] (1994) Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems, pp. 737–744.
[4] (2005) Texture for script identification. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (11), pp. 1720–1732.
[5] (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223.
[6] (2017) Sequence-to-label script identification for multilingual ocr. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, Vol. 1, pp. 161–168.
[7] (2005) Script recognition in images with complex backgrounds. In Signal Processing and Information Technology, 2005. Proceedings of the Fifth IEEE International Symposium on, pp. 589–594.
[8] (2016) A fine-grained approach to scene text script identification. In Document Analysis Systems (DAS), 2016 12th IAPR Workshop on, pp. 192–197.
[9] (2017) Improving patch-based scene text script identification with ensembles of conjoined networks. Pattern Recognition 67, pp. 85–96.
[10] (2010) Offline handwritten script identification in document images. Int. J. Comput. Appl 4 (6), pp. 6–10.
[11] (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034.
[12] (1999) Script and language identification for handwritten document images. International Journal on Document Analysis and Recognition 2 (2-3), pp. 45–52.
[13] (1997) Automatic script identification from document images using cluster-based templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (2), pp. 176–181.
[14] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[15] (2007) A generalised framework for script identification. International Journal of Document Analysis and Recognition (IJDAR) 10 (2), pp. 55–68.
[16] (2016) Scene text script identification with convolutional recurrent neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 4053–4058.
[17] (2017) ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, Vol. 1, pp. 1454–1459.
[18] (2016) Visual script and language identification. In Document Analysis Systems (DAS), 2016 12th IAPR Workshop on, pp. 393–398.
[19] (2017) Automatic differentiation in pytorch.
[20] (2018) E2E-mlt-an unconstrained end-to-end method for multi-language scene text. arXiv preprint arXiv:1801.09919.
[21] (2011) Video script identification based on text lines. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pp. 1240–1244.
[22] (2015) ICDAR2015 competition on video script identification (cvsi 2015). In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pp. 1196–1200.
[23] (2016) Script identification in the wild via discriminative convolutional neural network. Pattern Recognition 52, pp. 448–458.
[24] (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39 (11), pp. 2298–2304.
[25] (2015) Automatic script identification in the wild. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pp. 531–535.
[26] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[27] (2012) Unsupervised discovery of mid-level discriminative patches. In Computer Vision–ECCV 2012, pp. 73–86.
[28] (1998) Rotation invariant texture features and their use in automatic script identification. IEEE Transactions on pattern analysis and machine intelligence 20 (7), pp. 751–756.
[29] (2018) Learning a discriminative filter bank within a cnn for fine-grained recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4148–4157.
[30] (2017) Bag of local convolutional triplets for script identification in scene text. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, Vol. 1, pp. 369–375.
[31] (2016) Picking deep filter responses for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1134–1142.
[32] (2012) New spatial-gradient-features for video script identification. In Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on, pp. 38–42.
[33] (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In Int. Conf. on Computer Vision, Vol. 6.