Visual Script and Language Identification

01/08/2016 ∙ by Anguelos Nicolaou, et al. ∙ Universitat Autònoma de Barcelona ∙ UNIFI

In this paper we introduce a script identification method based on hand-crafted texture features and an artificial neural network. The proposed pipeline achieves near state-of-the-art performance for script identification of video-text and state-of-the-art performance on visual language identification of handwritten text. More than using the deep network as a classifier, the use of its intermediary activations as a learned metric demonstrates remarkable results and allows the use of discriminative models on unknown classes. Comparative experiments in video-text and text in the wild datasets provide insights on the internals of the proposed deep network.

I Introduction

As document analysis systems are evolving, their multi-lingual capabilities are becoming more important. Script identification is a key element in multilingual system pipelines. Other than performance in detecting the script and the language of text in such pipelines, the position this step occupies in the pipeline dictates whether it will assist or be assisted by other steps in the pipeline.

In this paper we address the problem of script or language identification in several modalities such as video-text, scene-text, or handwritten text, and introduce a method consisting of hand-crafted features and a fully connected deep neural network. We demonstrate that k-NN classification over the features obtained from the first layer of the deep neural network equals or outperforms the deep network classification. The principal contributions of this paper are: the introduction of a method that uses a deep neural network on top of hand-crafted features for script identification, a method to perform a purely visual identification of language, even for languages sharing the same script, and the use of the activations of the employed neural network as a learned metric in order to generate more adaptable classifiers.

II Background

II-A Script Identification

Script detection has been an open problem for several decades. In this paper, script identification refers to identifying the system of writing, i.e. the alphabet used in a sample, while language identification refers to identifying the language of a given text sample. This definition produces ambiguities in some cases, yet from a pattern recognition perspective the two notions are very different. Script identification implies focusing on detecting symbols, while language identification implies detecting some specific auxiliary symbols, such as diacritics, and an underlying language model. Several variations of the problem exist depending on aspects such as the granularity of the data samples, the number of scripts among which the systems classify, and the modality of the textual data, i.e. whether it is printed text, handwritten text, scene text etc. For a detailed overview of script identification before 2009, we refer to [1], which provides a thorough taxonomy of methods available up to that time. In 2009 Unnikrishnan and Smith [2] demonstrated that for simple cases of binarized printed text, the problem can be considered solved by a method developed for the Tesseract OCR engine. Zhu et al. [3] used codebooks generated from printed and handwritten data in order to perform handwritten language identification. Ferrer et al. [4] used simple LBP histograms pooled horizontally to perform script identification. More recently, Long Short Term Memory (LSTM) networks have been used by Ul-Hasan et al. [5] for separating scripts in multilingual text at the granularity of characters. Mioulet et al. [6] also used a bidirectional variant of LSTM networks with a cascade of script detection, OCR and language models, in order to infer the language even when two languages share the same script. The ICDAR2015 Competition on Video Script Identification (CVSI) [7] posed the problem of script identification over superimposed text in videos; the four best participating methods all used Convolutional Neural Networks (CNN). Shi et al. [8] have also used a deep CNN to address the problem of script detection in the wild. The CNN approach has the drawback of needing vast computational resources as well as large amounts of annotated data. Depending on the granularity of samples, i.e. character based, word based, text-line based, or paragraph based, as well as the modality of text, whether it be scene-text, printed documents, or handwritten texts, script identification can be seen as many problems rather than one. From this perspective, the problem addressed in this paper, script identification at word-level granularity on scene-text, is a challenging one that is starting to gain momentum.

II-B Text as Texture

This paper builds on previous work that introduced the Sparse Radial Sampling (SRS) variant of the Local Binary Pattern (LBP) [9]. Classifying text regions by global texture descriptors is a strategy that proved effective in the task of writer identification; here it is applied to a different problem and in the context of supervised learning. Analysing text as texture, apart from yielding good results, has the benefit of providing this information at an early stage in a pipeline and making it available to the following stages of text recognition.

III Method

The proposed method consists of a preprocessing step, followed by LBP feature extraction, and training an Artificial Neural Network (ANN) on these features. The intermediary layers of the ANN are then used as a generative model to perform classification.

III-A Preprocessing

Figure 1: Preprocessing and SRS-LBP transform for radius of 3. Data taken from SIW-10 dataset [10]. Columns (a)-(d) show four samples; rows show 1) the original image, 2) the preprocessed image, 3) the SRS-LBP transform of the input images and 4) the SRS-LBP transform of the preprocessed images.

Before passing images to the LBP transform, each image is preprocessed independently. Since the LBP transform is applied on a single-channel image, the principal component of all pixel colors is used instead of luminance, in order to enhance the perceptual differences that are not attributed to luminance. In order to have a consistent LBP encoding between images with a light foreground on a dark background and images with a dark foreground on a light background, the image intensities are inverted whenever the central band is darker than the image average. The assumption is that more foreground pixels will exist in the central band between 25% and 75% of the image width. In Fig. 1, rows 1) and 2) demonstrate some examples of the preprocessing. When using global pooling with local structure features such as the LBP or Histogram of Oriented Gradients (HOG), inverting the image has an effect equivalent to flipping it across both axes. This means that giving all samples the same foreground-background polarity is as important as enforcing all samples to be properly oriented. In column (b) of Fig. 1 the effect the inversion has on the LBP transform can be seen.
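The two preprocessing steps can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the rescaling to [0, 1], and the exact band boundaries are choices made for the sketch. Note that the sign of a principal component is arbitrary, which is one more reason the polarity normalisation is needed afterwards.

```python
import numpy as np

def principal_component_channel(rgb):
    """Project every pixel colour onto the first principal component of the image's colours."""
    pixels = rgb.reshape(-1, 3).astype(np.float64)
    centered = pixels - pixels.mean(axis=0)
    # Eigen-decomposition of the 3x3 colour covariance matrix.
    cov = centered.T @ centered / max(len(pixels) - 1, 1)
    _, eigvecs = np.linalg.eigh(cov)
    direction = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order
    projected = centered @ direction
    # Rescale to [0, 1] (an assumption of this sketch) to obtain a single-channel image.
    span = projected.max() - projected.min()
    if span == 0:
        return np.zeros(rgb.shape[:2])
    return ((projected - projected.min()) / span).reshape(rgb.shape[:2])

def normalise_polarity(gray, lo=0.25, hi=0.75):
    """Invert intensities when the central band is darker than the image average."""
    width = gray.shape[1]
    band = gray[:, int(lo * width):int(hi * width)]
    if band.mean() < gray.mean():
        return gray.max() + gray.min() - gray  # intensity inversion
    return gray

# Dark text block on a light background.
rgb = np.full((10, 40, 3), 255, dtype=np.uint8)
rgb[3:7, 12:28] = 0
out = normalise_polarity(principal_component_channel(rgb))
```

After both steps the text block in the example is brighter than its background, regardless of which sign the principal component happened to take.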

III-B Local Binary Patterns

For feature extraction the SRS-LBP variant of LBP histogram features is employed. Briefly, the SRS-LBP embeds a clustering of the center-neighbourhood differences using Otsu's method [11]; it also samples each radius separately to obtain a multi-radius feature representation with linear complexity. LBP have several advantages for script identification: they exploit the bi-level nature of textual images, they are very fast to compute, and they are pooled over regions, which makes them segmentation-free and an inherently global descriptor of an image region. In [4] Ferrer et al. extracted LBP features from text-lines by concatenating histograms of 4 horizontal stripes. In [10] Shi et al. used a deep convolutional network that employs horizontal pooling to discard spatial information along the horizontal direction.

In the same spirit, and assuming images are either cropped words or cropped lines, the SRS-LBP were extracted for 3 regions in the images: the upper half of the image, the central half of the image, and the lower half of the image. The dimensionality of the extracted feature set is the product of the histogram size, the number of radii and the number of pooling zones: $256 \times 12 \times 3 = 9{,}216$. In Fig. 1, column (d) shows a rare example where the SRS-LBP is fooled; this happens because the SRS-LBP assumes that the most significant contrast in an image with text will be the contrast related to foreground-background transitions.
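The zone pooling can be illustrated with a short sketch. It uses a plain 8-neighbour LBP sampled at integer offsets as a stand-in, not the SRS-LBP variant of [9]; the function names and the choice of radii 1 to 12 are assumptions, picked so that the output length matches the 9,216 input features of the network described in the next section.

```python
import numpy as np

def lbp_codes(gray, radius=1):
    """Plain 8-neighbour LBP codes (a stand-in, not the SRS-LBP variant)."""
    g = np.asarray(gray, dtype=np.float64)
    h, w = g.shape
    r = radius
    center = g[r:h - r, r:w - r]
    offsets = [(-r, -r), (-r, 0), (-r, r), (0, r), (r, r), (r, 0), (r, -r), (0, -r)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[r + dy:h - r + dy, r + dx:w - r + dx]
        codes |= (neighbour >= center).astype(np.int64) << bit
    return codes

def zone_histograms(codes, bins=256):
    """Normalised histograms over the upper, central and lower half of the code map."""
    h = codes.shape[0]
    zones = [codes[:h // 2], codes[h // 4:h // 4 + h // 2], codes[h - h // 2:]]
    hists = []
    for zone in zones:
        hist = np.bincount(zone.ravel(), minlength=bins).astype(np.float64)
        hists.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(hists)

def describe(gray, radii=range(1, 13)):
    """Concatenate the zone histograms of every radius into one feature vector."""
    return np.concatenate([zone_histograms(lbp_codes(gray, r)) for r in radii])

image = np.random.default_rng(0).random((64, 128))
features = describe(image)  # 12 radii x 3 zones x 256 bins
```

Because every histogram is pooled over a whole zone, the feature length does not depend on the width or height of the word image.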

III-C Classification

The remainder of the original SRS-LBP pipeline [9] is an unsupervised learning approach, designed for very different numbers of classes and of samples per class. Here, a deep Multi Layer Perceptron (MLP) is used instead to classify the feature representation into a given and limited set of languages. The network consists of 3 fully connected layers plus the input layer. The first layer maps the 9,216 features to a dimensionality of 1,024, the second layer maps the data from 1,024 to 512 dimensions, and the third layer maps the data to as many neurons as the number of classes. The output layer can be interpreted as the probability of the presented data belonging to each class. The activations for the layers are, respectively, […], […], and the logistic function. In Fig. 2 a visual representation of the architecture can be seen. It should be pointed out that the number of parameters of the network varies depending on the dimensionality of the feature vectors as well as the number of classes in each dataset used. In the case of classifying word images into 10 classes, the model has 9,968,138 parameters.
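The parameter count follows directly from the layer widths, counting one weight per connection and one bias per neuron. The helper below is a hypothetical illustration of that arithmetic, not code from the paper.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 9,216 inputs -> 1,024 -> 512 -> 10 classes
ten_class_model = mlp_parameter_count([9216, 1024, 512, 10])
print(ten_class_model)  # 9968138
```

Almost all of the parameters (9,438,208 of them) sit in the first layer, so changing the number of classes only changes the total marginally.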

For training, Stochastic Gradient Descent (SGD) is used with categorical cross-entropy as the loss function. Drop-out regularisers of 0.5 are used on each layer [13]. The batch-size is set to be proportional to the number of samples per class, but no less than 32. All experiments used the Keras [12] framework.

Figure 2: Architecture of the Proposed Neural Network

III-D MLP as Metric Learning

While the discriminative deep MLP performs well, it is quite restricted by the need for computational resources for training. Moreover, deep networks require datasets of substantial size with all classes represented in a balanced way. The alternative is to use metric learning techniques, which have drawbacks of their own. Metric learning methods tend to have quadratic and even cubic complexity with respect to feature dimensionality, so an intermediary dimensionality reduction technique must also be used. The idea of using neural networks dedicated to metric learning is best exemplified by the Siamese network architecture [14]. While Siamese networks have all the benefits of metric learning, in typical classification tasks, such as digit image classification, their results are lower than those of state-of-the-art classifiers [15]. On the other hand, intermediary activations of CNN are being used as generic feature extractors which are then classified with off-the-shelf classifiers such as Support Vector Machines (SVM) [16]. The work presented in this paper is greatly influenced by the principal idea in [16] of using intermediary activations of deep CNN as generic features for standard classifiers in tasks other than what the original CNN was trained for. Established CNN architectures do not directly preserve the aspect ratio of samples. In the case of word samples, this means that the same letters could have a different representation if they appeared in words of different size. There is no straightforward solution to this problem; for example, the winning method of the CVSI competition addressed it by sliding a window of a fixed aspect ratio over each word and selecting the window with the highest activation [7]. The drawback of such an approach is that all information outside the maximal activation window is ignored. We propose the use of hand-crafted features that can address the aspect ratio problem. Specifically, we use the SRS-LBP histograms as inputs to a deep MLP, since the pooling mechanism of the LBP histograms is independent of the aspect ratio. Building on the idea of [16], the activations of the early layers in the MLP are used as input to a Nearest Neighbour classifier. Depending on the dataset, lower levels of the proposed MLP can match and even exceed the performance of its output layer. At the same time, networks used with this strategy are not limited to classes available during training. In Fig. 3 the error rates of all layers during training of the proposed MLP on the CVSI2015 dataset can be seen.
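Using a hidden layer as a learned metric amounts to mapping both labelled and query samples through the trained layer and running nearest neighbour search in that space. The sketch below shows the mechanics only; the ReLU activation, the Euclidean distance and the function names are assumptions, since the text specifies neither the hidden activations nor the distance used.

```python
import numpy as np

def first_layer_activations(x, weights, bias):
    """Activations of the first hidden layer; ReLU is an assumption of this sketch."""
    return np.maximum(x @ weights + bias, 0.0)

def nearest_neighbour_predict(gallery, gallery_labels, queries):
    """1-NN with Euclidean distance in the activation space."""
    # Squared distances via |q - g|^2 = |q|^2 - 2 q.g + |g|^2.
    d2 = ((queries ** 2).sum(axis=1)[:, None]
          - 2.0 * queries @ gallery.T
          + (gallery ** 2).sum(axis=1)[None, :])
    return gallery_labels[np.argmin(d2, axis=1)]

# Toy stand-ins for trained weights and SRS-LBP features.
weights, bias = np.eye(4), np.zeros(4)
labelled = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
labels = np.array(["script_a", "script_b"])
queries = np.array([[0.9, 0.1, 0.0, 0.0], [0.2, 0.8, 0.0, 0.0]])

predicted = nearest_neighbour_predict(
    first_layer_activations(labelled, weights, bias),
    labels,
    first_layer_activations(queries, weights, bias),
)
```

The labelled gallery is independent of the data used to train the weights, which is why classes that were unseen during training can still be assigned.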

Figure 3: Training of the MLP.

IV Experiments

In the experimental section we present experiments on script identification and language identification that demonstrate the potential of the proposed approach. Additional experimental resources are available at http://nicolaou.homouniversalis.org/2016/01/07/visual_script.html.

IV-A Video-text Script Identification

Language   C-DAC   HUST    CVC     Google  CUK     Layer 1 (1-NN)  Layer 2 (1-NN)  Layer 3
Arabic     97.69   100.0   99.67   100.0   89.44   98.7            98.4            98.4
Bengali    91.61   95.81   92.58   99.35   68.71   99.6            99.3            99.6
English    68.33   93.55   88.86   97.95   65.69   98.7            98.4            97.2
Gujrathi   88.99   97.55   98.17   98.17   73.39   98.8            95.3            97.2
Hindi      71.47   96.31   96.01   99.08   61.66   100.0           99.6            99.6
Kannada    68.47   92.68   97.13   97.77   71.66   91.0            87.5            90.4
Oriya      88.04   98.47   98.16   98.47   79.14   99.6            99.6            99.6
Punjabi    90.51   97.15   96.52   99.38   82.55   98.1            97.8            97.8
Tamil      91.90   97.82   99.69   99.37   82.55   98.4            98.1            98.1
Telugu     91.33   97.83   93.80   99.69   57.89   98.4            98.1            100.0
Average    84.66   96.69   96.00   98.91   74.06   98.18           97.26           97.9
Table I: Accuracy % on the CVSI Video-text Dataset

The principal experiment, which demonstrates near state-of-the-art performance, is a comparison to the methods participating in the CVSI 2015 Video Script Identification competition [7]. The dataset contains 10 languages commonly used in India: Arabic, Bengali, English, Gujrathi, Hindi, Kannada, Oriya, Punjabi, Tamil, Telugu. The dataset consists of cropped images containing a single word each. Most words appear to come from overlaid text, but there are also images that appear to be scene-text. The dataset comes partitioned into a train-set, a test-set, a validation set, and a small sample-set. For the experiments the test-set was isolated and used only for testing; the remaining data were mixed and partitioned randomly for training during the tuning of the proposed deep MLP architecture. While the competition defines four tasks that are related to different use cases specific to India, such as discriminating between languages occurring in the same regions, in our experiments we only address Task-4, classifying all 10 scripts, as it is the most generic task. In Table I the performance per script of every participant in the competition can be seen along with the accuracy achieved by each layer. All layers of the proposed deep MLP rank on average second to the method submitted by Google. While the method of Google, the state-of-the-art, obtains 98.9% using a CNN, k-NN on the first layer of the deep MLP obtains 98.2%, while layer 2 obtains 97.3% and the output layer obtains 97.9%. In Fig. 4 the confusion matrices between languages for k-NN on the first layer, as well as for the output layer, can be seen. What stands out is the non-symmetric misclassification of 7% English samples as Kannada; all other confusions could be considered negligible. It can also be observed that layer 1 and layer 2 demonstrate some consistency.

Figure 4: Confusion matrices for the CVSI dataset. Accuracy of the Nearest Neighbor for the activations of the first layer (a) and the third layer (b)

IV-B Scene-text Script Identification

While the method was developed for video-text script identification, we also ran experiments on how it would perform on script detection in the wild. We used the SIW dataset, of which two variants are publicly available. SIW-10 [10] contains cropped word images in Arabic, Chinese, English, Greek, Hebrew, Japanese, Korean, Russian, Thai, and Tibetan. SIW-13 [8] adds Cambodian, Kannada, and Mongolian to the languages of SIW-10. SIW-10 is partitioned into a train-set of 8,045 samples and a test-set of 5,000 images, while SIW-13 has 9,791 and 6,500 respectively. Brief experimentation suggested that the partition of SIW-13 is not compatible with SIW-10, as test samples from SIW-13 appear to be in the SIW-10 train-set. At the time of writing this paper the state-of-the-art performance on this dataset is 94.6%, achieved by the MSPN method introduced in [10]. Briefly, MSPN is a CNN developed specifically for script identification which introduces, among other things, a horizontal pooling layer.

Figure 5: Comparison to state-of-the-art on SIW-10. Error rates of the state-of-the-art CNN approach (MSPN), CNN baseline methods (CNN and LCC), intermediary layers as metric learning fed into a Nearest Neighbour Classifier (Layer 1, Layer 2), and the proposed method (Deep MLP).

In Fig. 5 a comparison of the proposed deep MLP with state-of-the-art methods for script detection in the wild can be seen. The proposed method achieved an error rate of 13.4%, which is significantly worse than the state-of-the-art 5.6%. Yet, this experiment allows an analysis of the workings of the proposed deep MLP and of the benefits of using k-NN on the intermediary activations. The initial SIW-10 dataset was augmented by the three new languages of SIW-13. An MLP trained on SIW-10 was used to perform k-NN on the augmented dataset.

Fig. 6 shows the confusion matrix obtained by employing the first layer of the MLP with k-NN on the SIW-10 dataset augmented by the 3 languages of SIW-13. The overall accuracy is 83.7%, while for the initial 10 scripts it is 84.5%. While the second layer, with 84.6%, performed better than the first on the dataset for which the model was trained, it proved to be less generic than the first layer and obtained 77.3% when applied to all 13 classes.

The fact that the classes used to train the model have an average accuracy of 82.9% while the unseen classes have an average accuracy of 85.1% demonstrates the overall generality of the first layer. As opposed to the CVSI experiments, when training on SIW data the second layer consistently outperformed the first layer after some epochs.

In Fig. 7 the training of the deep MLP can be seen and we can observe that the second layer converges towards the output layer while the first layer appears to be more independent. The extent to which layer 2 is domain specific while layer 1 is much more domain independent can be seen in table II, where layer 1 increases error rates when changing domains by 2.2 and 2.8 times, while layer 2 increases error rates by 3.4 and 8.9 times respectively.

MLP train dataset   Retrieval dataset   k-NN on layer 1   k-NN on layer 2
SIW-10              CVSI                94.8%             76.1%
CVSI                CVSI                98.2%             97.3%
CVSI                SIW-10              66.4%             47.7%
SIW-10              SIW-10              84.5%             84.6%
Table II: Cross-domain use of deep MLP layers
Figure 6: Confusion matrix of k-NN on 13 scripts from the SIW datasets using the first layer of a deep MLP trained on 10 of them.
Figure 7: Training of the proposed deep MLP on the SIW-10 data.
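The error-rate increases quoted for Table II can be recomputed from the table itself as the ratio of cross-domain error to in-domain error for each retrieval dataset. The snippet below is an illustrative recomputation from the rounded table entries, so the last digit may differ slightly from figures derived from unrounded accuracies; the dictionary layout and function name are choices made for the sketch.

```python
# (train dataset, retrieval dataset) -> (layer 1, layer 2) accuracy in %
accuracy = {
    ("SIW-10", "CVSI"): (94.8, 76.1),
    ("CVSI", "CVSI"): (98.2, 97.3),
    ("CVSI", "SIW-10"): (66.4, 47.7),
    ("SIW-10", "SIW-10"): (84.5, 84.6),
}

def error_increase(retrieval, layer):
    """Cross-domain error divided by in-domain error; layer is 0 (layer 1) or 1 (layer 2)."""
    in_domain = 100.0 - accuracy[(retrieval, retrieval)][layer]
    other = next(train for train, ret in accuracy
                 if ret == retrieval and train != retrieval)
    cross = 100.0 - accuracy[(other, retrieval)][layer]
    return cross / in_domain

for retrieval in ("SIW-10", "CVSI"):
    for layer in (0, 1):
        print(retrieval, f"layer {layer + 1}", round(error_increase(retrieval, layer), 1))
```

With the rounded entries this gives roughly 2.2 and 2.9 for layer 1 and 3.4 and 8.9 for layer 2, in line with the observation that layer 2 is considerably more domain specific.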

IV-C Visual Identification of Handwritten Language

Figure 8: Samples from handwritten language identification. Samples of the same text and writer in Greek (a), English (b), French (c), and German (d).

The boundary between script and language identification is hard to define, as is best exemplified by Latin-derived languages. Visual language identification could also be perceived as fine-grained script classification. Yet distinguishing between such scripts before identification is required if one is to use language models for identification. In the case of handwriting identification, it becomes even more important, since identification frequently relies on word-spotting, which by definition needs a lexicon. In order to address this problem the ICDAR 2011 writer identification dataset [17] was used to estimate language classification performance. This dataset consists of two paragraph-long texts translated into four languages: Greek, English, French, and German. In Fig. 8 the same text written by the same writer in all four languages can be seen. Twenty-six writers wrote all these samples, which were then digitized and binarized. State-of-the-art methods report performances of over 95% accuracy in writer identification. As the dataset has never been used in the language identification context and visual handwritten-text language identification is, to the authors' knowledge, a new problem, there is no state-of-the-art method. In order to make writer identification irrelevant, a 26-fold cross-validation scheme was employed: in each fold, all 8 samples contributed by one writer were used as testing samples, while all other samples were used for training.
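The cross-validation scheme holds out one writer per fold, so that no writer's handwriting appears in both the training and the testing set. A minimal sketch follows, assuming samples are indexed writer by writer; the function name and the indexing are illustrative choices.

```python
def leave_one_writer_out(writer_ids):
    """Yield (train_indices, test_indices), holding out all samples of one writer per fold."""
    for held_out in sorted(set(writer_ids)):
        test = [i for i, w in enumerate(writer_ids) if w == held_out]
        train = [i for i, w in enumerate(writer_ids) if w != held_out]
        yield train, test

# 26 writers x 2 texts x 4 languages = 208 samples, 8 per writer.
writer_ids = [writer for writer in range(26) for _ in range(8)]
folds = list(leave_one_writer_out(writer_ids))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 26 200 8
```

Each fold therefore trains on 200 samples and tests on the 8 samples of the held-out writer.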

Method                            Accuracy
Random Classifier                 25.0%
SRS-LBP learning free pipeline    50.96%
SVM + SRS-LBP features            91.18%
Deep MLP + SRS-LBP features       92.78%
Table III: Visual Language Identification Accuracy
Figure 9: Confusion Matrix on Visual Language Detection with an ANN 26-fold cross-validation

In Table III the performance of the proposed deep MLP along with baselines is presented. The dataset is totally balanced and has 4 classes, so an unbiased random classifier would achieve 25%. The unsupervised SRS-LBP pipeline from [9] performs poorly, although significantly better than the random classifier. The same pipeline, when applied to the same dataset for writer identification, obtains 98.1%. This could be interpreted as an indication of how much harder the handwritten visual language identification problem is compared to writer identification, at least for LBP features. The proposed Deep MLP method is exactly the same as the one described and used for CVSI, but instead of three pooling zones only global pooling is employed, as each image contains more than one text-line. Deep MLP achieves the top performance of 92.78%, although an SVM applied on the same features performs nearly as well. In Fig. 9 the confusion matrix of visual language classification can be seen. As one would expect, Greek is separated perfectly from the other three, while English, French, and German have some confusions. It should be pointed out that the dataset was acquired in Greece and all subjects would have Greek as their primary language, which might help distinguish it from the other three languages.

V Conclusions

V-A Discussion and Remarks

Several conclusions can be drawn from the experiments.

As the problem exemplified by CVSI was the primary focus for the development of the proposed method, the near state-of-the-art results validate the overall strategy of using hand-crafted texture features as the basis for script classification. Although all top-achieving methods in CVSI used CNN, the proposed method demonstrates that hand-crafted features can outperform all but the best of them. It could be suggested that the fixed aspect ratio which CNN require, although addressed by all CNN methods in different ways, is handled less effectively than by the pooling performed on the LBP features. Video-text as a phenomenon is a lot more regular: occlusions and other complicated noise and distortions are very rare for such data. It stands to reason that the SRS-LBP features perform better in this case compared to text in the wild.

Although there is a lack of competing methods on the task, the visual language identification experiments demonstrated that detecting the language before recognition is possible. While the experiment was quite challenging, the fact that the dataset was balanced in every aspect might mean that the reported performance might not be the same in real world scenarios. The fact that all texts are of the same size, acquired under exactly the same conditions etc., allows for a model so big to be trained on a mere 208 samples. The model could be trained on more realistic data but then a much larger dataset would be required. The high performance demonstrated by the proposed pipeline should probably be attributed to the SRS-LBP features rather than the deep MLP as an SVM classifier on the same features provided results nearly as high.

When training the deep MLP on the CVSI and SIW datasets, the networks demonstrated consistent behaviour within each dataset. On SIW data, after some epochs the output layer outperformed k-NN on layer 2, which in turn outperformed k-NN on layer 1. On CVSI data, the behaviour was also consistent: applying k-NN on layer 1 outperformed the output layer, which in turn outperformed k-NN on layer 2. The consistency in the difference between these two datasets suggests that video-text and scene-text script identification differ in how complicated they are, and that a deeper architecture might improve performance on scene-text. The use of early layers in a deep MLP as a learned metric is probably the most interesting contribution of this paper. It should be noted that over several training runs of the models, performed in order to tune the architecture, the convergence of the curves was consistent for each dataset. This consistency could be attributed to the drop-out regularization. Drop-out regularization should also be credited with the fact that no over-fitting occurred on the training data, since continuing training never increased the validation error in any significant manner. The experiment on the augmented SIW-10 dataset provided an experimental and quantitative validation of a hypothesis underlying the presented work: in a sequence of fully connected layers, activations of earlier stages are more generic than later ones.

V-B Future Work

This paper presents a work in progress and several questions remain open. Further experimentation should be done in order to get a better understanding of the benefits and limits this metric learning has when compared to other methods such as Siamese networks. Although the performance in script identification in the wild was high, it was significantly lower than the state-of-the-art. Adapting the network architecture to perform better on unconstrained scene-text is one way to follow up on the reported work.

References

  • [1] D. Ghosh, T. Dube, and A. P. Shivaprasad, “Script recognition—a review,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 12, pp. 2142–2161, 2010.
  • [2] R. Unnikrishnan and R. Smith, “Combined script and page orientation estimation using the Tesseract OCR engine,” in Proceedings of the International Workshop on Multilingual OCR.   ACM, 2009, p. 6.
  • [3] G. Zhu, X. Yu, Y. Li, and D. Doermann, “Language identification for handwritten document images using a shape codebook,” Pattern Recognition, vol. 42, no. 12, pp. 3184–3191, 2009.
  • [4] M. Ferrer, A. Morales, U. Pal et al., “LBP based line-wise script identification,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on.   IEEE, 2013, pp. 369–373.
  • [5] A. Ul-Hasan, M. Z. Afzal, F. Shafait, M. Liwicki, and T. M. Breuel, “A sequence learning approach for multiple script identification,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on.   IEEE, 2015, pp. 1046–1050.
  • [6] L. Mioulet, U. Garain, C. Chatelain, P. Barlas, and T. Paquet, “Language identification from handwritten documents,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on.   IEEE, 2015, pp. 676–680.
  • [7] N. Sharma, R. Mandal, R. Sharma, U. Pal, and M. Blumenstein, “ICDAR2015 competition on video script identification,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on.   IEEE, 2015, pp. 1196–1200.
  • [8] B. Shi, X. Bai, and C. Yao, “Script identification in the wild via discriminative convolutional neural network,” Pattern Recognition, vol. 52, pp. 448–458, 2016.
  • [9] A. Nicolaou, A. Bagdanov, M. Liwicki, and D. Karatzas, “Sparse radial sampling LBP for writer identification,” pp. 720–724, 2015.
  • [10] B. Shi, C. Yao, C. Zhang, X. Guo, F. Huang, and X. Bai, “Automatic script identification in the wild,” pp. 531–535, 2015.
  • [11] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
  • [12] F. Chollet, “Keras,” https://github.com/fchollet/keras, 2015.
  • [13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [14] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah, “Signature verification using a “siamese” time delay neural network,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, no. 04, pp. 669–688, 1993.
  • [15] C. Liu, “Probabilistic Siamese network for learning representations,” Ph.D. dissertation, University of Toronto, 2013.
  • [16] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: an astounding baseline for recognition,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on.   IEEE, 2014, pp. 512–519.
  • [17] G. Louloudis, N. Stamatopoulos, and B. Gatos, “ICDAR 2011 writer identification contest,” in Document Analysis and Recognition (ICDAR), 2011 International Conference on.   IEEE, 2011, pp. 1475–1479.