Script identification Sahare2017 ; singh2015offline is a key step in Optical Character Recognition (OCR) systems. A script can be defined as a writing system consisting of a set of specific symbols and graphical shapes. Each script has specific attributes that distinguish it from the others. Script identification is of utmost importance for understanding handwritten documents automatically. Identification of handwritten script has gained prime importance in the document image processing community, one of the reasons being the global digitization of handwritten scriptures and books. The basis of such script identification is the unique spatial relation among the strokes of a particular script, which helps in distinguishing the scripts from one another.
In a multilingual singh2017handwritten environment, script recognition gains significant importance, since every handwritten text recognition system is language specific. In a country like India, where there are 12 official scripts (https://en.wikipedia.org/wiki/Languages_of_India, accessed on 20/02/2018), handwriting recognition becomes complicated. Hence, a robust script identification system is necessary to automate the process of text recognition ghosh2010script ; ubul2017script . The critical challenge in such a task arises from the free-flowing nature of handwriting, unlike machine-generated text, which has a fairly uniform structure. The variation in writing style among individuals and the complex shapes of characters are some of the major hindrances for handwritten script identification. Figure 1 shows some handwritten text documents written in different scripts.
Most of the existing literature on script identification/recognition focuses on extracting linguistic and statistical features at the word or the line level ghosh2010script ; ubul2017script ; chanda2009word ; rajput2010handwritten . In contrast, the proposed framework is based on the hypothesis that the character set of a particular script alone contains features distinctive enough for script recognition at word level. To extract these features, we employ both the offline (image representation) and online (stroke order representation) modalities of handwritten text simultaneously, to utilize their combined potential for the script identification task. To this end, we have designed a deep neural architecture which can be trained in an end-to-end manner. To our knowledge, the proposed method is the first work which introduces the idea of combining offline and online modalities in a single deep network to improve script identification performance. Along with this multi-modal deep network, we also emphasize that the models are trained using character-level data only and are tested on the script identification task for both character and word level data.
Using only character level data in the training stage provides several advantages. Firstly, it is easier to collect a large amount of character level data in less time, which saves both time and effort while preparing the dataset. Secondly, there are fewer unique characters than unique words, since many combinations of character sequences are possible at the word level. Thirdly, character level training is much lighter than word level training, which makes it well suited to hand-held computing devices, e.g., mobile devices. Note that character level data carries a significant amount of information about the script, which can be utilized to achieve word level script identification.
The proposed deep network utilizes the information from both offline and online modalities jointly to leverage the respective advantages of both representations. Online handwritten data namboodiri2004online comprises a 1-D signal in the form of a sequence of points, whereas offline data comprises images. The reason for including the online modality is that it holds both spatial and temporal information of the character. In contrast, offline word images do not contain temporal information. From the literature, it is noted that due to the availability of dynamic information such as the temporal order of the points, the performance of online systems is comparatively better than that of offline ones hamdani2009combining . However, offline data contains pixel level spatial information which partly complements the temporal and spatial information of online data.
Recently, the advent of convolutional deep architectures gomez2017improving has made it possible to improve significantly on the performance of traditional feature-based approaches. We draw the inspiration for our multi-modal framework from popular deep learning frameworks for RGB-D data xu2017multi ; han2017cnns ; wang2015mmss ; asif2017multi , which aim to utilize both the RGB image and the depth modality in a single deep network to benefit from both modalities simultaneously. Our proposed method is designed to handle both online and offline data in a single network, which eliminates the need for two separate models for offline and online handwritten data. Our proposed framework first converts the online data into its offline form (and vice versa), and then feeds the data represented in the two modalities to the network to learn the combined vector space for script identification.
In addition to this, we propose a novel conditional fusion scheme for script identification. The primary motivation is to include complementary information from the offline and online modalities. For instance, if the original data is from the online (offline) modality, we convert it to its offline (online) equivalent and then feed both of them into our deep network. During the multimodal fusion, we combine features from both modalities adaptively, conditioned on the modality of the original data. Note that the data comes in a single modality, i.e. both modalities of a particular sample are not available. Thus, we perform inter modality conversion to generate the other one. Using both modalities simultaneously provides two major advantages. Firstly, it gives better script identification performance; secondly, it makes it possible to design a single system which works for both offline and online handwritten data. The underlying idea is that our proposed conditional multimodal fusion mechanism encourages the deep network to combine both modalities according to their individual contributions.
The main contributions of the paper are as follows. Firstly, we propose a deep multi-modal framework for script identification which uses the online and offline modalities of character level data to exploit their shared information. We thus have a single framework which can handle both modalities by converting the data from the available modality to the other one, and which learns a joint embedding to utilize the information from both modalities for better performance. Here, we also develop a novel conditional multi-modal fusion scheme for effective combination of the two modalities. In addition, it avoids the need for two different models to handle data in two different modalities. The trained model is used to identify the script for both online and offline data at the character as well as the word level. To the best of our knowledge, this is the first attempt to develop a single framework which works in parallel for both online and offline data for handwritten script identification. Secondly, our proposed method has the advantage of a lightweight training model for script identification, since we train on character level data, which has fewer combinations than word level data. Thirdly, we have conducted exhaustive experiments using a number of different scripts from both modalities, online and offline, to demonstrate the feasibility of the method and its competitive performance against existing baseline methods.
The rest of the paper is organized as follows. In Section 2, we discuss related work on script identification, the stroke recovery task and popular deep multimodal frameworks for different computer vision problems. In Section 3, the proposed framework for script identification is described. In Section 4, we describe the experimental setup and discuss the results of the various experiments conducted to justify the significance and efficiency of our proposed method. Finally, conclusions and future directions are given in Section 5.
2 Related Work
Handwritten script identification:
Handwritten script identification is an important task in developing a multilingual handwriting recognition system where more than one script might be present. A comprehensive survey on script identification has been presented in ghosh2010script ; ubul2017script . Various works have been reported on printed document script identification chanda2009word . However, script identification in handwritten documents is much more difficult than in printed documents, due to the varying handwriting styles of different individuals. A popularly explored approach in handwritten script identification is to extract linguistic and statistical features followed by a Support Vector Machine (SVM) as the classifier. In hiremath2010script , such a solution for word-level offline classification is proposed for Thai-Roman script classification. A technique for script identification in torn documents is proposed by Chanda et al. chanda2009word , in which Roman and Indic scripts are considered to evaluate the performance. They worked with rotation invariant Zernike features and the rotation dependent gradient feature, using PCA-based methods to predict orientation and then applying an SVM classifier at the character level. The results are calculated for the word level using majority voting at the character level, followed by prediction at the document level in a similar fashion. Pal et al. pal2007handwritten proposed a modified quadratic classifier using directional features for recognition of off-line handwritten numerals of six popular Indic scripts. The bounding box of each numeral is divided into smaller blocks to capture local information, and directional features are extracted at two levels, one from the original image and the other from a down-sampled version of it obtained through Gaussian filtering. Moalla et al. proposed methods moalla2002extraction ; moalla2004extraction to separate Arabic text from documents containing both Arabic and English words. In ferrer2014multiple , Ferrer et al. proposed a method for script identification in offline word images using a word information index which estimates the amount of information included in a word. Different classifiers are trained using words with a similar amount of information. During testing, the appropriate classifier is chosen based on the word information index of the query keyword. Regional local features are studied in the work dhandra2007morphological . In rajput2010handwritten , Rajput et al. proposed a method based on features extracted using the Discrete Cosine Transform (DCT) and wavelets, along with a KNN classifier, to identify eight major scripts, including Devanagari, Gujarati, Gurumukhi, Kannada, Malayalam, Tamil and Telugu, at block level. Also, different texture features have been explored for the script identification task in various works hiremath2010script ; hangarge2010offline ; pal2012handwriting .
Neural network based solutions are also popular for script identification, as discussed in ghosh2010script . In one of the earliest works, neural nets were employed for script identification in postal automation systems roy2005system ; roy2005neural . In sankaran2012recognition , a BLSTM is used for printed Devanagari script recognition with five different features, namely, (a) the lower profile, (b) the upper profile, (c) the ink-background transitions, (d) the number of black pixels, and (e) the span of the foreground pixels. These features are fed to a Bi-RNN architecture using the Connectionist Temporal Classification objective function, which provides an improvement of more than 20% in Word Error Rate (WER) over the best available OCR system at the time of its publication. In ul2015sequence , a 1D-LSTM architecture with one hidden layer is used for script identification at the text-line level to learn binary script models, and the reported identification accuracy for English-Greek scripts is 98.19%.
Recently, Singh et al. singh2015word proposed a word level script identification approach for handwritten images, for which they designed a set of 82 features using a combination of elliptical and polygonal approximation techniques. The authors considered a total of 7000 handwritten text words from six different Indic scripts (Bangla, Devanagari, Gurumukhi, Malayalam, Oriya and Telugu) and the Roman script. They reported a maximum accuracy of 95.35% using a Multi Layer Perceptron (MLP) classifier. Handwritten numeral script identification has been proposed by Obaidullah et al. obaidullah2015numeral .
Along with these, a few deep learning based approaches shi2016script ; gomez2017improving ; mei2016scene for script identification in scene images have appeared in the literature recently. However, the potential of deep neural networks for handwritten script identification has not been explored completely. In shi2016script , local deep features are extracted using a pretrained CNN model, and discriminative clustering is carried out to obtain the mid-level representation by learning a set of discriminative patterns from the extracted local features. Following this, the deep features and the mid-level representations are jointly optimized in a deep network, and the proposed model is termed the Discriminative Convolutional Neural Network (DisCNN). Gomez et al. gomez2017improving used ensembles of conjoined networks in order to learn from the stroke patches along with their relative importance.
Online Trajectory Retrieval:
Recovery of temporal order from offline handwriting has long been studied doermann1995recovery ; boccignone1993recovering . Stroke and sub-stroke properties were utilized, and the authors provided a taxonomy of local, regional and global temporal clues which were found to be beneficial for the stroke recovery problem. In elbaati2009temporal , Elbaati et al. proposed an approach to recover the strokes by segmenting the offline word image into strokes and labeling all the edges as successive parts of the strokes. Then, a Genetic Algorithm is applied to optimize these strokes and produce the best possible stroke order. This method is applied in hamdani2009combining , which combines offline and online data for Hidden Markov Model (HMM) based Arabic handwriting recognition. The offline features and the temporal stroke order from online data are complementary in nature and, in combination, improve the recognition accuracy of the framework. In kato2000recovery , Kato et al. proposed a stroke recovery technique which works for single stroke characters. This system labels each edge of the word image and connects them based on a predefined algorithm without any learning method.
Deep Multimodal learning:
Deep multimodal learning is a very popular concept in the computer vision community for combining information from more than one source. However, in document image analysis, and more specifically in handwriting recognition, there are hardly any applications of multi-modal frameworks. In recent times, owing to the advancement of deep learning technology, different modalities have been combined for better accuracy in problems like scene understanding, RGB-D object detection, etc. In the tasks of Image Captioning karpathy2015deep and Visual Question Answering ilievski2017multimodal , the language feature and the image feature are combined using a deep neural network architecture. In RGB-D data, both image and depth modalities are explored for various tasks like Object Recognition wang2015mmss , Scene Classification zhu2016discriminative and Object Detection xu2017multi . Given the vast literature on multi-modal learning, we attempt to use a deep multi-modal framework to explore the joint information from both offline and online handwritten data for the script identification task.
3 Proposed Framework
The proposed approach can be divided into two steps. In the first step, we convert the data from its original modality to the equivalent opposite modality. In the next step, the data from both modalities are fed simultaneously as input to a deep neural network to combine their information. Note that only character level data is used to train the network. During testing, it can identify the script for both character and word level data from both modalities. In our work, we design a Convolutional-LSTM architecture in which the convolutional network is intended to extract more robust sequential features from the data and the LSTM module captures the contextual information of the sequence for better performance. An overview of our proposed multimodal framework is given in Figure 3. In this section, we describe the different modules of our framework in order.
3.1 Inter Modality Conversion
Inter modality conversion is a key step in our framework. Since our proposed deep network takes the offline-online pair as the input, a modality-conversion is required in order to get the opposite modality from the input data.
Offline to Online Conversion:
Offline to online conversion of handwritten data has been performed using the method of nel2005estimating . First, the skeleton image of the handwritten text is extracted through a thinning process to obtain a parametric curve. In order to compare a static image skeleton with a dynamic exemplar, the dynamic exemplar is translated until the centroids of the two are aligned, after which it is converted into a static image. The authors thicken both the static image skeleton and the image obtained from the dynamic exemplar to a line width of approximately five pixels. After this, matching between the two images is carried out, followed by trajectory extraction. The method determines the pen trajectory from a static, normalized image by making use of a Hidden Markov Model (HMM). For the problem of stroke order recovery, the sequence of states of the HMM is used to describe the sequence of pen positions as the image is produced. The HMM is built from the skeleton of the static image, and the dynamic exemplar is matched to the static image using this model. An example of offline to online conversion is shown in Figure 4.
Online to Offline Conversion:
Conversion of online data to offline is straightforward. Online data consists of consecutive co-ordinate points representing the flow of writing. To convert the online data to its offline equivalent, we first define an empty image matrix based on the difference between the maximum and minimum coordinate values. Thereafter, we mark the pixel positions of the empty image matrix that correspond to the online coordinate points and join consecutive points serially. This process generates the skeleton image of the handwritten data. Following this, a morphological thickening operation is performed in order to make the equivalent offline word image similar to real offline word image data. Online to offline conversion is shown in Figure 5.
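The online to offline steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the paper: the function name `online_to_offline`, the `thickness` parameter, the straight-line joining of consecutive points and the use of repeated 3x3 dilation as the thickening operation are all assumptions made here.

```python
import numpy as np

def online_to_offline(points, thickness=1):
    """Rasterize an (N, 2) sequence of (x, y) pen coordinates into a binary image."""
    pts = np.asarray(points, dtype=float)
    pts = pts - pts.min(axis=0)  # shift so the minimum coordinate maps to pixel 0
    # Image size follows from the difference of maximum and minimum coordinates.
    width, height = np.ceil(pts.max(axis=0)).astype(int) + 1
    img = np.zeros((height, width), dtype=np.uint8)

    # Mark the sampled points themselves.
    img[np.rint(pts[:, 1]).astype(int), np.rint(pts[:, 0]).astype(int)] = 1

    # Join consecutive points serially with densely sampled straight segments.
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.rint(np.linspace(x0, x1, steps + 1)).astype(int)
        ys = np.rint(np.linspace(y0, y1, steps + 1)).astype(int)
        img[ys, xs] = 1

    # Thickening, approximated here by repeated 3x3 binary dilation (assumption).
    for _ in range(thickness):
        padded = np.pad(img, 1)
        img = np.max(
            [padded[dy:dy + height, dx:dx + width]
             for dy in range(3) for dx in range(3)],
            axis=0,
        )
    return img
```

With `thickness=0` the function returns the one-pixel skeleton; larger values give progressively thicker strokes.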
3.2 Network Architecture for Online modality
The online modality of handwritten data consists of sequential co-ordinate points representing the flow of writing. A naive approach would be to feed these sequential co-ordinate points to an LSTM module and take the state of the last time step as the final feature representation of the online data. However, in the proposed approach, we have designed a Convolutional LSTM architecture for the online stream of our network. The key idea of using a CNN is to achieve a certain extent of shift, scale and distortion invariance. For a 2D image, the local connectivity of a CNN learns the correlation among neighbouring pixels. The objective of using a CNN on the sequential co-ordinate points before feeding them to the LSTM module is to learn the correlation among neighbouring co-ordinate points. In addition, it is intended to achieve a certain degree of invariance to the distortion or shift which may arise from the free-flowing writing of different individuals and from the capture of co-ordinate points by sensors. Hence, the convolutional network converts the co-ordinate points into a high dimensional space by incorporating the spatial correlation among nearby sequential points, in order to make the representation less sensitive to different individuals' handwriting and to distortions.
For every online data sample, there are $N$ sequential points, each represented by two values, its $x$-coordinate and its $y$-coordinate. In order to formulate the 1-D convolution over these points, we consider them as a 1-dimensional signal with 2 channels, i.e. $X \in \mathbb{R}^{N \times 2}$. Then we employ a 1-D convolution with filter $W \in \mathbb{R}^{M \times L \times C}$ ($M$ filters with dimension of $L \times C$). We restrict the length of the filter, $L$, to 5. $C$ is the number of channels or feature maps in the input. For example, $C = 2$ for the first convolution layer, since the input is a 1-dimensional 2-channel signal of length $N$. The output of the 1-D convolution is given by
$$Y[n, m] = b_m + \sum_{l=1}^{L} \sum_{c=1}^{C} W[m, l, c] \, X[n + l - 1, c] \qquad (1)$$
The 1-D convolution operation follows the same rule as 2-D convolution with a small difference, which is that the filter used here is of size $L \times C$ and it strides over only one direction (here the time direction). A graphical illustration of 1-D convolution over the online coordinate points is shown in Figure 6. In order to introduce invariance against the free-flowing writing of different individuals and noisy acquisition of data from sensors, we have added one maxpooling operation along the time direction after the second convolution layer. The window size of the maxpooling operation is 2, hence it halves the number of data points.
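As a rough illustration of the 1-D convolution and temporal maxpooling described above, the following NumPy sketch treats the pen trajectory as a 2-channel signal and slides a filter of length 5 along the time axis. The function names, the filter layout `(M, L, C)` and the zero padding that keeps the output length equal to the input length are assumptions of this sketch and are not taken from the paper.

```python
import numpy as np

def conv1d(x, w, b):
    """1-D convolution along time.

    x: (N, C) signal with C channels, w: (M, L, C) bank of M filters of
    length L, b: (M,) biases. Returns an (N, M) feature map; the input is
    zero padded so that the length N is preserved (assumption).
    """
    num_filters, length, _ = w.shape
    pad = length // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], num_filters))
    for n in range(x.shape[0]):
        window = xp[n:n + length]  # (L, C) neighbourhood around point n
        out[n] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return out

def maxpool1d(x, size=2):
    """Non-overlapping max pooling along time; halves the length for size=2."""
    n = (x.shape[0] // size) * size
    return x[:n].reshape(-1, size, x.shape[1]).max(axis=1)
```

For a trajectory of 10 points, `conv1d` with 32 filters yields a `(10, 32)` map and `maxpool1d` reduces it to `(5, 32)`.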
Note, however, that online data represents the flow of writing over time, and we observed that more than one maxpooling operation reduces the performance. Inspired by the network of engelmann2017exploring , we have used a global maxpooling lin2013network operation to obtain a global feature vector which contains the holistic information of all the co-ordinate points. This global feature vector is concatenated with the feature map of every data point and passed through the last convolution layer. This convolution layer is intended to combine each point-wise feature with the global feature adaptively. In our online stream, we have used six convolution layers with 32, 64, 128, 256, 256 and 512 filters respectively. Hence, we obtain an output tensor of size $\frac{N}{2} \times 1 \times 512$, which is thereafter fed to the LSTM module. Following this, we create a custom ‘Map-to-Sequence’ layer as the bridge between the convolutional layers and the recurrent layers, as mentioned in shi2016end . This ‘Map-to-Sequence’ layer converts the 3-dimensional tensor to a 2-dimensional time distributed feature representation of size $\frac{N}{2} \times 512$ for the online data points.
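The way the global feature vector is attached to every point-wise feature can be sketched as follows. The function name `append_global_feature` and the array layout (one row per time step) are assumptions of this sketch; the subsequent convolution layer that mixes the two halves is not shown.

```python
import numpy as np

def append_global_feature(features):
    """Concatenate a global max-pooled vector to every point-wise feature.

    features: (T, D) array with one D-dimensional feature per time step.
    Returns a (T, 2D) array whose last D columns repeat the global vector.
    """
    global_vec = features.max(axis=0)                     # (D,) max over time
    tiled = np.broadcast_to(global_vec, features.shape)   # repeated for each step
    return np.concatenate([features, tiled], axis=1)
```

For example, point-wise features `[[1, 5], [3, 2]]` have the global vector `[3, 5]`, so the result is `[[1, 5, 3, 5], [3, 2, 3, 5]]`.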
3.3 Network Architecture for offline modality
The online data is already time distributed, containing successive co-ordinate points which represent the flow of writing. In contrast, the offline data does not have any time information. Hence, in order to feed it to an LSTM module, we need to convert the offline word image into a sequential feature representation. In handwriting recognition, one of the popular approaches is to use sliding window based feature extraction roy2016hmm ; BHUNIA201812 . However, handcrafted features have their own limitations. To solve this problem, we have used a convolutional neural network shi2016end to extract sequential features from offline images, which are thereafter fed to the LSTM module. There may exist several cases where a character from one script resembles a character of a different script. For those instances, it is beneficial to look at the context of the ambiguous characters. Hence, we employ a Convolutional-LSTM architecture shi2016end for the script identification task, where the convolutional network is used to convert the offline image into its sequential deep convolutional feature representation. Then the LSTM module is used to capture the contextual information within a sequence. The architecture can take input images of arbitrary width, which is one of the major requirements of our framework, since we train the network using character level data and predict the result for both character and word level data. Word level data usually has a much larger width, and there is large variation in the length depending on the number of characters present in the word. Resizing the width to a fixed size is not a good choice, since it distorts the word image and may eliminate useful script specific information. However, all the images need to be scaled to a fixed height, keeping the aspect ratio the same, before being fed to the network.
Each feature vector of a feature sequence is generated in a left-to-right manner on the feature maps, taken column wise. This means that the concatenation of the i-th columns of all the maps gives rise to the i-th feature vector. As per the architecture, the width of each such column is maintained at one pixel. The features are translation invariant because the layers of convolution, max pooling and element-wise activation function operate on local neighborhoods. Hence, each column of the feature maps maps to a specific region of the original image. Such regions appear in the same left-to-right order as their corresponding columns on the feature maps. Each vector in the feature sequence can thus be regarded as a local image descriptor. Figure 7 graphically shows the process of feature sequence generation using the convolutional architecture shi2016end . Our convolutional architecture for feature sequence extraction is composed of seven convolutional layers. The major change we have made in our network is to include a global maxpooling lin2013network operation to obtain a global feature vector for the sequence, which is concatenated with every left-to-right feature map and fed to one last convolutional layer before being converted to the final feature sequence using the ‘Map to Sequence’ operation. The primary objective of including the global feature in our method is that, besides gathering the local information, we can also consider the holistic representation of the entire image.
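The column-wise generation of the feature sequence can be illustrated with a short NumPy sketch. The `(H, W, C)` layout of the feature maps and the function name `map_to_sequence` are assumptions made here for illustration; the paper's own layer follows shi2016end.

```python
import numpy as np

def map_to_sequence(feature_maps):
    """Turn convolutional feature maps into a left-to-right feature sequence.

    feature_maps: (H, W, C) array (height, width, channels).
    Returns a (W, H * C) array whose i-th row concatenates the i-th
    one-pixel-wide column of all C maps.
    """
    h, w, c = feature_maps.shape
    return feature_maps.transpose(1, 0, 2).reshape(w, h * c)
```

Because the sequence length equals the width `W` of the feature maps, wider word images simply produce longer sequences, which is what allows inputs of arbitrary width.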
3.4 Conditional Multi-modal Fusion
After obtaining the time distributed feature sequences from both the offline and online streams of every data sample using the 2-D and 1-D convolutional networks respectively, we feed those sequential features to two different LSTM (Long Short Term Memory) modules, one for each modality. Traditional RNNs suffer from the problem of vanishing and exploding gradients. To overcome these drawbacks, a different type of RNN known as LSTM is used. A memory cell along with three multiplicative gates constitutes an LSTM. These gates are called the input gate, forget gate and output gate. Conceptually, the past contents are stored in the memory cell, while the input and output gates enable the cell to store contents for a long period of time. The forget gate is used to clear the memory in the cell. The main advantage of an LSTM is its ability to handle long term dependencies better. The core idea of using the LSTM module is to extract the feature from the cell state of the last time step, after the LSTM has seen the complete offline or online feature sequence. This takes into account the sequential relation among all feature vectors of a sequence. This is highly desirable in a task like script identification, where two different scripts might have a few characters which are similar to some extent. Using the sequential approach, we can avoid this confusion by considering a global representation that includes the sequential relation among successive feature vectors of a sequence. For the combination of offline and online information, we propose a multimodal conditional fusion method. Simple concatenation of the features of the two modalities results in redundant information, which decreases the performance of the model. Also, it is necessary to take the most relevant information from the two modalities to classify the script correctly. Thus, we use a novel fusion technique that dynamically assigns weights across the modality representations. It learns the correlations between the offline and online modalities along with their adaptive contributions in the fusion.
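To make the role of the three gates and of the last cell state concrete, here is a minimal single-layer LSTM written in NumPy. It is a textbook formulation given only for illustration: the stacking of the four gate blocks inside `W`, `U` and `b`, the zero initial states and the function name are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_cell_state(seq, W, U, b):
    """Run one LSTM layer over a sequence and return the final cell state.

    seq: (T, D) feature sequence, W: (4K, D) input weights,
    U: (4K, K) recurrent weights, b: (4K,) biases, K = hidden size.
    """
    K = U.shape[1]
    h = np.zeros(K)  # hidden state
    c = np.zeros(K)  # memory cell
    for x_t in seq:
        z = W @ x_t + U @ h + b
        i = sigmoid(z[0:K])            # input gate: what to write
        f = sigmoid(z[K:2 * K])        # forget gate: what to clear
        o = sigmoid(z[2 * K:3 * K])    # output gate: what to expose
        g = np.tanh(z[3 * K:4 * K])    # candidate cell content
        c = f * c + i * g
        h = o * np.tanh(c)
    return c
```

The returned vector plays the role of the per-modality feature that is passed on to the fusion stage.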
Let the features from the final time step of the LSTM networks be $f_{on}$ and $f_{off}$, each of size $K$, for the online and offline modality respectively. At first, we concatenate $f_{on}$ and $f_{off}$ to obtain $f_{c}$ of size $2K$. The concatenated feature representation is conditioned on a 2 bit binary vector $m$ representing the original modality of the input data. It is important to let the model know the actual form of the input data, whether it is online data or offline data. This allows the model to give priority to the original modality adaptively when calculating the final feature representation. After appending $m$ to the concatenated representation we get a feature vector of size $2K + 2$. Thereafter, we pass it through a fully connected layer ($FC$) with weights $W_{fc} \in \mathbb{R}^{2K \times (2K+2)}$. The primary objective of this fully connected layer is to learn the correlation between the two modalities in order to assign their respective weightage accordingly. The output of this fully connected layer is $z$ with size $2K$. Finally, the sigmoid function $\sigma$ outputs the weight parameters. Using Equations 4 and 5 we get the weights $w_{on}$ and $w_{off}$. These weights are element-wise multiplied with their corresponding modality representations. The final feature vector $F$ is obtained by adding these two feature representations. Then, a fully connected layer is used which has the same number of neurons as the number of classes. A softmax layer outputs the probability distribution of the script over the classes. The conditional fusion is carried out by the following operations.
$$f_{c} = [f_{on} ; f_{off}] \qquad (2)$$
$$z = W_{fc} \, [f_{c} ; m] + b_{fc} \qquad (3)$$
$$w_{on} = \sigma(z_{1:K}) \qquad (4)$$
$$w_{off} = \sigma(z_{K+1:2K}) \qquad (5)$$
$$F = w_{on} \odot f_{on} + w_{off} \odot f_{off} \qquad (6)$$
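The fusion steps described in this subsection can be sketched in NumPy as below. The variable names, the ordering of the gate vector (online half first) and the choice to apply the sigmoid directly to the linear output of the fully connected layer are assumptions of this sketch; it illustrates the described flow rather than reproducing the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_fusion(f_on, f_off, modality, W, b):
    """Fuse online and offline features, conditioned on the original modality.

    f_on, f_off: (K,) LSTM features of the two streams.
    modality: (2,) binary vector marking the original modality of the sample.
    W: (2K, 2K + 2) weights and b: (2K,) biases of the fully connected layer.
    """
    K = f_on.shape[0]
    conditioned = np.concatenate([f_on, f_off, modality])  # size 2K + 2
    gates = sigmoid(W @ conditioned + b)                    # size 2K, values in (0, 1)
    w_on, w_off = gates[:K], gates[K:]
    # Element-wise weighting of each modality, followed by addition.
    return w_on * f_on + w_off * f_off
```

With all-zero `W` and `b`, every gate equals 0.5 and the output reduces to the plain average of the two features, which makes a convenient sanity check; a trained layer would instead shift the weights towards the more informative modality for each feature dimension.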
3.5 Implementation Details
Our proposed multimodal network uses character level data during training, and it predicts the script identification result for both character and word level data. Let $(x_{on}, x_{off})$ be the online-offline modality pair for a particular data sample which is to be fed as the input to our network. Note that only one of the modalities, i.e. either $x_{on}$ or $x_{off}$, is present originally, and we obtain the other one using the conversion method described in Section 3.1. $(y_{s}, y_{m})$ are the two given labels used to train the network in a supervised manner. Here, $y_{s}$ is the corresponding script label and $y_{m}$ is the original modality from which the data sample was obtained. $y_{m}$ is the extra supervision we use to impose a condition during multimodal fusion. During training, $y_{s}$ is used to calculate the cross-entropy loss for classification, and the network is trained through back-propagation. During testing, the network predicts the identified script as output. However, $y_{m}$ is present during both training and testing for the conditional fusion of the two modalities.
The network architecture of our framework is shown pictorially in Figure 8. We have included one global pooling operation (Sections 3.2 and 3.3) in both the offline and online stream networks in order to capture the holistic information about the data sample. The architecture of the offline stream consists of 7 convolutional layers and 4 maxpooling layers. For the 3rd and 4th maxpooling layers, the filter size is fixed at 2x1 in order to obtain feature maps with larger width, thus generating a longer feature sequence, which was found to be beneficial for capturing the spatial dependency among the characters of words. We normalize every offline sample to a height of 32, keeping the aspect ratio the same. In order to feed the character level offline data in batches during training, we resize every character to a size of 32x32. During testing, however, we can feed offline images of arbitrary width, keeping the aspect ratio constant, but only one at a time. This is necessary because word images (during testing) usually have a much larger width, i.e. a much higher aspect ratio, than single characters. In contrast, online data has no such size limitation, since it is already time distributed. The architecture of the online stream consists of 6 convolutional layers and one maxpooling operation. In order to accelerate the training process, we have added two batch normalization layers to each of our offline and online stream networks. Next, we have used a two-layer LSTM network with 512 hidden LSTM units. Hence, K equals 512 in our framework. The final features for the offline and online modalities are extracted from the cell state of the last time step of the two LSTM modules, after which multimodal conditional fusion is carried out.
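The height normalization mentioned above can be illustrated with a small NumPy sketch that rescales an image to 32 rows while keeping its aspect ratio. Nearest-neighbour sampling and the function name `normalize_height` are assumptions chosen to keep the example dependency-free; the interpolation used in the paper is not stated.

```python
import numpy as np

def normalize_height(img, target_h=32):
    """Resize a 2-D image to target_h rows, preserving the aspect ratio.

    Uses nearest-neighbour sampling (assumption); the output width is
    round(W * target_h / H), so wide word images stay wide.
    """
    h, w = img.shape
    target_w = max(1, int(round(w * target_h / h)))
    rows = np.minimum((np.arange(target_h) * h / target_h).astype(int), h - 1)
    cols = np.minimum((np.arange(target_w) * w / target_w).astype(int), w - 1)
    return img[np.ix_(rows, cols)]
```

A 64x200 word image becomes 32x100, whereas a square character image becomes 32x32, which matches the batch size used for character level training.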
We have implemented our complete system using Python and the TensorFlow framework on a 2.50 GHz Intel(R) Xeon(R) CPU with 32GB RAM and an NVIDIA Titan-X GPU. The weights of the model are initialized with the Xavier initializer. All convolutional and fully connected layers use Rectified Linear Units (ReLU). Training is carried out using the Stochastic Gradient Descent algorithm with a momentum of 0.9 and a learning rate of 0.01. The network converges after 30K iterations with a batch size of 32. The learning rate is multiplied by 0.1 when the validation error stops decreasing for a sufficient number of iterations. The weight decay regularization parameter is set to.
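The learning-rate rule described above can be sketched as a small plateau-based scheduler. This is an illustrative, framework-independent sketch: the class name and the `patience` value are assumptions, since the text does not state how many stalled evaluations trigger a decay.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` when validation error stalls."""

    def __init__(self, lr=0.01, factor=0.1, patience=3):
        self.lr = lr                  # initial learning rate from the text
        self.factor = factor          # decay factor from the text
        self.patience = patience      # assumed value; not given in the text
        self.best = float("inf")
        self.stale = 0

    def step(self, val_error):
        """Update with the latest validation error and return the current rate."""
        if val_error < self.best:
            self.best = val_error
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr

scheduler = PlateauDecay()
# Hypothetical validation errors: three non-improving checks trigger one decay.
history = [scheduler.step(e) for e in [0.30, 0.25, 0.26, 0.25, 0.27, 0.20]]
```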
In this section, we report the performance of our script identification framework. We first introduce the datasets used for our study and then present the detailed script identification performance along with different baseline methods, error analysis and discussions.
As per our findings, there exists no standard dataset for handwritten script recognition. We have therefore collected various publicly available word and character recognition datasets BHUNIA201812 ; roy2016hmm of different scripts to prepare the database required for script recognition. In our experiments, we consider a total of 7 scripts for performance evaluation, namely Devanagari, Bangla, Odia, Gurumukhi, Tamil, Telegu and English. Among these scripts, Devanagari, Bangla, Gurumukhi and Odia are descended ghosh2010script from a common ancestor in the Brahmi script family. There is a good extent of similarity between Bangla and Devanagari, and between Devanagari and Gurumukhi, as mentioned in BHUNIA201812 . Similarly, Tamil and Telegu are two south Indian scripts. English, on the other hand, is a global language which serves as a medium of communication in different parts of the world. Most documents are bi-script, containing one of the regional languages along with English. Hence, our selection of scripts for performance evaluation is intended to make the task of script identification more difficult. Table 1 gives the details of the dataset used for script recognition. All the experiments for script recognition have been done in a 10 fold cross validation mode with a 7:1:2 split for training, validation and testing, i.e. 70% of the data is used for training, 10% for validation and 20% for testing.
|Scripts|Number of Samples (Character Level)|Number of Samples (Word Level)|
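The 70% / 10% / 20% partition into training, validation and testing data can be sketched as follows. This shows a single shuffled split only; the helper name and the fixed seed are assumptions, and the repetition over folds used in the experiments is not reproduced here.

```python
import random

def split_70_10_20(samples, seed=0):
    """Shuffle and split samples into 70% train, 10% validation, 20% test."""
    items = list(samples)
    random.Random(seed).shuffle(items)   # seeded for reproducibility (assumption)
    n_train = len(items) * 7 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Hypothetical dataset of 1000 sample indices.
train, val, test = split_70_10_20(range(1000))
```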
4.2 Different Baselines
As mentioned earlier, there is no earlier framework for handwritten script identification that uses a multimodal deep network and handles both offline and online data with a single model. One naive approach is to convert the data from either modality to its equivalent in the opposite modality, and to feed the framework the modality it has been trained on. For instance, if a framework has been trained on offline handwritten data and we have online handwritten data for testing, the naive approach is to convert the online data to its offline equivalent and test it with the model trained on offline data. However, this naive approach has a major limitation: although handwritten data can be converted between the two modalities, the distribution of the converted data is not similar to that of real data, which limits the performance. In order to justify the superiority of our method, we evaluate both in-modality and cross-modality performance for all the baseline methods. Another contribution of our framework is that it can be trained with lightweight character-level data and still achieve performance comparable to the traditional way of training a script identification model on word-level data. Hence, we report the performance of our framework for training on both character-level and word-level data. For a fair comparison between the two, we use a nearly equal number of word-level and character-level samples for training. Table 1 lists the number of word-level and character-level samples present from the offline and online modalities, and shows that the sample counts lie within nearly the same range. For every cross-modality experiment (training on online data and testing on offline data, or vice versa) or cross-level experiment (training on character-level data and testing on word-level data, or vice versa), we use 7 folds of data of a particular level or modality for training, while testing and validation are done on data of the other modality or level with 2 folds and 1 fold respectively.
To evaluate our proposed framework, we define a few baseline methods based on deep neural network architectures, and we discuss the limitations of every baseline with respect to our framework. All these baseline methods are defined for single-modality data in order to justify the improvement in performance that we achieve with our multimodal framework. In addition, our proposed multimodal fusion method is compared with different traditional multimodal fusion methods in section 4.3. The baseline methods using traditional classifiers and hand-crafted features are detailed separately in section 4.5. For the first two baseline methods, we simply split the online and offline streams of our multimodal network into two separate baselines for online and offline data respectively. The performance comparison with these two baselines justifies the necessity of designing a multimodal deep framework.
For this baseline, we use a 1D-convolutional-LSTM network for online handwritten script identification. It has the same configuration as our online stream network and is trained only on online data. The softmax classification layer is applied to the output of the last time step. Although the performance is competitive on online data, the major limitation is that the cross-modality performance is restricted, since the network is unaware of the distribution of offline data.
For this baseline, we use a 2D-convolutional-LSTM network for offline handwritten script identification, following the same architecture as the offline stream of our network. It is trained only on offline images. Here also, the main limitation is that it does not perform well on cross-modal data.
One of the important contributions of our online stream network is the use of a 1D-convolutional network over the online coordinate points in order to learn the structural correlation among neighboring points. Hence, we define our third baseline in order to justify the improvement that we achieve due to this 1D-convolutional network. Here, we feed the coordinate points directly into the LSTM network for script identification and evaluate the identification performance. The architecture and the hyperparameter settings are kept the same as mentioned in section 3.5.
Our framework is trained on data in paired modalities, where the original modality of the training samples is drawn from the offline and online modalities in equal proportion. We have evaluated the performance of our model on real online and real offline data individually. We found that our model generalizes well to both, with no significant change in accuracy. The results are reported in Table 3. We have also evaluated the performance when only one modality is used as the source of training data. In this case, the script identification performance decreases compared to training the network with both modalities as the source. Hence, we conclude that our proposed architecture generalizes well to both modalities when it is trained on both online and offline data. A comparative study is shown in Fig. 11. The confusion matrix for both character-level and word-level training is shown in Figure 12.
4.3 Comparative study with different multimodal fusion methods
One of the most popular applications of multimodal fusion is Visual Question Answering (ilievski2017multimodal), where image and language feature representations are combined. Here, we likewise combine the offline and online feature representations for script identification using a conditional multimodal fusion method. We compare our multimodal fusion method with different traditional fusion methods popular in the literature. Consider the feature representations of the offline and online modalities, both of the same dimension. The traditional approaches used in our comparison are as follows. Firstly, the two features are concatenated and fed to a fully connected layer and a softmax layer for classification. Secondly, the two features are added or multiplied element-wise, followed by a fully connected layer and a softmax layer for classification. Thirdly, we consider the outer product between the two features with bilinear pooling (lin2015bilinear), followed by a fully connected layer and a softmax layer for final classification. Fourthly, we use Multimodal Compact Bilinear (MCB) pooling with pooling dimension 4K for multimodal fusion; more details about MCB pooling can be found in (gao2016compact). We have also evaluated the performance of our multimodal fusion method without conditional fusion. The results for the different multimodal fusion strategies are reported in Table 4.
|Training-Testing Pair Combination|
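The simpler traditional fusion operations compared in this section can be sketched in plain Python as follows. The function names and the toy three-dimensional features are illustrative assumptions; MCB pooling, the proposed conditional fusion, and the fully connected and softmax layers that follow each fusion are omitted.

```python
def concat_fusion(f_off, f_on):
    """Concatenation: output has twice the feature dimension."""
    return list(f_off) + list(f_on)

def add_fusion(f_off, f_on):
    """Element-wise addition: output keeps the feature dimension."""
    return [a + b for a, b in zip(f_off, f_on)]

def multiply_fusion(f_off, f_on):
    """Element-wise multiplication: output keeps the feature dimension."""
    return [a * b for a, b in zip(f_off, f_on)]

def outer_product_fusion(f_off, f_on):
    """Flattened outer product: output dimension is the square of the input."""
    return [a * b for a in f_off for b in f_on]

# Toy features standing in for the offline and online LSTM outputs.
f_off = [1.0, 2.0, 3.0]
f_on = [0.5, -1.0, 2.0]
fused = {
    "concat": concat_fusion(f_off, f_on),
    "add": add_fusion(f_off, f_on),
    "multiply": multiply_fusion(f_off, f_on),
    "outer": outer_product_fusion(f_off, f_on),
}
```

The quadratic growth of the outer product is the usual motivation for compact bilinear pooling, which approximates it in a lower-dimensional space.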
4.4 Analysis of Character Level Training
In this paper, we hold the opinion that character-level training data is sufficient to achieve state-of-the-art script identification performance, on par with training on word-level data. In addition, character-level training possesses a few advantages, as mentioned in section 1. Figure 13 compares character-level and word-level training with respect to a varying number of training samples. The results show that comparatively fewer character-level training samples than word-level training samples are sufficient to reach the optimum performance. Word-level training outperforms character-level training after a certain number of training samples. However, the difference between the two is almost negligible, with some added advantages on the side of character-level training.
4.5 Comparative study with different traditional classifiers
In this paper, we have considered convolutional layers stacked with LSTMs due to their superior performance reported in recent literature. The result is a deep multimodal network which can be trained in an end-to-end manner using back-propagation and which takes both the offline and online modalities of the data as input. In recent literature, deep neural networks have shown clear superiority over traditional classifiers and handcrafted features. Most of the previous methods for handwritten script identification are based on such traditional classifiers and handcrafted features. However, since there is no standard dataset, we cannot compare our deep learning based system with those methods directly. Hence, we implemented some baseline methods using popular traditional classifiers and popular handcrafted features. We call these methods Traditional Approaches (TA) and report their results on our dataset of 7 handwritten scripts. We show the performance using both character-level and word-level data for training. We also report the cross-modality script identification accuracy. For these traditional methods, however, training is done on a single modality, i.e. without multimodal combination. The different baseline approaches considered in our experiments are described in the following subsections.
4.5.1 Performance on Offline Data
For offline word images, we use PHOG (Pyramidal Histogram of Oriented Gradient) and LBP (Local Binary Pattern) for feature extraction. PHOG is a gradient-based feature which has been utilized in sliding-window based handwriting recognition roy2016hmm ; BHUNIA201812 . LBP is a texture descriptor which has been used in different tasks such as facial expression recognition (moore2011local) and handwritten word spotting in historical documents (dey2016local).
For this baseline, we use two popular traditional classifiers (fernandez2014we), SVM (Support Vector Machine) and Random Forest, for the script identification task along with the PHOG and LBP features. Both SVM and Random Forest have been studied extensively for various classification problems in the Computer Vision and Document Image Analysis communities. Here, we extract the handcrafted features using PHOG and LBP, and evaluate the classification performance using the SVM and Random Forest classifiers. The results are reported in Table 5 as TA_1. Note that with this baseline it is not possible to predict the script at word level using character-level training data, since SVM and Random Forest cannot handle sequence-based classification. Hence, the results are reported accordingly.
In this baseline, we explore the sequence-based script identification approach using two traditional sequential classifiers, HMM and HCRF. For sequential feature extraction from word-level or character-level images, a sliding window moves from left to right across the image, and the PHOG or LBP feature is extracted from each window position, which represents the feature at a single time step. This type of sliding-window based classification approach has been used in music score writer identification (roy2017hmm). As this baseline is a sequence-based classification approach, it is possible to use character-level data for training in order to predict the script at word level. The results are reported in Table 5 as TA_2.
Here, we use the same sliding-window based feature extraction approach as in TA_2; the only difference is that we use a 2-layer LSTM as the classifier. The performance of different handcrafted features with LSTM has been studied in (chherawala2016feature) for offline handwriting recognition. Hence, we employ a similar framework for the script identification task, and name it TA_3.
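The sliding-window extraction described above can be sketched as follows. This is an illustration only: the window width, the stride, and the ink-density feature are assumptions, with ink density used as a simple stand-in for the PHOG or LBP descriptor computed on each window.

```python
def sliding_windows(image, window_width=4, stride=2):
    """Yield vertical strips of a 2D image (a list of rows), left to right."""
    width = len(image[0])
    for start in range(0, width - window_width + 1, stride):
        yield [row[start:start + window_width] for row in image]

def ink_density(window):
    """Stand-in feature: fraction of foreground (non-zero) pixels in the window."""
    pixels = [p for row in window for p in row]
    return sum(1 for p in pixels if p) / len(pixels)

def feature_sequence(image, feature_fn=ink_density, window_width=4, stride=2):
    """One feature value per window position, i.e. per time step."""
    return [feature_fn(w) for w in sliding_windows(image, window_width, stride)]

# Toy 4x8 binary image with ink only in its left half.
toy = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(4)]
seq = feature_sequence(toy)
```

Because the number of windows depends only on the image width, the same extractor produces short sequences for characters and longer ones for words, which is what allows a sequential classifier trained on characters to be applied to words.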
4.5.2 Performance on Online Data
Online data contains the successive coordinate points representing the trajectory of the pen's movement. For online handwriting recognition, one of the most popular handcrafted feature descriptors is NPEN++ (jaeger2001online). We use this feature in our traditional baseline method for online handwritten script identification. Here, five different features, namely Curliness (CR), Writing direction (WD), Linearity (LR), Slope (SP), and Curvature (CV), are extracted from the sequence of coordinate points and concatenated to form the final feature descriptor. More details about these features can be found in (jaeger2001online).
In this baseline, we use the NPEN++ feature along with the sequential classifiers HMM and HCRF. HMM and the NPEN++ feature have been used for online handwriting recognition in (jaeger2001online). Hence, it is reasonable to evaluate the performance of these combinations for the online handwritten script identification task. The results are reported in Table 6 as TA_4.
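As a hedged illustration of two of the five features listed above, the sketch below computes writing direction and curvature from a sequence of pen coordinates. It follows a commonly used simplified formulation in which both are expressed as the cosine and sine of an angle estimated from the neighbouring points; the exact definitions, neighbourhood sizes and normalization in the cited work may differ, and the remaining three features are not shown.

```python
import math

def writing_direction(points):
    """Cosine and sine of the local writing direction at each interior point."""
    feats = []
    for t in range(1, len(points) - 1):
        dx = points[t + 1][0] - points[t - 1][0]
        dy = points[t + 1][1] - points[t - 1][1]
        ds = math.hypot(dx, dy) or 1.0   # guard against repeated points
        feats.append((dx / ds, dy / ds))
    return feats

def curvature(directions):
    """Cosine and sine of the angle between the neighbouring writing directions."""
    feats = []
    for t in range(1, len(directions) - 1):
        c0, s0 = directions[t - 1]
        c1, s1 = directions[t + 1]
        feats.append((c0 * c1 + s0 * s1, c0 * s1 - s0 * c1))
    return feats

# Toy trajectory: a straight horizontal pen movement.
line = [(float(i), 0.0) for i in range(6)]
wd = writing_direction(line)
cv = curvature(wd)
```

For a straight stroke the direction is constant and the curvature angle is zero, so both features reduce to a cosine of 1 and a sine of 0 at every point.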
4.6 Error Analysis
Inter-modality conversion is one of the crucial steps in our framework, since our proposed architecture takes the offline-online modality pair of a data sample as input. Although the conversion from the online to the offline modality is trivial, the recovery of stroke information from offline data is challenging due to the free-flowing nature of handwriting by different individuals. For this task, we have adopted the method in (nel2005estimating). However, during our experimental analysis, we observed that the offline-to-online conversion algorithm fails to recover proper stroke sequences in some specific cases. Figure 14 shows some examples where the adopted algorithm fails to recover the expected stroke sequence. One of the main reasons is skeletonization error, which appears as unwanted skeleton branches with incorrect angles caused by the uneven thickness of the handwritten strokes and by surface noise. It is also observed from the second image of Figure 14 that the algorithm tends to miss strokes at junction points due to the presence of the Matra, which is a very common feature among Indic scripts. This problem can be solved to some extent by Matra removal, as proposed by (roy2016hmm).
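The online-to-offline direction, described above as the simple one, can be illustrated by rasterizing the pen trajectory onto a binary grid. The sketch below is not the conversion procedure used in this work; the function name, the one-pixel line thickness and the interpolation scheme are assumptions chosen for brevity. It scales the strokes to a height of 32 while keeping the aspect ratio.

```python
def rasterize(strokes, height=32):
    """Render pen strokes (lists of (x, y) points) as a binary image."""
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    min_x, min_y = min(xs), min(ys)
    span_y = (max(ys) - min_y) or 1.0
    scale = (height - 1) / span_y                   # same scale on both axes
    width = int(round((max(xs) - min_x) * scale)) + 1
    image = [[0] * width for _ in range(height)]
    for stroke in strokes:
        if len(stroke) == 1:                         # isolated dot
            x, y = stroke[0]
            image[int(round((y - min_y) * scale))][int(round((x - min_x) * scale))] = 1
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            c0, r0 = (x0 - min_x) * scale, (y0 - min_y) * scale
            c1, r1 = (x1 - min_x) * scale, (y1 - min_y) * scale
            # Sample densely enough that consecutive samples are under a pixel apart.
            steps = int(max(abs(c1 - c0), abs(r1 - r0))) + 1
            for i in range(steps + 1):
                a = i / steps
                row = int(round(r0 + a * (r1 - r0)))
                col = int(round(c0 + a * (c1 - c0)))
                image[row][col] = 1
    return image

# Toy input: one diagonal stroke from (0, 0) to (10, 10).
image = rasterize([[(0.0, 0.0), (10.0, 10.0)]])
```

The reverse direction has no such direct construction, because the pen order and direction are not stored in the pixels and must be estimated.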
5 Conclusions and Future Work
In this paper, we have proposed a new method for script identification which provides satisfactory results. Handwritten text available in either modality, online or offline, is taken as input, and the absent modality is recreated using inter-modality conversion. Both modalities are then fed as a pair into a deep neural network. The network uses the information from both modalities and employs multimodal fusion to combine the features adaptively. Two significant achievements are obtained. First, we design a single model whose training covers both modalities of data. Second, the features of the two modalities are combined to produce more accurate results.
As evident from the results, our method performs better than almost every other state-of-the-art method for handwritten script identification. Its drawbacks include the incomplete conversion of one modality to the other in some cases, but the conditions under which this occurs are rare. Our future work will include fine-tuning the proposed deep network architecture and including more Indic scripts to increase the scope of application of our method. One of the most promising future research directions is to design a single deep neural network for both offline and online handwriting recognition by exploring information from both modalities.
- (1) P. Sahare, S. B. Dhok, Script identification algorithms: a survey, International Journal of Multimedia Information Retrieval 6 (3) (2017) 211–232.
- (2) P. K. Singh, R. Sarkar, M. Nasipuri, Offline script identification from multilingual indic-script documents: a state-of-the-art, Computer Science Review 15 (2015) 1–28.
- (3) P. K. Singh, S. Das, R. Sarkar, M. Nasipuri, Handwritten mixed-script recognition system: A comprehensive approach, in: Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications, Springer, 2017, pp. 787–795.
- (4) D. Ghosh, T. Dube, A. Shivaprasad, Script recognition—a review, IEEE Transactions on pattern analysis and machine intelligence 32 (12) (2010) 2142–2161.
- (5) K. Ubul, G. Tursun, A. Aysa, D. Impedovo, G. Pirlo, T. Yibulayin, Script identification of multi-script documents: A survey, IEEE Access 5 (2017) 6546–6559.
- (6) S. Chanda, U. Pal, O. R. Terrades, Word-wise thai and roman script identification, ACM Transactions on Asian Language Information Processing (TALIP) 8 (3) (2009) 11.
- (7) G. Rajput, H. Anita, Handwritten script recognition using dct and wavelet features at block level, IJCA, Special issue on RTIPPR (3) (2010) 158–163.
- (8) A. M. Namboodiri, A. K. Jain, Online handwritten script recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (1) (2004) 124–130.
- (9) M. Hamdani, H. El Abed, M. Kherallah, A. M. Alimi, Combining multiple hmms using on-line and off-line features for off-line arabic handwriting recognition, in: Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on, IEEE, 2009, pp. 201–205.
- (10) L. Gomez, A. Nicolaou, D. Karatzas, Improving patch-based scene text script identification with ensembles of conjoined networks, Pattern Recognition 67 (2017) 85–96.
- (11) X. Xu, Y. Li, G. Wu, J. Luo, Multi-modal deep feature learning for rgb-d object detection, Pattern Recognition 72 (2017) 300–313.
- (12) J. Han, H. Chen, N. Liu, C. Yan, X. Li, Cnns-based rgb-d saliency detection via cross-view transfer and multiview fusion, IEEE Transactions on Cybernetics.
- (13) A. Wang, J. Cai, J. Lu, T.-J. Cham, Mmss: Multi-modal sharable and specific feature learning for rgb-d object recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1125–1133.
- (14) U. Asif, M. Bennamoun, F. Sohel, A multi-modal, discriminative and spatially invariant cnn for rgb-d object labeling, IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (15) P. Hiremath, J. D. Pujari, S. Shivashankar, V. Mouneswara, Script identification in a handwritten document image using texture features, in: Advance Computing Conference (IACC), 2010 IEEE 2nd International, IEEE, 2010, pp. 110–114.
- (16) U. Pal, N. Sharma, T. Wakabayashi, F. Kimura, Handwritten numeral recognition of six popular indian scripts, in: Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, Vol. 2, IEEE, 2007, pp. 749–753.
- (17) I. Moalla, A. Elbaati, A. Alimi, A. Benhamadou, Extraction of arabic text from multilingual documents, in: Systems, Man and Cybernetics, 2002 IEEE International Conference on, Vol. 4, IEEE, 2002, pp. 5–pp.
- (18) I. Moalla, A. M. Alimi, A. Benhamadou, Extraction of arabic words from multilingual documents, in: Proc. Of Artificial Intelligence and Soft Computing Conference (ASC2004), 2004.
- (19) M. A. Ferrer, A. Morales, N. Rodríguez, U. Pal, Multiple training-one test methodology for handwritten word-script identification, in: Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, IEEE, 2014, pp. 754–759.
- (20) B. Dhandra, M. Hangarge, Morphological reconstruction for word level script identification, International Journal of Computer Science and Security (IJCSS), Computer Science Journal Press 1 (1) (2007) 41–51.
- (21) M. Hangarge, B. Dhandra, Offline handwritten script identification in document images, Int. J. Comput. Appl 4 (6) (2010) 6–10.
- (22) U. Pal, R. Jayadevan, N. Sharma, Handwriting recognition in indian regional scripts: a survey of offline techniques, ACM Transactions on Asian Language Information Processing (TALIP) 11 (1) (2012) 1.
- (23) K. Roy, S. Vajda, U. Pal, B. B. Chaudhuri, A. Belaïd, A system for indian postal automation, in: Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, IEEE, 2005, pp. 1060–1064.
- (24) K. Roy, U. Pal, B. Chaudhuri, Neural network based word-wise handwritten script identification system for indian postal automation, in: Intelligent Sensing and Information Processing, 2005. Proceedings of 2005 International Conference on, IEEE, 2005, pp. 240–245.
- (25) N. Sankaran, C. Jawahar, Recognition of printed devanagari text using blstm neural network, in: Pattern Recognition (ICPR), 2012 21st International Conference on, IEEE, 2012, pp. 322–325.
- (26) A. Ul-Hasan, M. Z. Afzal, F. Shafait, M. Liwicki, T. M. Breuel, A sequence learning approach for multiple script identification, in: Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, IEEE, 2015, pp. 1046–1050.
- (27) P. K. Singh, R. Sarkar, M. Nasipuri, D. Doermann, Word-level script identification for handwritten indic scripts, in: Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, IEEE, 2015, pp. 1106–1110.
- (28) S. M. Obaidullah, C. Halder, N. Das, K. Roy, Numeral script identification from handwritten document images, Procedia Computer Science 54 (2015) 585–594.
- (29) B. Shi, X. Bai, C. Yao, Script identification in the wild via discriminative convolutional neural network, Pattern Recognition 52 (2016) 448–458.
- (30) J. Mei, L. Dai, B. Shi, X. Bai, Scene text script identification with convolutional recurrent neural networks, in: Pattern Recognition (ICPR), 2016 23rd International Conference on, IEEE, 2016, pp. 4053–4058.
- (31) D. S. Doermann, A. Rosenfeld, Recovery of temporal information from static images of handwriting, International Journal of Computer Vision 15 (1-2) (1995) 143–164.
- (32) G. Boccignone, A. Chianese, L. P. Cordella, A. Marcelli, Recovering dynamic information from static handwriting, Pattern recognition 26 (3) (1993) 409–418.
- (33) A. Elbaati, M. Kherallah, A. Ennaji, A. M. Alimi, Temporal order recovery of the scanned handwriting, in: Document Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on, IEEE, 2009, pp. 1116–1120.
- (34) Y. Kato, M. Yasuhara, Recovery of drawing order from single-stroke handwriting images, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (9) (2000) 938–949.
- (35) J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Y. Ng, Multimodal deep learning, in: Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 689–696.
- (36) N. Srivastava, R. R. Salakhutdinov, Multimodal learning with deep boltzmann machines, in: Advances in neural information processing systems, 2012, pp. 2222–2230.
- (37) A. Karpathy, L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
- (38) I. Ilievski, J. Feng, Multimodal learning and reasoning for visual question answering, in: Advances in Neural Information Processing Systems, 2017, pp. 551–562.
- (39) H. Zhu, J.-B. Weibel, S. Lu, Discriminative multi-modal feature fusion for rgbd indoor scene recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2969–2976.
- (40) E.-M. Nel, J. A. Du Preez, B. M. Herbst, Estimating the pen trajectories of static signatures using hidden markov models, IEEE transactions on pattern analysis and machine intelligence 27 (11) (2005) 1733–1746.
- (41) F. Engelmann, T. Kontogianni, A. Hermans, B. Leibe, Exploring spatial context for 3d semantic segmentation of point clouds, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 716–724.
- (42) M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint arXiv:1312.4400.
- (43) B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE transactions on pattern analysis and machine intelligence.
- (44) P. P. Roy, A. K. Bhunia, A. Das, P. Dey, U. Pal, Hmm-based indic handwritten word recognition using zone segmentation, Pattern Recognition 60 (2016) 1057–1075.
- (45) A. K. Bhunia, P. P. Roy, A. Mohta, U. Pal, Cross-language framework for word recognition and spotting of indic scripts, Pattern Recognition 79 (2018) 12 – 31.
- (46) T.-Y. Lin, A. RoyChowdhury, S. Maji, Bilinear cnn models for fine-grained visual recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457.
- (47) Y. Gao, O. Beijbom, N. Zhang, T. Darrell, Compact bilinear pooling, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 317–326.
- (48) S. Moore, R. Bowden, Local binary patterns for multi-view facial expression recognition, Computer Vision and Image Understanding 115 (4) (2011) 541–558.
- (49) S. Dey, A. Nicolaou, J. Llados, U. Pal, Local binary pattern for word spotting in handwritten historical document, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2016, pp. 574–583.
- (50) M. Fernández-Delgado, E. Cernadas, S. Barro, D. Amorim, Do we need hundreds of classifiers to solve real world classification problems, Journal of Machine Learning Research 15 (1) (2014) 3133–3181.
- (51) P. P. Roy, A. K. Bhunia, U. Pal, Hmm-based writer identification in music score documents without staff-line removal, Expert Systems with Applications 89 (2017) 222–240.
- (52) Y. Chherawala, P. P. Roy, M. Cheriet, Feature set evaluation for offline handwriting recognition systems: application to the recurrent neural network model, IEEE transactions on cybernetics 46 (12) (2016) 2825–2836.
- (53) S. Jaeger, S. Manke, J. Reichert, A. Waibel, Online handwriting recognition: the npen++ recognizer, International Journal on Document Analysis and Recognition 3 (3) (2001) 169–180.
- (54) A. Graves, M. Liwicki, H. Bunke, J. Schmidhuber, S. Fernández, Unconstrained on-line handwriting recognition with recurrent neural networks, in: Advances in Neural Information Processing Systems, 2008, pp. 577–584.