A Skip-connected Multi-column Network for Isolated Handwritten Bangla Character and Digit Recognition

04/27/2020 · Animesh Singh, et al. · Jalpaiguri Government Engineering College · The Ohio State University

Finding local invariant patterns in handwritten characters and/or digits for optical character recognition is a difficult task, and variations in writing style from one person to another make it more challenging. In this work, we propose a non-explicit feature extraction method using a multi-scale multi-column skip convolutional neural network. Local and global features extracted from different layers of the proposed architecture are combined to derive the final feature descriptor encoding a character or digit image. Our method is evaluated on four publicly available datasets of isolated handwritten Bangla characters and digits. An exhaustive comparative analysis against contemporary methods establishes the efficacy of our proposed approach.


1 Introduction

Optical character recognition (OCR) denotes the process of automatically recognizing characters from an optically scanned page of handwritten or printed text. Popular real-world applications of OCR include document information extraction Sarkhel and Nandi (2019b); Wu et al. (2018), question answering Andreas et al. (2016); Dong et al. (2015), document classification Sarkhel and Nandi (2019a); Das et al. (2018), etc. Recognition of handwritten characters is a difficult task, as finding the local invariant patterns within handwritten character or digit images is not trivial. The already difficult task of OCR becomes more challenging for stylistically varying handwritten text, as handwriting style varies from one person to another. In automated OCR systems, features describing the local invariant patterns are extracted and sophisticated classification models are employed to recognize a handwritten character or digit. Such feature-based OCR systems can be classified into two major categories:

explicit feature-based and non-explicit feature-based OCR systems. Syntactic or formal grammar-based features Feng and Pavlidis (1975), moment-based features Singh et al. (2016), graph-theoretic approaches Kahan et al. (1987), shadow-based features Basu et al. (2009), gradient-based features Roy et al. (2012b), etc. are some of the most popular examples of explicit feature-based OCR systems. In this approach, handcrafted features are explicitly designed and extracted from a character or digit image. Contrary to the explicit feature-based approaches, non-explicit feature (NeF) based methods do not rely on handcrafted features. As raw character or digit images are fed into the OCR system, a learned model automatically extracts features from each image. The model itself is learned over multiple iterations by minimizing the misclassification error. Artificial neural networks Le Cun and Bengio (1994); Basu et al. (2012a), deep convolutional neural network based approaches Pratt et al. (2019); Chowdhury et al. (2019); Ukil et al. (2019), recurrent neural networks Ren et al. (2019); Paul et al. (2019); Ahmed et al. (2019), Markov model based approaches Britto Jr et al. (2003); Bunke et al. (2004) and unsupervised sparse feature extraction methods Hasasneh et al. (2019) are some examples of non-explicit feature (NeF) based OCR systems. Statistically derived parameters play a significant role in these approaches.

Figure 1: A few samples of Bangla handwritten characters and digits

Based on spatial coverage, two types of features have traditionally been used in explicit and non-explicit feature based OCR systems: local features and global features. Local features refer to a pattern or distinct structure found in a small image patch. They are usually associated with properties of an image patch that differ from its immediate surroundings in texture, colour, or intensity. Examples of local features used in contemporary literature include blobs, corners, and edge pixels. Local features have been used in a wide variety of applications. A sampling-based approach Nowak et al. (2006), proposed by Nowak et al., extracts visual descriptors for classification tasks by sampling independent image patches. Haralick et al. proposed a textural-feature based approach Haralick et al. (1973) which uses a piecewise linear decision rule and a min-max decision rule for identification of image data. A multi-scale differential model Schmid and Mohr (1997), developed by Schmid et al., uses a voting algorithm with semi-local constraints for retrieving images from large image databases. A deep convolutional feature based approach Babenko and Lempitsky (2015), proposed by Babenko et al., extracts local deep features from images with the help of a deep convolutional neural network; these local features are then aggregated to produce compact global descriptors for image retrieval. Hiremath et al. proposed a content-based approach Hiremath and Pujari (2007) for image retrieval, which uses primitive local image descriptors such as colour moments, texture, and shape. A hierarchical agglomerative clustering based model Mikolajczyk et al. (2005), developed by Mikolajczyk et al., combines local features of different objects using an entropy distribution for recognizing multiple object classes. An optimization approach Jiang et al. (2007), proposed by Jiang et al., tries to find the optimal combination of detector, kernel, vocabulary size and weighting scheme for bag-of-features (BoF). These aforementioned works are some popular examples of the usage of local image features.

Global features, on the other hand, describe the entire image as a whole, representing the shapes and contours present in the image. Global features have also been used widely in the area of computer vision. Wang et al. proposed a combination technique Wang et al. (2009) which extracts global, regional and contextual features from images and accumulates these extracted features by estimating their joint probability for automatic image annotation. Shyu et al. developed a content-based image retrieval model Shyu et al. (1998) which uses low-level computer vision and image processing algorithms to extract features related to variations of gray scale, texture, shape, etc. Labit et al. proposed a compact motion representation based method for semantic image sequence coding Labit and Nicolas (1991) which uses global image features. These are some popular examples of the usage of global image features.

Both local and global image features play important roles in automated handwritten character and digit recognition systems. A region sampling based approach Das et al. (2012b); Sarkhel et al. (2015) to extract local features from the most discriminating regions of a character/digit image, an ensemble technique Das et al. (2012a) to combine quad-tree based longest-run features with statistical features using PCA Wold et al. (1987) and Modular PCA Sankaran and Asari (2004), artificial neural network based approaches Das et al. (2010); Basu et al. (2012b, a), deep convolutional neural network based approaches Benaddy et al. (2019); Roy et al. (2017), a multilayered Boltzmann perceptron network Rehman et al. (2019) to extract hard geometric features, a transfer learning based approach Chatterjee et al. (2019) which used a pre-trained ResNet He et al. (2016), a multi-column based approach Sarkhel et al. (2017) to extract multi-scale local features from pattern images for recognition of handwritten characters and digits, and a multi-layer capsule network based approach Mandal et al. (2019) are some examples of local feature extraction based handwritten character/digit recognition methods. Global image features have also been used for this purpose, such as convex hull based features Sarkhel et al. (2016); Das et al. (2014b), chain code histogram based features Bhattacharya et al. (2006), and an artificial neural network based approach Singh et al. (2014). In some of these aforementioned works a combination of local and global image features has been used, but in most of them the global features are extracted through an explicit feature extraction technique. Designing a non-explicit method that extracts both global and local features for recognition of handwritten characters or digits is the primary concern of our present work.

Figure 2: A traditional convolutional architecture consisting of several convolutional and max-pooling layers stacked on top of each other

Despite the progress on developing OCR systems for Indic scripts in the past decade, a commercially successful comprehensive system is yet to emerge. In this work, we have proposed an NeF based approach for recognizing handwritten Bangla digits and characters (shown in Fig. 1) by combining global and local features extracted using a deep convolutional neural network (CNN). We have developed a multi-column skip-connected convolutional neural network (MSCNN) architecture for this purpose. Global features extracted from the initial layers of the network are combined with local features from the final layers of this convolutional architecture. These features are learned by training the network over multiple iterations to minimize the misclassification error. We propose a novel fusion technique to combine the global and local features extracted from different layers of the architecture to generate the final feature descriptor representing a character or digit image. We have evaluated the proposed method on four publicly available benchmark datasets of handwritten Bangla characters and digits, and promising results have been achieved on all of them. We have also tested our system on the MNIST dataset and achieved a maximum accuracy of 99.65 %, without any augmentation of the original data. A comparative analysis has also been performed against some contemporary methods to establish the superiority of our proposed method. The rest of the paper is organized as follows: Section 2 introduces a brief overview of CNN based architectures, Section 3 presents a detailed description of the present work, the datasets used in the experimental setup and the experimental results are described in Section 4, a comparative study of different possible combination techniques is presented in Section 4.4, and finally, a brief conclusion is drawn from the results.

2 A brief overview of CNN

Convolutional neural networks (CNNs) Le Cun and Bengio (1994); Roy et al. (2017) are a class of feedforward networks generally used for recognizing patterns in an image. Like artificial neural networks (ANNs), CNNs are biologically inspired architectures. The hierarchical information processing style of the alternating layers of simple and complex cells of the visual cortex in the brain Serre et al. (2006) motivates the architecture of CNNs. In general, the architectures consist of several convolutional and pooling (or subsampling) layers stacked on top of each other, as shown in Fig. 2. The convolutional layers learn the features representing the structures of the input image and thus serve as feature extractors. Each neuron in a convolutional layer connects to neighbouring neurons of the previous layer via a set of learnable weights, also known as filter banks (LeCun et al. LeCun et al. (2012)). The neurons in the convolutional layers are arranged in a specific order to form feature maps. Patterns in input images are extracted with the learned weights in order to compute a new feature map (shown in Fig. 3), and this map is forward propagated through a non-linear activation function, which allows extracting non-linear features. Different parts of a feature map share the same weights to learn translation-invariant features, and different feature maps from the same convolutional layer contain different weights, which helps learn multiple patterns from every part of a pattern image. Pooling layers play a significant role in achieving spatial invariance to input distortions and translations. Initially, the most common pooling operation was average pooling Boureau et al. (2010), in which the average of all input values over a small region of an image is propagated to the next layer. However, more recent works (Tolias et al. Tolias et al. (2015); Giusti et al. Giusti et al. (2013); Nagi et al. Nagi et al. (2011); Scherer et al. Scherer et al. (2010); Murray et al. Murray and Perronnin (2014), etc.) generally use max pooling Nagi et al. (2011); Giusti et al. (2013), in which the maximum of all input values is propagated to the next layer. Hence, convolutional layers and pooling layers are the two main building blocks of CNN architectures: the local invariant patterns are collected by the convolutional layers and processed by the pooling layers to extract more powerful features.
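As an illustration of these two building blocks, the following minimal PyTorch sketch stacks a convolutional layer, a non-linear activation, and a max-pooling layer; the layer sizes are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

# A minimal sketch of the two building blocks discussed above: a convolutional
# layer (feature extractor), a non-linear activation, and a max-pooling layer
# (spatial down-sampling). The layer sizes are illustrative assumptions.
block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
    nn.ReLU(),                    # non-linearity applied to each feature map
    nn.MaxPool2d(kernel_size=2),  # the max of each 2x2 window is propagated
)

x = torch.randn(1, 1, 32, 32)     # a dummy single-channel pattern image
print(block(x).shape)             # torch.Size([1, 32, 16, 16])
```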

In the field of OCR, researchers Sarkhel et al. (2017) have found in their experiments that convolutional sampling at fixed scales often bounds the local invariant pattern searching capabilities of a CNN, while sampling at multiple scales allows a CNN to extract more robust and noise-invariant features from different patches of an image. In order to address this, the filter banks of the proposed architecture have variable sizes at different convolutional layers (as shown in Fig. 5). As the maximum information loss occurs due to the pooling layers, we have used smaller pooling windows (2×2) in our proposed system.

Figure 3: The convolutional operation performed by CNN layers
Figure 4: A multi-column based architecture used in our proposed system

Multi-column convolutional neural network (MCNN) based architectures (inspired by the microcolumns of neurons in the cerebral cortex) have produced major breakthroughs in various research areas such as handwritten text recognition Basu et al. (2012b); Singh et al. (2014); Cireşan et al. (2012); Sarkhel et al. (2017), traffic sign classification Ciresan et al. (2011), crowd counting Zhang et al. (2016); Sam et al. (2017), etc. Multiple columns of feedforward convolutional neural networks are used together, and the output feature maps from all the columns are combined. Different feature maps are generated for the same image patches at different columns because the weights of the filter banks in the convolutional layers vary from one column to another. When a pattern image is propagated through every column of such an architecture, multiple feature maps are extracted from each image patch at the convolutional layers of the respective columns. The feature maps generated at one convolutional layer of a given column differ from those of the corresponding (same-depth) convolutional layers of the other columns. Similarly, for other layers of different columns, several variations of feature maps are created; when these multi-variate feature maps are combined, a more robust abstraction of a pattern image is created, providing better prediction accuracy in an OCR system.

In a multi-column CNN based architecture, multi-scaling can be applied in two different ways: column-wise multi-scaling and level-wise multi-scaling. In column-wise multi-scaling, multiple scales of convolutional filters (i.e. variable filter sizes in the convolutional layers) are used within each column, and more robust and geometrically invariant patterns are extracted from different patches of images within a single column Sarkhel et al. (2017). On the other hand, in level-wise multi-scaling, convolutional layers of the same type (layers at the same depth) in different columns have different filter sizes (shown in Fig. 5); hence multi-variate patterns are sampled at the same levels of different columns. The patterns derived from these two types of multi-scaling, when accumulated, create shift, scale and distortion invariant feature maps that contain more complex and deeper geometrical information about character or digit images. A sketch of both schemes follows.
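The following PyTorch sketch contrasts the two schemes; the specific kernel sizes (3, 5, 7) and channel counts are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Column-wise multi-scaling: the kernel size varies across the layers of one
# column. The sizes (7, 5, 3) and channel counts are illustrative assumptions.
column = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=7, padding=3), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
)

# Level-wise multi-scaling: layers at the same depth of different columns use
# different kernel sizes, so the same patch is sampled at several scales.
level1 = nn.ModuleList(
    [nn.Conv2d(1, 32, kernel_size=k, padding=k // 2) for k in (3, 5, 7)]
)

x = torch.randn(1, 1, 32, 32)
maps = [conv(x) for conv in level1]  # three views of the same image patches
print([m.shape for m in maps])       # equal sizes, different receptive fields
```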

Figure 5: Multi-column based CNN architecture in which level-wise multi-scaling and column-wise multi-scaling are used

3 Present work

The primary contribution of our present work is to utilize the information contained in both global and local features of pattern images for handwritten character and digit recognition tasks. As mentioned before, global features define a geometrical abstraction of the entire image, while local features give a more detailed (up to pixel-level) description of an image. Hence, global features are more sensitive to clutter and occlusion, whereas local features are more robust in those cases Mikolajczyk et al. (2005). In our proposed system the global and local features are extracted in a non-explicit manner from different parts of our network. From our experiments we observed that the connection weights between the anterior layers of the network are more prone to learn global features, while the connection weights between the posterior layers are more likely to extract local features from pattern images. These global and local features, when gathered together, form a more robust and accurate description of pattern images and significantly improve recognition results over methods which use either global or local features alone. A graphical representation of how global and local features are fused in our present work is shown in Fig. 4.

As mentioned above, an NeF based approach using an MSCNN based architecture has been proposed in the present work for the recognition of isolated handwritten characters and digits of the popular Bangla script. The combination of features extracted at the final FC layer of every column of the multi-column architecture is used as the implicit feature descriptor of a pattern image. After learning the connection weights between the layers through training, the test image is forward propagated through the network and the final feature descriptor is extracted between the final fully-connected (FC) layer and the softmax classifier. Finally, the softmax classifier performs classification using the final extracted feature to predict the class label. A graphical abstract of the proposed architecture is shown in Fig. 7; a more detailed description is presented in Section 3.1 and Section 3.2. As mentioned in the previous section, the proposed system uses two different multi-scaling techniques, column-wise multi-scaling and level-wise multi-scaling, which enhance the global and local invariant pattern searching capabilities of the proposed architecture on character and digit images. Combining these multi-scaled local and global features improves the recognition performance of our proposed network on multiple intricate handwritten character and digit images.

3.1 Description of the proposed architecture

Details of the MSCNN-based architecture are presented in this section. The architecture contains three columns with subtle architectural differences among them. The architecture of each column has been configured empirically. Each column has three levels, and each level consists of a single convolutional layer or a stack of convolutional and pooling layers, as shown in Fig. 4. From each level of every column, the output feature map is forward propagated through a fully-connected (FC) layer to extract more useful and necessary features from the feature maps produced at the convolutional layers (shown in Fig. 7). Without these FC layers, if the feature maps from the convolutional layers were used directly, the network might learn redundant and confusing features while updating the connection weights. Hence, the FC layers play a significant role in eradicating some of those unnecessary features from the feature maps extracted at different levels of the network. The configurations of these local FC layers are also finalized empirically. The feature generation process is expressed in Eqs. (1)-(18) and the feature combination process in Eqs. (19)-(22). The proposed methodology combines multi-scaled global and local invariant features sampled at multiple layers of the network through a feature-concatenation method (shown in Fig. 6).

Figure 6: Feature concatenation method used in our present work to combine different types of features coming from different parts of the present architecture
Figure 7: A graphical abstract of the proposed architecture. The architecture uses global and local features extracted by a multi-column multi-scale convolutional neural network
For levels $l \in \{1,2,3\}$ and columns $k \in \{1,2,3\}$ (notation defined below), the feature generation process is

$C_{l,k} = \mathcal{C}_{l,k}(x_{l,k})$   (1)-(9)

$F_{l,k} = \mathcal{F}_{l,k}(C_{l,k})$   (10)-(18)

Some notation is used to describe the procedure of feature combination after each level of the proposed MSCNN based architecture. $\mathcal{C}_{l,k}$ represents the output function of the convolutional layer (along with its max-pooling layer, where present) at level $l$ and column $k$ of the proposed architecture, where $x_{l,k}$ denotes its input. $\mathcal{F}_{l,k}$ represents the output function of the local fully-connected layer at level $l$ and column $k$, whose input is $C_{l,k}$. $G_l$ represents the output function of the final fully-connected layer at level $l$, whose input $W_l$ is the concatenation of the features from all the columns of level $l$. $H(I)$ represents the output function of the fully-connected layer whose input is the original image $I$. $\Phi$ represents the output function of the final FC layer of the proposed architecture, whose input is the final concatenated feature from the different levels of the network. Feature concatenation is denoted by $\oplus$. An input image is forward propagated through every column, and different feature maps are generated at the different levels of the architecture. In the first column, $\mathcal{C}_{1,1}$, $\mathcal{C}_{2,1}$, $\mathcal{C}_{3,1}$ extract features from the input image, and the local FC layers $\mathcal{F}_{1,1}$, $\mathcal{F}_{2,1}$, $\mathcal{F}_{3,1}$ extract more useful features from the feature maps produced at those sampling layers. Similarly, $\mathcal{F}_{1,2}$, $\mathcal{F}_{2,2}$, $\mathcal{F}_{3,2}$ extract features from the feature maps sampled at $\mathcal{C}_{1,2}$, $\mathcal{C}_{2,2}$, $\mathcal{C}_{3,2}$ in the second column, and $\mathcal{F}_{1,3}$, $\mathcal{F}_{2,3}$, $\mathcal{F}_{3,3}$ extract features from the feature maps sampled at $\mathcal{C}_{1,3}$, $\mathcal{C}_{2,3}$, $\mathcal{C}_{3,3}$ in the third column. The output features of the local FC layers at the same level of different columns are combined by the proposed feature-concatenation technique (shown in Fig. 4): features from the first levels of all the columns (i.e. the outputs of $\mathcal{F}_{1,1}$, $\mathcal{F}_{1,2}$, $\mathcal{F}_{1,3}$) are combined, features from the second levels (the outputs of $\mathcal{F}_{2,1}$, $\mathcal{F}_{2,2}$, $\mathcal{F}_{2,3}$) are combined, and features from the third levels (the outputs of $\mathcal{F}_{3,1}$, $\mathcal{F}_{3,2}$, $\mathcal{F}_{3,3}$) are combined. These concatenated features at different levels are propagated through the $G_1$, $G_2$, $G_3$ layers, and finally the output features of the $G_1$, $G_2$, $G_3$ and $H$ layers are combined with the feature concatenation method to generate the final feature descriptor. With this feature combination method we are essentially assembling similar kinds of features, i.e. features coming from equal numbers of sampling (convolutional and pooling) layers. This methodology helps to accommodate several variations of patterns from different image patches in a single bank of feature descriptors. These feature banks created at different levels contain features sampled at multiple scales and hence provide a better geometrical abstraction of pattern images. At the first level the features are extracted from one convolutional layer; at the second level from two convolutional layers and one max-pooling layer; and at the final level from three convolutional layers and two max-pooling layers. The features from the same level do not represent exactly the same abstraction of the actual image, because different columns have different filter sizes in their respective convolutional layers. The combination of these similar kinds of features constitutes one of the significant contributions of the present work.

$W_1 = F_{1,1} \oplus F_{1,2} \oplus F_{1,3}$   (19)

$W_2 = F_{2,1} \oplus F_{2,2} \oplus F_{2,3}$   (20)

$W_3 = F_{3,1} \oplus F_{3,2} \oplus F_{3,3}$   (21)

$F_{final} = \Phi\big(G_1(W_1) \oplus G_2(W_2) \oplus G_3(W_3) \oplus H(I)\big)$   (22)
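As an illustration, the following PyTorch sketch mimics the fusion of Eqs. (19)-(22) with random stand-ins for the local FC outputs; all dimensions are illustrative assumptions.

```python
import torch

# A sketch of Eqs. (19)-(22) with random stand-ins for the local FC outputs
# F_{l,k}; all dimensions are illustrative assumptions. The G_l and Phi layers
# would be FC layers in the real network and are elided here.
F = {(l, k): torch.randn(1, 256) for l in range(1, 4) for k in range(1, 4)}

W = [torch.cat([F[(l, 1)], F[(l, 2)], F[(l, 3)]], dim=1)  # Eqs. (19)-(21)
     for l in range(1, 4)]

H = torch.randn(1, 512)            # H(I): FC features of the raw image
final = torch.cat(W + [H], dim=1)  # input of the final FC layer, Eq. (22)
print(final.shape)                 # torch.Size([1, 2816]) in this sketch
```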

3.2 Architecture of each column

The configuration of every column is given in this section. The first column has the following architecture: 32C2-2P2-BN-RELU-64C1-BN-RELU-256C1-2P2-BN-RELU. In this representation, XCY denotes a convolutional layer with a total of X kernels and a stride of Y pixels, MPN denotes a max-pooling layer with an M×M pooling window and a stride of N pixels, BN denotes Batch Normalization Ioffe and Szegedy (2015), and RELU denotes a Rectified Linear Unit activation layer Maas et al. (2013); Dahl et al. (2013). Every level of the first column has its own FC layer: after the first level a 1024FC-BN-RELU layer is used, and after the second and third levels a 2048FC-BN-RELU layer and a 1024FC-BN-RELU layer are used respectively. In this representation, NFC denotes a fully-connected layer with N output neurons. The second column has the following architecture: 32C1-2P2-BN-RELU-64C1-BN-RELU-256C2-2P2-BN-RELU. In the second column, after the first, second and third levels a 3584FC-BN-RELU layer, a 5120FC-BN-RELU layer and a 2048FC-BN-RELU layer are used respectively. The third column has the following architecture: 32C1-2P2-BN-RELU-64C1-BN-RELU-256C1-2P2-BN-RELU. In the third column, after the first, second and third levels a 2560FC-BN-RELU layer, an 8192FC-BN-RELU layer and another 8192FC-BN-RELU layer are used respectively. A graphical representation of the architecture is shown in Fig. 7.
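The XCY strings above can be read off into a PyTorch module; the sketch below does so for the first column. XCY gives only the kernel count X and stride Y, so kernel_size=3 and the padding values are assumptions made only to keep the sketch well-defined, and the local FC tap after each level is omitted.

```python
import torch.nn as nn

# The first column's string, 32C2-2P2-BN-RELU-64C1-BN-RELU-256C1-2P2-BN-RELU,
# read into a PyTorch module. Kernel size and padding are assumptions.
first_column = nn.Sequential(
    # Level 1: 32C2-2P2-BN-RELU
    nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    # Level 2: 64C1-BN-RELU
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    # Level 3: 256C1-2P2-BN-RELU
    nn.Conv2d(64, 256, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.BatchNorm2d(256),
    nn.ReLU(),
)
```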

From the first level of all the columns, the features extracted by the local FC layers are combined, and the concatenated features are forward propagated through a 3584FC-BN-RELU layer. Similarly, the combined features from the second and third levels are forward propagated through an 8192FC-BN-RELU layer and a 5120FC-BN-RELU layer respectively, and the actual image is also passed directly through a 512FC-BN-RELU layer. The output features from these four outer fully-connected layers are again combined through the feature-concatenation method (shown in Fig. 6), and finally the resultant feature vector is propagated through a 2048FC-BN-RELU-softmax layer. An RBF kernel based multi-class SVM (one-vs-all) classifier Niu and Suen (2012) is used in place of the softmax classifier in the present architecture, and the parameters of the kernel are tuned empirically. As mentioned before, three different convolutional kernel dimensions (as in Fig. 5) are used in our present work for multi-scale feature extraction.
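A minimal sketch of this classification stage follows, assuming scikit-learn's SVC; the random arrays stand in for real feature descriptors, and the C value is a placeholder since the paper tunes the kernel parameters empirically.

```python
import numpy as np
from sklearn.svm import SVC

# An RBF-kernel one-vs-rest SVM trained on the final feature descriptors
# produced by the network. Random arrays stand in for real features.
rng = np.random.default_rng(0)
train_features = rng.normal(size=(100, 2048))  # stand-in for final FC features
train_labels = rng.integers(0, 10, size=100)

clf = SVC(kernel="rbf", C=10.0, gamma="scale", decision_function_shape="ovr")
clf.fit(train_features, train_labels)
predictions = clf.predict(train_features[:5])  # predicted class labels
```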

3.3 Training the proposed architecture

The proposed architecture is trained with an end-to-end training method in which all three columns are trained simultaneously. The training images are passed through all the columns simultaneously and the loss is calculated at the final FC layer. The connection weights between the layers of the whole architecture are updated in a single pass of backpropagation after every epoch using Eqs. (23) and (24).

$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$   (23)

where $E[g^2]_t$ is the moving average of squared gradients, $g_t$ is the gradient of the cost function with respect to the weight, $\eta$ is the learning rate, and $\gamma$ is the moving average parameter.

$w_{t+1} = w_t - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$   (24)

where $w_t$ is the value of the weight parameter at iteration $t$ and $\epsilon$ is a small smoothing constant.

Parameter name Parameter value
Learning algorithm RMSProp
Initial learning rate 0.001
Learning decay rate 0.993
Dropout 0.5
Batch size 500
Total number of training epochs 500
Table 1: Parameters of learning algorithm used in MSCNN based architecture.

RMSProp Ruder (2016); Basu et al. (2018), an adaptive learning algorithm, is used to train our proposed system. All three columns are trained simultaneously against the same cost calculated at the final FC layer. Cross-entropy loss Farahnak-Ghazani and Baghshah (2016) is used as the loss function during training (shown in Eq. (25)). The proposed architecture is trained for 500 epochs with a variable learning rate Cireşan et al. (2012), in which the learning rate is decreased by a factor of 0.993 per epoch until it reaches the value of 0.00003. As suggested by LeCun et al. LeCun et al. (2012), the dataset is randomly shuffled before each epoch of RMSProp based training; shuffling introduces heterogeneity in the datasets and increases the convergence rate of the learning algorithm. The proposed architecture is trained until the error rate converges or the total number of epochs reaches the aforementioned maximum. Dropout regularization Srivastava et al. (2014); Wager et al. (2013) is used (only in the FC layers, except the final FC layer) to reduce the chance of overfitting while training. The parameters of the learning algorithm used to train the proposed architecture are presented in Table 1. After training is over, an SVM based classifier with a Gaussian kernel is used to perform the classification task: it takes the output feature vector from the final FC layer of the proposed architecture as input and, after its own training is complete, predicts the class label of each test sample.

$L = -\sum_{c=1}^{M} y_c \log(\hat{y}_c)$   (25)

where $y_c$ indicates the actual class, $\hat{y}_c$ is the predicted probability of class $c$, and $M$ is the total number of classes.
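The training setup above can be sketched in PyTorch as follows; the toy model and random tensors stand in for the proposed architecture and the real datasets, while the optimizer, loss, decay factor, floor, batch size and per-epoch shuffling follow the text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# RMSProp with an initial learning rate of 0.001, decayed by 0.993 per epoch
# down to a floor of 0.00003; cross-entropy loss; dropout in an FC layer;
# batch size 500; data reshuffled before every epoch (shuffle=True).
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 256), nn.ReLU(),
                      nn.Dropout(0.5), nn.Linear(256, 10))
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.993)
criterion = nn.CrossEntropyLoss()

dataset = TensorDataset(torch.randn(1000, 1, 32, 32),
                        torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=500, shuffle=True)

for epoch in range(500):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()          # one backward pass updates the whole network
        optimizer.step()
    if scheduler.get_last_lr()[0] > 0.00003:  # decay the rate until the floor
        scheduler.step()
```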

Because we create a validation set randomly from the training images of every dataset used in our experiments, the size of the training set is reduced, and consequently the final training set becomes insufficient to train the whole architecture. To overcome this problem, we first divide the initial training set into a train set and a validation set as before. The proposed architecture is trained on the train set and saved against the best accuracy achieved on the validation set, and the epoch number of the best validation accuracy is noted. The network is then retrained on the initial training set (before the validation split) until the noted epoch number is reached, and this network is saved. Finally, the architecture is tested against the test set as previously described.
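A runnable sketch of this two-phase protocol is given below; the toy model, random data, epoch counts and helper functions are all stand-ins for illustration, not the paper's actual training code.

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def run_epoch(model, loader, opt, crit):
    for x, y in loader:                 # one full pass over the data
        opt.zero_grad()
        crit(model(x), y).backward()
        opt.step()

def accuracy(model, loader):
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(1) == y).sum().item()
            total += len(y)
    return correct / total

def make_model():
    return nn.Sequential(nn.Flatten(), nn.Linear(64, 10))

def make_data(n):
    return TensorDataset(torch.randn(n, 1, 8, 8), torch.randint(0, 10, (n,)))

train_set, val_set, test_set = make_data(200), make_data(50), make_data(50)
crit = nn.CrossEntropyLoss()

# Phase 1: train on the reduced train set, noting the best-validation epoch.
model = make_model()
opt = torch.optim.RMSprop(model.parameters(), lr=1e-3)
best_epoch, best_acc = 0, 0.0
for epoch in range(20):
    run_epoch(model, DataLoader(train_set, batch_size=50, shuffle=True),
              opt, crit)
    acc = accuracy(model, DataLoader(val_set, batch_size=50))
    if acc > best_acc:
        best_acc, best_epoch = acc, epoch

# Phase 2: retrain a fresh model on the full training set (train + validation)
# for best_epoch + 1 epochs, then evaluate once on the held-out test set.
model = make_model()
opt = torch.optim.RMSprop(model.parameters(), lr=1e-3)
full = DataLoader(ConcatDataset([train_set, val_set]), batch_size=50,
                  shuffle=True)
for epoch in range(best_epoch + 1):
    run_epoch(model, full, opt, crit)
print("test accuracy:", accuracy(model, DataLoader(test_set, batch_size=50)))
```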

Index Name of the dataset Dataset type Number of training samples Number of test samples Reference
D1 CMATERdb 3.1.1 Bangla digits 4000 2000  Das et al. (2012a)
D2 ISIBanglaDigit ISI numerals 23,500 4000  Bhattacharya and Chaudhuri (2009)
D3 CMATERdb 3.1.2 Bangla basic characters 12,000 3000  Sarkar et al. (2012)
D4 CMATERdb 3.1.3 Bangla compound characters 34,229 8468  Das et al. (2014a)
Table 2: Datasets used in the present work.

3.4 Extraction of multi-scale features

Once the network is trained and the connection weights are learnt, the test image is propagated through each column of the architecture separately. As the columns are independent of each other, they extract features independently in the corresponding convolutional and pooling layers. As mentioned earlier, the feature maps sampled at every level of the different columns are combined, and more useful features are extracted through an FC layer (as in Fig. 4). After feature extraction through every column, the features are combined to produce the resultant feature vector, which is then forward propagated through a final FC layer to extract the final feature descriptor. A softmax classifier is trained with the finally extracted feature (as in Fig. 7) and, after training, the classifier is used to predict the class label of the test image. Hence, the final concatenated global and local features are used as the final feature descriptor of the original input image in this architecture. As mentioned above, a multi-class SVM classifier is used for classification instead of an MLP based classifier because SVMs perform better than MLPs in handwritten digit and character recognition tasks Das et al. (2010).

Every convolutional and max-pooling layer of the proposed architecture decreases the feature map size relative to its previous layer. Generally, most of the information loss occurs due to the max-pooling layers; for this reason, it is imperative to use smaller kernels (2×2) in the pooling layers. As the convolutional layers are less problematic than the max-pooling layers, larger kernels in the convolutional layers are admissible.

In order to extract both global and local features at multiple scales, different kernel sizes are used in the convolutional layers at every level of the different columns (as in Fig. 5). This allows multi-scale feature extraction within the same level. Also, for a given column, consecutive convolutional layers have kernel sizes different from each other, which allows multi-scale feature extraction within each column Sarkhel et al. (2017). The combination of these multi-scale features represents a better abstraction of the input image than features from fixed scales. These multi-scaled global and local features, extracted through the two types of multi-scaling (level-wise and column-wise), are used to recognize stylistically varying Bangla handwritten characters and digits.

4 Experiments

As mentioned before, a non-explicit global and local feature extraction technique has been proposed in the present work, using an MSCNN based architecture for recognition of handwritten Bangla characters and digits. PyTorch, a Python based library, is used to implement and train the proposed MSCNN based architecture, and basic image processing operations are performed using MATLAB. All of the experiments are performed on systems with Intel dual-core i5 processors, 8 GB RAM and an NVIDIA GeForce 1050 Ti graphics card with 12 GB internal memory.

4.1 Datasets

The proposed architecture is tested on four publicly available datasets of isolated handwritten Bangla characters and digits. Details of the datasets used in our current experiments are shown in Table 2. These intricate datasets, each different from the others, comprise an ideal test suite for the present work. More details of these datasets can be found in the references cited in the rightmost column of the table.

Name of the dataset Dataset type Number of classes Random validation set (%) 5-Fold cross-validation (%) 10-Fold cross-validation (%)
CMATERdb 3.1.1 Bangla digits 10 98.15 % 98.18 % 98.27 %
ISIBanglaDigit ISI numerals 10 99.36 % 99.34 % 99.38 %
CMATERdb 3.1.2 Bangla basic characters 50 96.65 % 96.56 % 96.7 %
CMATERdb 3.1.3 Bangla compound characters 171 93.48 % 93.5 % 93.53 %
Table 3: Recognition accuracy achieved by our proposed architecture on different datasets.

4.2 Pre-processing of datasets

Images of each dataset are passed through a few pre-processing steps. Every isolated handwritten character or digit image is binarized, centre-cropped by the tightest bounding box Sarkhel et al. (2017), resized to a fixed dimension and normalized with a mean of 0.5 and a standard deviation of 0.5. Noise in the images is removed by a combination of median and Gaussian filters.

After the pre-processing step, the training dataset is randomly divided into a training set and a validation set such that the size of the validation set matches the size of the test set. The architecture is then trained on the training set and saved against the best recognition accuracy achieved on the validation set. After the network is trained, it is evaluated on the test set.
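A sketch of the pre-processing pipeline follows, written with OpenCV (the paper performs these operations in MATLAB); the 32×32 target size is an assumption, since the exact resize dimension is not restated here.

```python
import cv2
import numpy as np

def preprocess(gray: np.ndarray) -> np.ndarray:
    """gray: a uint8 grayscale character/digit image."""
    gray = cv2.medianBlur(gray, 3)                 # suppress impulse noise
    gray = cv2.GaussianBlur(gray, (3, 3), 0)       # smooth residual noise
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    ys, xs = np.nonzero(binary)                    # tightest bounding box
    binary = binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    resized = cv2.resize(binary, (32, 32))         # assumed target size
    return (resized / 255.0 - 0.5) / 0.5           # mean 0.5, std 0.5
```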

4.3 Experimental results

The experimental results on every dataset are presented in this section. As mentioned earlier, the proposed methodology is tested on four publicly available benchmark datasets of isolated handwritten Bangla characters and digits: two contain Bangla digits, one contains Bangla basic characters and the last contains Bangla compound characters. Details of these datasets are given in Table 2. The proposed MSCNN based architecture is applied to each of these four datasets 15 times, and the SVM based classifier performs the classification each time based on the features extracted by the CNN based architecture. We also present the recognition accuracies achieved by our proposed architecture using 5-fold and 10-fold cross-validation on the training set. The best classification result achieved among all the trials for every dataset is listed in Table 3. From our experimental observations we found that the results achieved by the present architecture are better than those achieved by an architecture which uses only the local features of the input images; for comparison, the results achieved using only local features are given in Table 4.

4.4 Comparative analysis

In this section we describe various comparisons with our proposed architecture. As mentioned before, the proposed network uses a feature fusion technique to combine global and local features sampled at each convolutional layer of the network. Other feature combination methods are possible besides our proposed fusion technique; we present some of them in Section 4.4.1 and carry out a comparative study to establish the superiority of our proposed combination technique. Also, to prove the efficacy of using global and local features for recognizing handwritten characters and digits, we compare our proposed architecture with some of the most popular contemporary works in Section 4.4.2.

Name of the dataset Dataset type Proposed architecture (%) Using local features only (%)
CMATERdb 3.1.1 Bangla digits 98.15 % 95.75 %
ISIBanglaDigit ISI numerals 99.36 % 98.68 %
CMATERdb 3.1.2 Bangla basic characters 96.65 % 94.45 %
CMATERdb 3.1.3 Bangla compound characters 93.48 % 90.56 %
Table 4: Comparison between recognition accuracies achieved by the architecture which use only local features and the proposed architecture.
Name of the dataset Dataset type First baseline (%) Second baseline (%) Third baseline (%) Proposed architecture (%)
CMATERdb 3.1.1 Bangla digits 97.15 % 97.76 % 97.52 % 98.15 %
ISIBanglaDigit ISI numerals 99.12 % 99.18 % 99.14 % 99.36 %
CMATERdb 3.1.2 Bangla basic characters 95.92 % 96.43 % 95.96% 96.65 %
CMATERdb 3.1.3 Bangla compound characters 91.15 % 92.34 % 91.46 % 93.48 %
Table 5: Comparison between the recognition accuracies achieved by the proposed architecture and different feature combination techniques.
Dataset type Work reference Recognition accuracy (%)
Bangla digits Das et al. Das et al. (2012b) 97.80 %
Basu et al. Basu et al. (2012a) 96.67 %
Roy et al. Roy et al. (2012b) 95.08 %
Roy et al. Roy et al. (2014) 97.45 %
Lecun et al. Le Cun and Bengio (1994) 97.31 %
The Present Work 98.15 %
ISI numerals Sharif et al. Sharif et al. (2016) 99.02 %
Wen et al. Wen and He (2012) 96.91 %
Das et al. Das et al. (2012b) 97.70 %
Akhand et al. Rahman et al. (2015) 97.93 %
CNNAP Akhand et al. (2016) 98.98 %
Roy et al. Roy et al. (2012b) 93.338 %
Rajashekararadhya et al. Rajashekararadhya and Ranjan (2009) 94.20 %
The Present Work 99.36 %
Bangla basic characters Roy et al. Roy et al. (2012a) 86.40 %
Das et al. Das et al. (2010) 80.50 %
Basu et al. Basu et al. (2009) 80.58 %
Sarkhel et al. Sarkhel et al. (2015) 86.53 %
Bhattacharya et al. Bhattacharya et al. (2006) 92.15 %
Lecun et al. Le Cun and Bengio (1994) 92.88 %
The Present Work 96.65 %
Bangla compound characters Das et al. Das et al. (2010) 75.05 %
Das et al. Das et al. (2015) 87.50 %
Sarkhel et al. Sarkhel et al. (2015) 78.38 %
Sarkhel et al. Sarkhel et al. (2016) 86.64 %
Pal et al. Pal and Pawar (2015) 93.12 %
Lecun et al. Le Cun and Bengio (1994) 86.85 %
Roy et al. Roy et al. (2017) 90.33 %
The Present Work 93.48 %
Table 6: A comparative analysis of the proposed methodology with some of the popular contemporaries.

4.4.1 Comparing Different Column Combination Techniques

The proposed system has been evaluated on four datasets and significantly improved results have been achieved. In the proposed multi-column CNN based architecture, various techniques for combining global and local features are possible. Some of these combination techniques are presented in this section, along with a comparative study of the recognition accuracy achieved with each.

First baseline:

The proposed multi-column CNN architecture has three columns, as in Fig. 4, and from every convolutional layer features are extracted through FC layers. These extracted features are combined to generate the final feature descriptor, which is then used for recognizing handwritten characters, as described in Section 3. Instead of the proposed level-wise feature combination technique (shown in Fig. 7), the features can be combined by concatenating all the feature vectors sampled at every local FC layer and forward propagating the result through a final FC layer with 2048 output neurons to generate the final output feature vector. The recognition accuracy achieved by the first baseline architecture on all four datasets is listed in Table 5.

Second baseline:

Feature vectors extracted at all the local FC layers of the proposed multi-column architecture (as shown in Fig. 7) can also be combined with a column-wise feature combination technique, i.e. the features from every local FC layer of a given column are concatenated. The features sampled at the local FC layers of the first column are concatenated and passed through a 2048FC layer (an FC layer with 2048 output neurons); similarly, the concatenated features from the second and third columns are forward propagated through 5120FC and 8192FC layers respectively. Finally, the features from these columns are combined with the input features extracted by a 512FC layer, and the final concatenated feature is passed through a final 2048FC layer. The recognition accuracy achieved by the second baseline architecture on all four datasets is given in Table 5.

Third baseline:

As mentioned earlier, the proposed architecture consists of three columns (as in Fig. 4), and features from every local FC layer are concatenated to generate the final feature descriptor. In the third combination technique, the feature maps from the convolutional layers are combined with the feature-concatenation technique (shown in Fig. 6): an extra skip connection is introduced between the convolutional layers of each column of our proposed architecture, and the output feature maps of the first level are combined with the output feature maps of the second level of every column to form the input feature maps of the third level. The recognition accuracy achieved by the third baseline architecture is given in Table 5. The graphical abstracts of the architectures described in the above baselines are illustrated in detail in the accompanying supplementary files.
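A minimal sketch of the third baseline's skip connection follows; the channel counts, kernel sizes and the use of concatenation are assumptions, and the pooling layers are omitted so that the spatial sizes of the two levels match.

```python
import torch
import torch.nn as nn

# The level-1 output feature maps are concatenated with the level-2 outputs
# to form the level-3 input, per the third baseline described above.
class SkipColumn(nn.Module):
    def __init__(self):
        super().__init__()
        self.level1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.level2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.level3 = nn.Sequential(nn.Conv2d(32 + 64, 128, 3, padding=1),
                                    nn.ReLU())

    def forward(self, x):
        f1 = self.level1(x)
        f2 = self.level2(f1)
        return self.level3(torch.cat([f1, f2], dim=1))  # the skip connection

print(SkipColumn()(torch.randn(1, 1, 32, 32)).shape)    # [1, 128, 32, 32]
```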

Transformation Bangla digits ISI numerals Bangla basic characters Bangla compound characters
None 98.15 % 99.36 % 96.65 % 93.48 %
ColourJitter 98.70 % 99.38 % 96.85 % 93.12 %
Random Horizontal Flip 98.10 % 99.32 % 96.70 % 92.88 %
Random Vertical Flip 98.54 % 99.22 % 96.20 % 92.82 %
Random Crop 98.60 % 99.40 % 95.20 % 93.55 %
Rotation 98.52 % 99.34 % 96.63 % 94.30 %
Random Affine Transformation 98.20 % 99.17 % 94.45 % 92.12 %
Table 7: Best performance on the test set using different data augmentation techniques
Dataset type Simultaneous Training Separate Training
Bangla digit 98.15 % 97.55 %
ISI numerals 99.36 % 98.25 %
Bangla basic character 96.65 % 95.88 %
Bangla compound character 93.48 % 91.75 %
Table 8: Comparison with recognition accuracy between different training methods.

4.4.2 Comparisons against contemporary works

Significantly improved results have been achieved by the proposed system on all of the datasets used in our experimental setup. To prove the superiority of the proposed method, its performance is compared with some popular contemporary works. The best recognition accuracy achieved by a contemporary system is shown in boldface. The comparative analysis is given in Table 6.

4.5 Effects of different pre-processing and training techniques

In this section we present some popular data augmentation techniques for improving the performance of the proposed architecture and show a comparison of the recognition accuracies achieved with the different augmentations. This section also presents different techniques for training a multi-column CNN based architecture.

4.5.1 Data augmentation

To improve performance, each input is sometimes stochastically transformed while training a CNN-based architecture [9, 12]. We used 6 different types of transformations (e.g. crop, rotation, flip, etc.) in our current experimental setup. Each transformation is applied separately while training our proposed architecture, and the resulting recognition accuracy is presented in Table 7. The best recognition accuracy on each dataset is shown in boldface.

With colour jitter we randomly change the brightness, contrast and saturation of input images with a factor of 0.05. With random horizontal flip we randomly flip input images horizontally with probability 0.5; similarly, with random vertical flip we randomly flip input images vertically with probability 0.5. With random crop we crop the images at random locations. With rotation the training images are rotated by a fixed angle. With random affine transformation we apply an affine transformation randomly within a fixed angular range to each image, keeping the center invariant.
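These six augmentations can be written as torchvision transforms, as in the sketch below; the jitter factor (0.05) and flip probabilities (0.5) follow the text, while the crop size, rotation angle and affine range are assumptions, since the exact values are not restated here.

```python
from torchvision import transforms

# RandomAffine rotates about the image centre by default, matching the
# centre-invariant affine transformation described in the text.
augmentations = {
    "colour_jitter": transforms.ColorJitter(brightness=0.05, contrast=0.05,
                                            saturation=0.05),
    "horizontal_flip": transforms.RandomHorizontalFlip(p=0.5),
    "vertical_flip": transforms.RandomVerticalFlip(p=0.5),
    "random_crop": transforms.RandomCrop(28),           # assumed crop size
    "rotation": transforms.RandomRotation(degrees=10),  # assumed angle
    "affine": transforms.RandomAffine(degrees=10),      # assumed range
}
# Per the experiments, each transform is applied separately in its own run.
```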

4.5.2 Training methods

As mentioned earlier, our proposed architecture is trained with an end-to-end training method in which all three columns are simultaneously trained against the same loss calculated at the final FC layer of the architecture. Another common practice for training a multi-column CNN based architecture is to train each column separately and finally merge all the columns to predict the class label of each test sample. In the latter technique, every column of our proposed architecture is first trained separately on the train set and saved against the best recognition accuracy achieved on the validation set. After all three columns are trained, the local FC layers of all the convolutional layers, along with the other FC layers, are trained as follows: batches of training data are forward propagated through every previously trained column, the feature maps generated at each sampling layer are passed through its local FC layer, and these features are gradually extracted through the other FC layers. The connection weights between the FC layers are updated using backpropagation against the loss calculated at the final FC layer of our proposed architecture. Finally, the connection weights between all the FC layers are saved against the minimum loss achieved on the validation set. We tested both training techniques and found that our proposed network performs better with the simultaneous training strategy. For comparison, the performance of both training methods on the test set is shown in Table 8.

5 Conclusion

In the present work, a multi-scale multi-column skip convolutional neural network based architecture has been proposed for the recognition of handwritten characters and digits. This architecture uses a combination of multi-scale global and local geometric features of pattern images to generate a feature descriptor that describes pattern images more precisely. After exploratory experiments with different methods on multi-column CNN based architectures, we conclude that the multi-scale global and local features of a pattern image together represent a more robust and accurate description of the original image. Significantly better recognition accuracies advocate the effectiveness of the proposed methodology and open new research directions in pattern recognition tasks on different types of pattern images.

Acknowledgments

The authors are thankful to the Center for Microprocessor Application for Training Education and Research (CMATER) and the Project on Storage Retrieval and Understanding of Video for Multimedia (SRUVM) of the Computer Science and Engineering Department, Jadavpur University, for providing infrastructure facilities during the progress of the work. The current work has been partially funded by University with Potential for Excellence (UPE), Phase-II, UGC, Government of India.

Conflict of interest declaration

The authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; participation in speakers’ bureaus; membership, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript to the best of their knowledge.

References

  • S. B. Ahmed, S. Naz, S. Swati, and M. I. Razzak (2019) Handwritten urdu character recognition using one-dimensional blstm classifier. Neural Computing and Applications 31 (4), pp. 1143–1151. Cited by: §1.
  • M. Akhand, M. Ahmed, and M. H. Rahman (2016) Convolutional neural network training with artificial pattern for bangla handwritten numeral recognition. In 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV), pp. 625–630. Cited by: Table 6.
  • J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2016) Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705. Cited by: §1.
  • A. Babenko and V. Lempitsky (2015) Aggregating local deep features for image retrieval. In Proceedings of the IEEE international conference on computer vision, pp. 1269–1277. Cited by: §1.
  • A. Basu, S. De, A. Mukherjee, and E. Ullah (2018) Convergence guarantees for RMSProp and ADAM in non-convex optimization and their comparison to Nesterov acceleration on autoencoders. arXiv preprint arXiv:1807.06766. Cited by: §3.3.
  • S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, and D. K. Basu (2009) A hierarchical approach to recognition of handwritten bangla characters. Pattern Recognition 42 (7), pp. 1467–1484. Cited by: §1, Table 6.
  • S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, and D. K. Basu (2012a) An mlp based approach for recognition of handwrittenbangla’numerals. arXiv preprint arXiv:1203.0876. Cited by: §1, §1, Table 6.
  • S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, and D. K. Basu (2012b) Handwritten bangla alphabet recognition using an mlp based classifier. arXiv preprint arXiv:1203.0882. Cited by: §1, §2.
  • M. Benaddy, O. El Meslouhi, Y. Es-saady, and M. Kardouchi (2019) Handwritten tifinagh characters recognition using deep convolutional neural networks. Sensing and Imaging 20 (1), pp. 9. Cited by: §1.
  • U. Bhattacharya and B. B. Chaudhuri (2009) Handwritten numeral databases of indian scripts and multistage recognition of mixed numerals. IEEE transactions on pattern analysis and machine intelligence 31 (3), pp. 444–457. Cited by: Table 2.
  • U. Bhattacharya, M. Shridhar, and S. K. Parui (2006) On recognition of handwritten bangla characters. In Computer Vision, Graphics and Image Processing, pp. 817–828. Cited by: §1, Table 6.
  • Y. Boureau, J. Ponce, and Y. LeCun (2010) A theoretical analysis of feature pooling in visual recognition. In

    Proceedings of the 27th international conference on machine learning (ICML-10)

    ,
    pp. 111–118. Cited by: §2.
  • A. d. S. Britto Jr, R. Sabourin, F. Bortolozzi, and C. Y. Suen (2003) The recognition of handwritten numeral strings using a two-stage hmm-based method. International Journal on Document Analysis and Recognition 5 (2-3), pp. 102–117. Cited by: §1.
  • H. Bunke, S. Bengio, and A. Vinciarelli (2004) Offline recognition of unconstrained handwritten texts using hmms and statistical language models. IEEE transactions on Pattern analysis and Machine intelligence 26 (6), pp. 709–720. Cited by: §1.
  • S. Chatterjee, R. K. Dutta, D. Ganguly, K. Chatterjee, and S. Roy (2019) Bengali handwritten character classification using transfer learning on deep convolutional neural network. arXiv preprint arXiv:1902.11133. Cited by: §1.
  • R. R. Chowdhury, M. S. Hossain, R. Ul Islam, K. Andersson, and S. Hossain (2019) Bangla handwritten character recognition using convolutional neural network with data augmentation. In Joint 2019 8th International Conference on Informatics, Electronics & Vision (ICIEV), Cited by: §1.
  • D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber (2011) A committee of neural networks for traffic sign classification.. In IJCNN, pp. 1918–1921. Cited by: §2.
  • D. Cireşan, U. Meier, and J. Schmidhuber (2012) Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745. Cited by: §2, §3.3.
  • G. E. Dahl, T. N. Sainath, and G. E. Hinton (2013) Improving deep neural networks for lvcsr using rectified linear units and dropout. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 8609–8613. Cited by: §3.2.
  • A. Das, S. Roy, U. Bhattacharya, and S. K. Parui (2018) Document image classification with intra-domain transfer learning and stacked generalization of deep convolutional neural networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 3180–3185. Cited by: §1.
  • N. Das, K. Acharya, R. Sarkar, S. Basu, M. Kundu, and M. Nasipuri (2014a) A benchmark image database of isolated bangla handwritten compound characters. International Journal on Document Analysis and Recognition (IJDAR) 17 (4), pp. 413–431. Cited by: Table 2.
  • N. Das, B. Das, R. Sarkar, S. Basu, M. Kundu, and M. Nasipuri (2010) Handwritten bangla basic and compound character recognition using mlp and svm classifier. arXiv preprint arXiv:1002.4040. Cited by: §1, §3.4, Table 6.
  • N. Das, S. Pramanik, S. Basu, P. K. Saha, R. Sarkar, M. Kundu, and M. Nasipuri (2014b) Recognition of handwritten bangla basic characters and digits using convex hull based feature set. arXiv preprint arXiv:1410.0478. Cited by: §1.
  • N. Das, J. M. Reddy, R. Sarkar, S. Basu, M. Kundu, M. Nasipuri, and D. K. Basu (2012a) A statistical–topological feature combination for recognition of handwritten numerals. Applied Soft Computing 12 (8), pp. 2486–2495. Cited by: §1, Table 2.
  • N. Das, R. Sarkar, S. Basu, M. Kundu, M. Nasipuri, and D. K. Basu (2012b)

    A genetic algorithm based region sampling for selection of local features in handwritten digit recognition application

    .
    Applied Soft Computing 12 (5), pp. 1592–1606. Cited by: §1, Table 6.
  • N. Das, R. Sarkar, S. Basu, P. K. Saha, M. Kundu, and M. Nasipuri (2015) Handwritten bangla character recognition using a soft computing paradigm embedded in two pass approach. Pattern Recognition 48 (6), pp. 2054–2071. Cited by: Table 6.
  • L. Dong, F. Wei, M. Zhou, and K. Xu (2015) Question answering over freebase with multi-column convolutional neural networks. In

    Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

    ,
    pp. 260–269. Cited by: §1.
  • F. Farahnak-Ghazani and M. S. Baghshah (2016) Multi-label classification with feature-aware implicit encoding and generalized cross-entropy loss. In 2016 24th Iranian Conference on Electrical Engineering (ICEE), pp. 1574–1579. Cited by: §3.3.
  • H. Feng and T. Pavlidis (1975) Decomposition of polygons into simpler components: feature generation for syntactic pattern recognition. IEEE Transactions on Computers 100 (6), pp. 636–650. Cited by: §1.
  • A. Giusti, D. C. Cireşan, J. Masci, L. M. Gambardella, and J. Schmidhuber (2013) Fast image scanning with deep max-pooling convolutional neural networks. In 2013 IEEE International Conference on Image Processing, pp. 4034–4038. Cited by: §2.
  • R. M. Haralick, K. Shanmugam, et al. (1973) Textural features for image classification. IEEE Transactions on systems, man, and cybernetics (6), pp. 610–621. Cited by: §1.
  • A. Hasasneh, N. Salman, and D. Eleyan (2019) Towards offline arabic handwritten character recognition based on unsupervised machine learning methods: a perspective study. International Journal of Computing 8 (1), pp. 1–8. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • P. Hiremath and J. Pujari (2007) Content based image retrieval using color, texture and shape features. In 15th International Conference on Advanced Computing and Communications (ADCOM 2007), pp. 780–784. Cited by: §1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.2.
  • Y. Jiang, C. Ngo, and J. Yang (2007) Towards optimal bag-of-features for object categorization and semantic video retrieval. In Proceedings of the 6th ACM international conference on Image and video retrieval, pp. 494–501. Cited by: §1.
  • S. Kahan, T. Pavlidis, and H. S. Baird (1987) On the recognition of printed characters of any font and size. IEEE Transactions on Pattern Analysis & Machine Intelligence (2), pp. 274–288. Cited by: §1.
  • C. Labit and H. Nicolas (1991) Compact motion representation based on global features for semantic image sequence coding. In Visual Communications and Image Processing’91: Visual Communication, Vol. 1605, pp. 697–709. Cited by: §1.
  • Y. Le Cun and Y. Bengio (1994) Word-level training of a handwritten word recognizer based on convolutional neural networks. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 3-Conference C: Signal Processing (Cat. No. 94CH3440-5), Vol. 2, pp. 88–92. Cited by: §1, §2, Table 6.
  • Y. A. LeCun, L. Bottou, G. B. Orr, and K. Müller (2012) Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Cited by: §2, §3.3.
  • A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, Vol. 30, pp. 3. Cited by: §3.2.
  • B. Mandal, S. Dubey, S. Ghosh, R. Sarkhel, and N. Das (2019) Handwritten indic character recognition using capsule networks. arXiv preprint arXiv:1901.00166. Cited by: §1.
  • K. Mikolajczyk, B. Leibe, and B. Schiele (2005) Local features for object class recognition. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Vol. 2, pp. 1792–1799. Cited by: §1, §3.
  • N. Murray and F. Perronnin (2014) Generalized max pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2473–2480. Cited by: §2.
  • J. Nagi, F. Ducatelle, G. A. Di Caro, D. Cireşan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella (2011) Max-pooling convolutional neural networks for vision-based hand gesture recognition. In 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), pp. 342–347. Cited by: §2.
  • X. Niu and C. Y. Suen (2012) A novel hybrid cnn–svm classifier for recognizing handwritten digits. Pattern Recognition 45 (4), pp. 1318–1325. Cited by: §3.2.
  • E. Nowak, F. Jurie, and B. Triggs (2006) Sampling strategies for bag-of-features image classification. In European conference on computer vision, pp. 490–503. Cited by: §1.
  • A. Pal and J. Pawar (2015)

    Recognition of online handwritten bangla characters using hierarchical system with denoising autoencoders

    .
    In 2015 International Conference on Computation of Power, Energy, Information and Communication (ICCPEIC), pp. 0047–0051. Cited by: Table 6.
  • I. J. L. Paul, S. Sasirekha, D. R. Vishnu, and K. Surya (2019)

    Recognition of handwritten text using long short term memory (lstm) recurrent neural network (rnn)

    .
    In AIP Conference Proceedings, Vol. 2095, pp. 030011. Cited by: §1.
  • S. Pratt, A. Ochoa, M. Yadav, A. Sheta, and M. Eldefrawy (2019) Handwritten digits recognition using convolution neural networks. The Journal of Computing Sciences in Colleges, pp. 40. Cited by: §1.
  • M. M. Rahman, M. Akhand, S. Islam, P. C. Shill, M. Rahman, et al. (2015) Bangla handwritten character recognition using convolutional neural network. International Journal of Image, Graphics and Signal Processing (IJIGSP) 7 (8), pp. 42–49. Cited by: Table 6.
  • S. Rajashekararadhya and P. V. Ranjan (2009) Zone-based hybrid feature extraction algorithm for handwritten numeral recognition of two popular indian scripts. In 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC), pp. 526–530. Cited by: Table 6.
  • A. Rehman, M. Harouni, and T. Saba (2019) Cursive multilingual characters recognition based on hard geometric features. arXiv preprint arXiv:1904.08760. Cited by: §1.
  • H. Ren, W. Wang, and C. Liu (2019) Recognizing online handwritten chinese characters using rnns with new computing architectures. Pattern Recognition 93, pp. 179–192. Cited by: §1.
  • A. Roy, N. Das, R. Sarkar, S. Basu, M. Kundu, and M. Nasipuri (2012a) Region selection in handwritten character recognition using artificial bee colony optimization. In 2012 Third International Conference on Emerging Applications of Information Technology, pp. 183–186. Cited by: Table 6.
  • A. Roy, N. Das, R. Sarkar, S. Basu, M. Kundu, and M. Nasipuri (2014)

    An axiomatic fuzzy set theory based feature selection methodology for handwritten numeral recognition

    .
    In ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India-Vol I, pp. 133–140. Cited by: Table 6.
  • A. Roy, N. Mazumder, N. Das, R. Sarkar, S. Basu, and M. Nasipuri (2012b) A new quad tree based feature set for recognition of handwritten bangla numerals. In 2012 IEEE International Conference on Engineering Education: Innovative Practices and Future Trends (AICERA), pp. 1–6. Cited by: §1, Table 6.
  • S. Roy, N. Das, M. Kundu, and M. Nasipuri (2017)

    Handwritten isolated bangla compound character recognition: a new benchmark using a novel deep learning approach

    .
    Pattern Recognition Letters 90, pp. 15–21. Cited by: §1, §2, Table 6.
  • S. Ruder (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Cited by: §3.3.
  • D. B. Sam, S. Surya, and R. V. Babu (2017) Switching convolutional neural network for crowd counting. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4031–4039. Cited by: §2.
  • P. Sankaran and V. Asari (2004)

    A multi-view approach on modular pca for illumination and pose invariant face recognition

    .
    In 33rd Applied Imagery Pattern Recognition Workshop (AIPR’04), pp. 165–170. Cited by: §1.
  • R. Sarkar, N. Das, S. Basu, M. Kundu, M. Nasipuri, and D. K. Basu (2012) CMATERdb1: a database of unconstrained handwritten bangla and bangla–english mixed script document image. International Journal on Document Analysis and Recognition (IJDAR) 15 (1), pp. 71–83. Cited by: Table 2.
  • R. Sarkhel, N. Das, A. Das, M. Kundu, and M. Nasipuri (2017) A multi-scale deep quad tree based feature extraction method for the recognition of isolated handwritten characters of popular indic scripts. Pattern Recognition 71, pp. 78–93. Cited by: §1, §2, §2, §2, §3.4, §4.2.
  • R. Sarkhel, N. Das, A. K. Saha, and M. Nasipuri (2016) A multi-objective approach towards cost effective isolated handwritten bangla character and digit recognition. Pattern Recognition 58, pp. 172–189. Cited by: §1, Table 6.
  • R. Sarkhel and A. Nandi (2019a) Deterministic routing between layout abstractions for multi-scale classification of visually rich documents. In

    Proceedings of the 28th International Joint Conference on Artificial Intelligence

    ,
    pp. 3360–3366. Cited by: §1.
  • R. Sarkhel and A. Nandi (2019b) Visual segmentation for information extraction from heterogeneous visually rich documents. In Proceedings of the 2019 International Conference on Management of Data, pp. 247–262. Cited by: §1.
  • R. Sarkhel, A. K. Saha, and N. Das (2015) An enhanced harmony search method for bangla handwritten character recognition using region sampling. In 2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS), pp. 325–330. Cited by: §1, Table 6.
  • D. Scherer, A. Müller, and S. Behnke (2010) Evaluation of pooling operations in convolutional architectures for object recognition. In International conference on artificial neural networks, pp. 92–101. Cited by: §2.
  • C. Schmid and R. Mohr (1997) Local grayvalue invariants for image retrieval. IEEE transactions on pattern analysis and machine intelligence 19 (5), pp. 530–535. Cited by: §1.
  • T. Serre, L. Wolf, and T. Poggio (2006) Object recognition with features inspired by visual cortex. Technical report MASSACHUSETTS INST OF TECH CAMBRIDGE DEPT OF BRAIN AND COGNITIVE SCIENCES. Cited by: §2.
  • S. Sharif, N. Mohammed, N. Mansoor, and S. Momen (2016) A hybrid deep model with hog features for bangla handwritten numeral classification. In 2016 9th International Conference on Electrical and Computer Engineering (ICECE), pp. 463–466. Cited by: Table 6.
  • C. Shyu, C. Brodley, A. Kak, A. Kosaka, A. Aisen, and L. Broderick (1998) Local versus global features for content-based image retrieval. In Proceedings. IEEE Workshop on Content-Based Access of Image and Video Libraries (Cat. No. 98EX173), pp. 30–34. Cited by: §1.
  • P. K. Singh, R. Sarkar, and M. Nasipuri (2016) A study of moment based features on handwritten digit recognition. Applied Computational Intelligence and Soft Computing 2016, pp. 4. Cited by: §1.
  • P. Singh, A. Verma, and N. S. Chaudhari (2014) Handwritten devnagari digit recognition using fusion of global and local features. International Journal of Computer Applications 89 (1). Cited by: §1, §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §3.3.
  • G. Tolias, R. Sicre, and H. Jégou (2015) Particular object retrieval with integral max-pooling of cnn activations. arXiv preprint arXiv:1511.05879. Cited by: §2.
  • S. Ukil, S. Ghosh, S. M. Obaidullah, K. Santosh, K. Roy, and N. Das (2019) Improved word-level handwritten indic script identification by integrating small convolutional neural networks. Neural Computing and Applications, pp. 1–16. Cited by: §1.
  • S. Wager, S. Wang, and P. S. Liang (2013) Dropout training as adaptive regularization. In Advances in neural information processing systems, pp. 351–359. Cited by: §3.3.
  • Y. Wang, T. Mei, S. Gong, and X. Hua (2009) Combining global, regional and contextual features for automatic image annotation. Pattern Recognition 42 (2), pp. 259–266. Cited by: §1.
  • Y. Wen and L. He (2012) A classifier for bangla handwritten numeral recognition. Expert Systems with Applications 39 (1), pp. 948–953. Cited by: Table 6.
  • S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemometrics and intelligent laboratory systems 2 (1-3), pp. 37–52. Cited by: §1.
  • S. Wu, L. Hsiao, X. Cheng, B. Hancock, T. Rekatsinas, P. Levis, and C. Ré (2018) Fonduer: knowledge base construction from richly formatted data. In Proceedings of the 2018 International Conference on Management of Data, pp. 1301–1316. Cited by: §1.
  • Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 589–597. Cited by: §2.