Deep neural network based architectures [7, 19, 36, 32] have shown their efficacy on a variety of challenging problems across multiple domains over the last decade. Among those domains, the recognition of handwritten characters and digits is one of the most important. Researchers in this field have proposed numerous neural network based architectures to achieve better recognition accuracy. Ciresan et al., Sarkhel et al., Gupta et al., Roy et al., Krizhevsky et al. and others have proposed different neural network based architectures and optimization techniques for the recognition of handwritten characters and digits. Most of these researchers configured their architectures empirically, on the basis of exploratory laboratory experiments. More recent research [37, 25, 22, 21] shows that the optimal structure of a neural network based architecture can be found with reinforcement learning based techniques. These techniques can significantly reduce the experimental effort needed to find the best structure of a neural network.
In the present work we propose a genetic algorithm based approach to find the optimal structure of a multi-column convolutional neural network (CNN). More specifically, the proposed technique selects the optimal combination of kernel sizes for the layers of our multi-column CNN based architecture. The technique has been tested on three publicly available datasets of Bangla script. The results show an improvement in recognition accuracy over fixed kernel-size configurations on all three datasets, which supports the effectiveness of the proposed methodology.
The rest of the paper is organized as follows: the detailed description of the proposed methodology is presented in Section 2, the training of the proposed architecture is described in Section 3, the experimental results are presented in Section 4 and, finally, a brief conclusion is drawn from the results in Section 5.
2 Proposed Methodology
A detailed description of the proposed method is presented in this section. As mentioned earlier, a genetic algorithm based approach has been proposed for finding the optimal structure of a multi-column CNN used for recognizing handwritten characters and digits of multiple Bangla scripts. Generally, kernel sizes for the layers of a CNN architecture are finalized through experiments. Our proposed genetic algorithm based approach selects the optimal combination of kernel sizes for the different layers of every column of the multi-column CNN architecture.
2.1 A brief overview of the proposed genetic algorithm based approach
The primary stages of the proposed genetic algorithm based approach are described in the following:
2.1.1 Population Initialization:
The initial population of the algorithm is a collection of different combinations of kernel sizes; that is, each individual is a set of kernel sizes. The individuals are initialized randomly. Each individual contains as many elements as the total number of layers across all columns of the multi-column CNN architecture, i.e. its length is the number of layers per column multiplied by the number of columns (shown in Fig. 1). Each element takes one of the three permitted kernel sizes.
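As an illustration of this encoding, the following is a minimal sketch of random population initialization. The candidate sizes 3, 5 and 7, the population size and the function name `init_population` are assumptions made for the example, not values taken from the paper.

```python
import random

# Assumed candidate kernel sizes, used only for illustration.
KERNEL_SIZES = (3, 5, 7)


def init_population(pop_size, n_columns, layers_per_column, rng):
    """Return pop_size random individuals, one kernel size per layer."""
    length = n_columns * layers_per_column
    return [[rng.choice(KERNEL_SIZES) for _ in range(length)]
            for _ in range(pop_size)]


# Three columns with three layers each give individuals of length nine.
population = init_population(pop_size=6, n_columns=3, layers_per_column=3,
                             rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible.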
2.1.2 Parent Selection:
Individuals from the population are selected as parents, which then undergo crossover and mutation operations to produce offspring with generally better fitness values than the parents. The parents are selected with the Roulette Wheel parent selection mechanism.
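Roulette wheel selection picks an individual with probability proportional to its fitness. A minimal sketch, assuming non-negative fitness values (such as recognition accuracies) and a hypothetical helper name `roulette_select`:

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0.0, total)
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick <= running:
            return individual
    return population[-1]  # guard against floating-point round-off
```

Calling the function twice yields the two parents for the crossover step.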
2.1.3 Crossover Operation:
In our proposed genetic algorithm based approach a single-point crossover operation is performed. Two parent individuals participate in this operation: for each participating individual an index is chosen at random, and the elements of the two individuals are swapped about the chosen index. That is, the elements to the left of the index in one individual are exchanged with the elements to the right of the index in the other individual (shown in Fig. 1).
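The sketch below shows the textbook form of single-point crossover, in which both parents share one random cut point and exchange their tails. Whether the paper's variant uses one shared index or one index per parent is not fully clear from the description, so the shared cut point is an assumption.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length parents at one random cut point."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    cut = rng.randint(1, len(parent_a) - 1)  # keep both segments non-empty
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b
```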
2.1.4 Mutation Operation:
A random mutation operation is used in our proposed genetic algorithm. In this operation both the number of elements to mutate and the elements themselves are selected randomly from an individual. That is, a random count, no larger than the length of an individual, is chosen; that many distinct elements are then selected at random and each is replaced with a different kernel size (chosen randomly, excluding the kernel size already present at that index), as shown in Fig. 1.
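A minimal sketch of this mutation follows. The candidate sizes 3, 5 and 7 and the lower bound of one mutated element are assumptions made for the example.

```python
import random

# Assumed candidate kernel sizes, used only for illustration.
KERNEL_SIZES = (3, 5, 7)


def mutate(individual, rng):
    """Replace a random number of genes with a different kernel size."""
    mutant = list(individual)
    n_changes = rng.randint(1, len(mutant))  # lower bound of 1 is assumed
    for idx in rng.sample(range(len(mutant)), n_changes):
        options = [k for k in KERNEL_SIZES if k != mutant[idx]]
        mutant[idx] = rng.choice(options)
    return mutant
```

Because the replacement excludes the current value, every selected position is guaranteed to change.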
Table 1: Parameters of the proposed genetic algorithm
|Parameter name||Parameter value|
|Fitness value||Recognition accuracy on validation set|
|Parent selection method||Roulette wheel selection|
|Crossover type||Single point crossover|
|Mutation type||Random mutation|
Table 2: Parameters of the learning algorithm
|Parameter name||Parameter value|
|Initial learning rate||0.001|
|Learning decay rate||0.05|
|Training epochs per generation|||
Table 3: Datasets used in the experiments
|Index||Name of the dataset||Dataset type||Number of training samples||Number of test samples||Reference|
|D2||CMATERdb 3.1.2||Bangla basic characters||12,000||3,000|||
|D3||CMATERdb 3.1.3||Bangla compound characters||34,229||8,468|||
2.2 Architectures of the proposed multi-column CNN
The initial configuration of every column of the multi-column CNN based architecture is as follows: 32C2-2P2-BN-RELU-128C1-BN-RELU-256C2-2P2-BN-RELU-2048FC. Shorthand notation is used to save space. In this representation XCY denotes a convolutional layer with a total of X kernels and a stride of Y pixels, MPN denotes a max-pooling layer with an M × M kernel and a stride of N pixels, BN denotes a batch normalization layer [15], RELU denotes a Rectified Linear Unit activation layer and NFC denotes a fully connected layer with N units. Finally, the output feature maps of the columns are combined by feature concatenation and passed through a Softmax classifier as follows: ZFC-Softmax. Here Z denotes the total number of classes. A graphical view of the initial configuration of the proposed architecture is given in Fig. 2.
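To make the shorthand concrete, the sketch below parses a column specification into a list of layer descriptions. The function name `parse_column`, the dictionary keys and the reading of the pooling token as window size followed by stride are assumptions made for illustration.

```python
import re


def parse_column(spec):
    """Turn a shorthand string into a list of (layer_type, params) tuples."""
    layers = []
    for token in spec.split("-"):
        if m := re.fullmatch(r"(\d+)C(\d+)", token):
            layers.append(("conv", {"kernels": int(m[1]), "stride": int(m[2])}))
        elif m := re.fullmatch(r"(\d+)P(\d+)", token):
            layers.append(("maxpool", {"window": int(m[1]), "stride": int(m[2])}))
        elif m := re.fullmatch(r"(\d+)FC", token):
            layers.append(("fc", {"units": int(m[1])}))
        elif token in ("BN", "RELU"):
            layers.append((token.lower(), {}))
        else:
            raise ValueError(f"unrecognised token: {token}")
    return layers


layers = parse_column(
    "32C2-2P2-BN-RELU-128C1-BN-RELU-256C2-2P2-BN-RELU-2048FC")
```

Note that the shorthand carries no convolutional kernel size: that is the quantity the genetic algorithm chooses for each of the three convolutional layers.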
The stride is not the same for every kernel size. As the kernel sizes vary during the search for the optimal structure, the stride values are varied accordingly: two of the three kernel sizes share one stride value and the remaining kernel size uses a different one.
As most of the information loss occurs at the max-pooling layers, only a single kernel size is used in the subsampling layers (2P2 in the configuration above). The convolutional layers, on the other hand, are comparatively less problematic, so three different kernel sizes are permissible for them.
Table 4: Best recognition accuracy achieved with the proposed methodology
|Name of the dataset||Dataset type||Number of classes||Test set accuracy||Validation set accuracy|
|ISIBanglaDigit||Bangla Digits||10||99.12 %||99.78 %|
|CMATERdb 3.1.2||Bangla basic characters||50||97.10 %||96.57 %|
|CMATERdb 3.1.3||Bangla compound characters||171||94.77 %||94.88 %|
Table 5: Recognition accuracy with the first of the three kernel sizes fixed in every layer
|Name of the dataset||Dataset type||Number of classes||Test set accuracy|
|ISIBanglaDigit||Bangla Digits||10||98.98 %|
|CMATERdb 3.1.2||Bangla basic characters||50||95.45 %|
|CMATERdb 3.1.3||Bangla compound characters||171||92.56 %|
Table 6: Recognition accuracy with the second of the three kernel sizes fixed in every layer
|Name of the dataset||Dataset type||Number of classes||Test set accuracy|
|ISIBanglaDigit||Bangla Digits||10||99.02 %|
|CMATERdb 3.1.2||Bangla basic characters||50||95.56 %|
|CMATERdb 3.1.3||Bangla compound characters||171||93.12 %|
Table 7: Recognition accuracy with the third of the three kernel sizes fixed in every layer
|Name of the dataset||Dataset type||Number of classes||Test set accuracy|
|ISIBanglaDigit||Bangla Digits||10||98.87 %|
|CMATERdb 3.1.2||Bangla basic characters||50||95.10 %|
|CMATERdb 3.1.3||Bangla compound characters||171||91.46 %|
3 Training the Proposed Architecture
The proposed architecture is trained with every individual from the population over 20 generations. After every generation the best fit individuals (those with the highest fitness values) are selected for our proposed crossover and mutation operations. The best fit offspring is then used to update the population by replacing the individual with the minimum fitness value. The parameters of the proposed genetic algorithm are presented in Table 1.
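The generation loop described above can be sketched as follows. This is an illustration under several assumptions: `toy_fitness` is a stand-in for the validation accuracy obtained by training the network, the two fittest individuals are taken as parents, one gene is mutated per child, and the candidate sizes 3, 5 and 7 are placeholders.

```python
import random

# Assumed candidate kernel sizes, used only for illustration.
KERNEL_SIZES = (3, 5, 7)


def toy_fitness(individual):
    """Stand-in for validation accuracy: rewards genes equal to 5."""
    return sum(1 for k in individual if k == 5) / len(individual)


def evolve(population, fitness_fn, generations, rng):
    """Steady-state loop: the best offspring replaces the worst individual."""
    population = [list(ind) for ind in population]  # leave the input intact
    for _ in range(generations):
        ranked = sorted(population, key=fitness_fn, reverse=True)
        parent_a, parent_b = ranked[0], ranked[1]
        cut = rng.randint(1, len(parent_a) - 1)
        children = [parent_a[:cut] + parent_b[cut:],
                    parent_b[:cut] + parent_a[cut:]]
        for child in children:  # mutate one random gene per child
            idx = rng.randrange(len(child))
            child[idx] = rng.choice(
                [k for k in KERNEL_SIZES if k != child[idx]])
        best_child = max(children, key=fitness_fn)
        worst = min(range(len(population)),
                    key=lambda i: fitness_fn(population[i]))
        population[worst] = best_child
    return max(population, key=fitness_fn)


rng = random.Random(0)
start = [[rng.choice(KERNEL_SIZES) for _ in range(9)] for _ in range(6)]
best = evolve(start, toy_fitness, generations=20, rng=rng)
```

In the actual method each fitness evaluation requires training the multi-column CNN, so caching the fitness of already evaluated individuals would be a natural optimization.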
The fitness value of each individual is found during training of our proposed architecture. The training images are fed into every column of the architecture simultaneously and the loss function is calculated at the final Softmax classifier. The connection weights between the layers of every column of our proposed architecture are updated in a single pass of backpropagation after every epoch using Eq. 1 and Eq. 2.
$$E[g^2]_t = \beta\, E[g^2]_{t-1} + (1-\beta)\left(\frac{\partial C}{\partial w}\right)^2 \tag{1}$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t}}\,\frac{\partial C}{\partial w} \tag{2}$$
where $E[g^2]$ is the moving average of squared gradients, $\partial C/\partial w$ is the gradient of the cost function with respect to the weight, $\eta$ is the learning rate, $\beta$ is the moving average parameter and $w_t$ is the value of the weight parameter at iteration t.
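The RMSProp update described above keeps a moving average of squared gradients and scales each step by its square root. A minimal sketch for a single scalar weight follows; the moving average parameter of 0.9 and the small stabilising constant `eps` are assumptions, as neither is stated in the text.

```python
import math


def rmsprop_step(w, grad, avg_sq, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update for a single scalar weight."""
    avg_sq = beta * avg_sq + (1.0 - beta) * grad ** 2
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq


# Minimise f(w) = w**2, whose gradient is 2*w.
w, avg_sq = 1.0, 0.0
for _ in range(2000):
    w, avg_sq = rmsprop_step(w, 2.0 * w, avg_sq, lr=0.01)
```

Because the step is normalised by the gradient magnitude, the weight approaches the minimum at a roughly constant rate and then oscillates within a band on the order of the learning rate.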
An adaptive learning rate algorithm, RMSProp, is used to train our proposed architecture. Cross-entropy loss is used as the loss function (shown in Eq. 3) during training of the proposed architecture. In every generation the proposed architecture is trained for a fixed number of epochs with every individual of the population. A variable learning rate is used, i.e. the learning rate is decreased by a fixed factor per epoch until it reaches a lower limit. LeCun et al. suggested shuffling the data before every epoch of training. As shuffling introduces heterogeneity in the datasets and enhances the convergence rate of the learning algorithm, in our current experimental setup the training data is randomly shuffled before every epoch of RMSProp based training. Dropout regularization is used (only in the FC layers) to reduce the possibility of overfitting during training. The parameters of the learning algorithm used to train the proposed architecture are presented in Table 2.
$$L = -\sum_{i=1}^{Z} y_i \log \hat{y}_i \tag{3}$$
where $y_i$ is the actual class, $\hat{y}_i$ is the predicted class and $Z$ is the total number of classes.
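A short numerical illustration of the cross-entropy loss, assuming a one-hot encoded actual class and a vector of predicted class probabilities; the clipping constant `eps` is an added safeguard against taking the logarithm of zero, not part of the paper's description.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# Three classes, the second one is the actual class.
loss = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
```

With a one-hot target only the probability assigned to the actual class contributes, so the loss here equals the negative logarithm of 0.8.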
During training each individual is decoded to load the kernel sizes into every layer of each column. As our three-column architecture has three layers in every column, the first three elements (kernel sizes) are loaded into the three layers of the first column, respectively. Similarly, the next three elements are loaded into the layers of the second column and the last three into the layers of the third column.
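The decoding step amounts to slicing the flat individual into consecutive groups, one per column. A minimal sketch, with a hypothetical function name `decode` and example kernel sizes chosen only for illustration:

```python
def decode(individual, n_columns=3, layers_per_column=3):
    """Split a flat individual into one kernel-size list per column."""
    if len(individual) != n_columns * layers_per_column:
        raise ValueError("individual length does not match the architecture")
    return [individual[c * layers_per_column:(c + 1) * layers_per_column]
            for c in range(n_columns)]


# Example values are placeholders, not kernel sizes reported in the paper.
columns = decode([3, 5, 7, 5, 5, 3, 7, 3, 5])
```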
After training, the best fit individual is selected from the population and the kernel sizes of every layer of all the columns are updated accordingly.
4 Experiments
As mentioned earlier, a genetic algorithm based approach is proposed for recognizing handwritten character and digit images of multiple Bangla scripts. PyTorch, a Python based library, is used to implement, train and test the proposed multi-column CNN based architecture. MATLAB is used to perform the basic image processing operations. All of the experiments are performed on systems with Intel dual-core processors and an NVIDIA GeForce Ti graphics card.
4.1 Datasets used in our experiment
The proposed architecture is tested on three publicly available datasets of Bangla script. The name, type and volume (number of training and test samples) of the datasets used in our current experimental setup are given in Table 3. These intricate handwritten datasets differ significantly from each other and thus together constitute an ideal test set for our proposed architecture. More information about the datasets is available from the references in the last column of Table 3.
4.2 Pre-processing of datasets
A few pre-processing steps are applied to the images of the datasets. Every image of an isolated handwritten character or digit is binarized, cropped to its tightest bounding box and finally resized to 32 x 32 pixels. Median and Gaussian filters are used to remove noise from the images.
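The binarize, crop and resize steps can be sketched in plain Python as below. The paper performs its image processing in MATLAB, so this is only an illustration: the threshold of 128, the dark-ink-on-light-background convention and nearest-neighbour interpolation are all assumptions, and the median and Gaussian filtering is omitted.

```python
def binarize(image, threshold=128):
    """Map a grayscale image (list of rows) to 0/1, with ink as 1."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def crop_to_ink(binary):
    """Crop to the tightest bounding box around non-zero pixels."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    cols = [c for c in range(len(binary[0]))
            if any(row[c] for row in binary)]
    if not rows:
        return binary  # blank image: nothing to crop
    return [row[cols[0]:cols[-1] + 1]
            for row in binary[rows[0]:rows[-1] + 1]]


def resize_nearest(image, size=32):
    """Nearest-neighbour resize to size x size."""
    h, w = len(image), len(image[0])
    return [[image[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]
```

Applying `resize_nearest(crop_to_ink(binarize(img)))` to a grayscale image `img` yields a 32 x 32 binary image.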
After the pre-processing step, the training dataset is randomly divided into a training set and a validation set such that the size of the validation set matches the size of the test set. The architecture is then trained on the training set, and the model with the best recognition accuracy on the validation set is saved. After the network is trained, it is evaluated on the test set.
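A minimal sketch of such a split, using the sample counts listed for CMATERdb 3.1.2 (12,000 training samples and 3,000 test samples); the function name `split_train_validation` and the use of integer indices in place of images are assumptions for illustration.

```python
import random


def split_train_validation(samples, validation_size, rng):
    """Randomly hold out validation_size samples from the training data."""
    if not 0 < validation_size < len(samples):
        raise ValueError(
            "validation_size must be between 1 and len(samples) - 1")
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return shuffled[validation_size:], shuffled[:validation_size]


# Hold out as many validation samples as there are test samples.
train, val = split_train_validation(range(12000), 3000, random.Random(0))
```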
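Table 8: Comparison with contemporary works
|Dataset type||Work reference||Recognition accuracy (%)|
|ISI digits||Sharif et al.||99.02 %|
|ISI digits||Wen et al.||96.91 %|
|ISI digits||Das et al.||97.70 %|
|ISI digits||Akhand et al.||97.93 %|
|ISI digits||CNNAP||98.98 %|
|ISI digits||The Present Work||99.12 %|
|Bangla basic characters||Roy et al.||86.40 %|
|Bangla basic characters||Das et al.||80.50 %|
|Bangla basic characters||Basu et al.||80.58 %|
|Bangla basic characters||Sarkhel et al.||86.53 %|
|Bangla basic characters||Bhattacharya et al.||92.15 %|
|Bangla basic characters||LeCun et al.||92.88 %|
|Bangla basic characters||The Present Work||97.10 %|
|Bangla compound characters||Das et al.||75.05 %|
|Bangla compound characters||Das et al.||87.50 %|
|Bangla compound characters||Sarkhel et al.||78.38 %|
|Bangla compound characters||Sarkhel et al.||86.64 %|
|Bangla compound characters||LeCun et al.||86.85 %|
|Bangla compound characters||Roy et al.||90.33 %|
|Bangla compound characters||The Present Work||94.77 %|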
4.3 Experimental results
As mentioned before, the proposed technique is tested on three Bangla handwritten datasets: a Bangla digits dataset, a Bangla basic characters dataset and a Bangla compound characters dataset. The best recognition accuracy achieved on these datasets using our proposed methodology is presented in Table 4. For comparison, the recognition accuracies obtained with our proposed architecture at a fixed scale, i.e. using a single kernel size in every layer, are presented in Table 5, Table 6 and Table 7, one table for each of the three kernel sizes. The experimental results show that, among the fixed-scale networks, the one in Table 6 performs better than those in Table 5 and Table 7: performance first increases from the first kernel size to the second and then drops again for the third. With multi-scaling, the network performs better than all three fixed-scale networks. However, not every multi-scale combination gives better recognition accuracy than a fixed-scale network. The optimal multi-scale combination can be found using a genetic algorithm based approach, which is the primary concern of this work.
To demonstrate the effectiveness of our proposed work, Table 8 presents the recognition performance of some contemporary works on the datasets used in our current experimental setup. The present work achieves the best recognition accuracy on each dataset in Table 8.
5 Conclusion
In the present work a methodology has been proposed to reduce the effort of finding the optimal combination of kernel sizes for the layers of a neural network based architecture. The genetic algorithm based technique selects the optimal kernel combination from an initial population of different kernel sizes after iterating over multiple generations. Researchers in computer vision and other domains can also use this neural architecture search strategy to select the optimal combination of kernel sizes, along with other hyper-parameters of their neural network based architectures, before final training. This can yield better performance with less human effort. The genetic algorithm based neural architecture search methodology also opens up a direction of research for pattern recognition and other domains.
Acknowledgements
The authors are thankful to the Center for Microprocessor Application for Training Education and Research (CMATER) and the Project on Storage Retrieval and Understanding of Video for Multimedia (SRUVM) of the Computer Science and Engineering Department, Jadavpur University, for providing infrastructure facilities during the progress of the work. The current work, reported here, has been partially funded by University with Potential for Excellence (UPE), Phase-II, UGC, Government of India.
References
[1] (2016) Convolutional neural network training with artificial pattern for bangla handwritten numeral recognition. In 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV), pp. 625–630. Cited by: Table 8.
[2] (2017) Isolated bangla handwritten character recognition with convolutional neural network. In 2017 20th International Conference of Computer and Information Technology (ICCIT), pp. 1–6. Cited by: Table 3.
[3] Convergence guarantees for rmsprop and adam in non-convex optimization and their comparison to nesterov acceleration on autoencoders. arXiv preprint arXiv:1807.06766. Cited by: §3.
[4] (2009) A hierarchical approach to recognition of handwritten bangla characters. Pattern Recognition 42 (7), pp. 1467–1484. Cited by: Table 8.
[5] (2008) Handwritten numeral databases of indian scripts and multistage recognition of mixed numerals. IEEE transactions on pattern analysis and machine intelligence 31 (3), pp. 444–457. Cited by: Table 3, §3.
[6] (2006) On recognition of handwritten bangla characters. In Computer Vision, Graphics and Image Processing, pp. 817–828. Cited by: Table 8.
[7] (2012) Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745. Cited by: §1, §3.
[8] (2013) Improving deep neural networks for lvcsr using rectified linear units and dropout. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 8609–8613. Cited by: §2.2.
[9] (2010) Handwritten bangla basic and compound character recognition using mlp and svm classifier. arXiv preprint arXiv:1002.4040. Cited by: Table 8.
[10] (2012) A genetic algorithm based region sampling for selection of local features in handwritten digit recognition application. Applied Soft Computing 12 (5), pp. 1592–1606. Cited by: §1, Table 8.
[11] (2015) Handwritten bangla character recognition using a soft computing paradigm embedded in two pass approach. Pattern Recognition 48 (6), pp. 2054–2071. Cited by: Table 8.
[12] (2003) Multi-category classification by soft-max combination of binary classifiers. In International Workshop on Multiple Classifier Systems, pp. 125–134. Cited by: §3.
[13] (2016) Multi-label classification with feature-aware implicit encoding and generalized cross-entropy loss. In 2016 24th Iranian Conference on Electrical Engineering (ICEE), pp. 1574–1579. Cited by: §3.
[14] (2019) Multiobjective optimization for recognition of isolated handwritten indic scripts. Pattern Recognition Letters 128, pp. 318–325. Cited by: §1.
[15] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.2.
[16] (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
[17] (1994) Word-level training of a handwritten word recognizer based on convolutional neural networks. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 3-Conference C: Signal Processing (Cat. No. 94CH3440-5), Vol. 2, pp. 88–92. Cited by: Table 8.
[18] (2012) Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Cited by: §3.
[19] (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §1.
[20] (2012) Roulette-wheel selection via stochastic acceptance. Physica A: Statistical Mechanics and its Applications 391 (6), pp. 2193–2196. Cited by: §2.1.2.
[21] (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §1.
[22] NSGA-net: neural architecture search using multi-objective genetic algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 419–427. Cited by: §1.
[23] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1.
[24] (2011) Max-pooling convolutional neural networks for vision-based hand gesture recognition. In 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), pp. 342–347. Cited by: §2.2.
[25] (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §1.
[26] (2015) Bangla handwritten character recognition using convolutional neural network. International Journal of Image, Graphics and Signal Processing (IJIGSP) 7 (8), pp. 42–49. Cited by: Table 8.
[27] (2012) Region selection in handwritten character recognition using artificial bee colony optimization. In 2012 Third International Conference on Emerging Applications of Information Technology, pp. 183–186. Cited by: Table 8.
[28] (2017) Handwritten isolated bangla compound character recognition: a new benchmark using a novel deep learning approach. Pattern Recognition Letters 90, pp. 15–21. Cited by: §1, Table 8.
[29] (2012) CMATERdb1: a database of unconstrained handwritten bangla and bangla–english mixed script document image. International Journal on Document Analysis and Recognition (IJDAR) 15 (1), pp. 71–83. Cited by: Table 3.
[30] A multi-scale deep quad tree based feature extraction method for the recognition of isolated handwritten characters of popular indic scripts. Pattern Recognition 71, pp. 78–93. Cited by: §1, §2.2, §4.2.
[31] (2016) A multi-objective approach towards cost effective isolated handwritten bangla character and digit recognition. Pattern Recognition 58, pp. 172–189. Cited by: Table 8.
[32] Deterministic routing between layout abstractions for multi-scale classification of visually rich documents. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3360–3366. Cited by: §1.
[33] (2015) An enhanced harmony search method for bangla handwritten character recognition using region sampling. In 2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS), pp. 325–330. Cited by: Table 8.
[34] (2016) A hybrid deep model with hog features for bangla handwritten numeral classification. In 2016 9th International Conference on Electrical and Computer Engineering (ICECE), pp. 463–466. Cited by: Table 8.
[35] (2012) A classifier for bangla handwritten numeral recognition. Expert Systems with Applications 39 (1), pp. 948–953. Cited by: Table 8.
[36] (2016) Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 589–597. Cited by: §1.
[37] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1.