Technical advancement has led to the conversion of analogue data into digital form. Digital data supports a wide range of automated operations, such as text search within documents, modification, and context extraction. Further, technical advancement with the help of artificial intelligence has enabled combating pandemics [4, 3], optimal mitigation, recognition of fabric elements, intuitively optimal solutions, and so on. Hence, digitalising paper documents would enable more extensive use of the information they contain.
Optical character recognition (OCR) scans an image and generates a machine-encoded file. OCR is a popular technology used in automated data capture solutions and document classification. There are different types of OCR systems, including intelligent word recognition, intelligent character recognition, optical word recognition, optical character recognition, and optical mark recognition. It is a broad area of research due to its usefulness. OCR systems are used for converting books, documents, and images into computerised files. OCR is used in diverse domains, including banks, post offices, defence organisations, license plate recognition, reading aids for the blind, library automation, language processing, multimedia system design, and educational institutes. People can use OCR to reduce the effort of digitising documents manually. OCR systems can benefit organisations through faster processing, a more productive workforce, and lower costs. With the development of digital computers and scanning devices, OCR technology has improved since the middle of the 1950s. OCR systems fall into two major categories: typewritten and handwritten scripts. Typewritten scripts are typed on a computer before the recognition process starts. Handwritten scripts, on the other hand, are written by humans and then recognised by the OCR system. Typewritten OCR systems are generally easier to implement than handwritten OCR systems. The success rate of typewritten recognition systems is also higher than that of handwritten ones, as typewritten text is less complicated and shows less variation. Implementing an OCR system is not easy, as machines cannot perceive information from an image the way the human brain does. Therefore, many researchers have worked on transforming documents into machine-readable files since the mid-1950s.
There are many established languages worldwide, such as English, Chinese, Arabic, Japanese, and Bengali. Bengali is the fifth most-spoken native language and the seventh most-spoken language by total number of speakers in the world. Many different kinds of documents, including letters, textbooks, novels, official documents, legacy documents, newspapers, magazines, and data entry forms, exist in Bengali and need digitalisation. Many renowned handwritten pieces of Bengali literature also need to be computerised and stored.
The shapes of Bengali characters make them considerably more difficult to recognise than the characters of most other languages. The Bengali language consists of 11 vowels (Shoroborno) and 39 consonants (Benjonborno), and two or more characters can combine to form new characters called compound (Juktoborno) characters. Building an OCR system that recognises Bengali characters is therefore more complicated than building one for the characters of most other languages. Hence, many researchers have been working on OCR systems for identifying Bengali characters since the mid-1980s. Since then, Bengali OCR has been a large field of research, and many researchers have proposed state-of-the-art solutions. Yet, to the best of our knowledge, there is no end-to-end OCR system that recognises handwritten Bengali words from images.
In this paper, we propose a model that recognises handwritten Bengali words from images. Our approach uses an end-to-end architecture trained with the connectionist temporal classification (CTC) loss function. From an architectural perspective, we use a deep convolutional neural network (CNN) as a feature extractor for handwritten images. The extracted features (scanned from left to right across a word image) are then passed to recurrent layers, specifically LSTM or GRU layers. The recurrent layers extract dependent features based on the previously seen image slices and produce high-dimensional features. Finally, a fully connected layer generates a probability distribution for the final prediction.
The overall contribution of the paper includes:
We introduce an end-to-end word recognition system for the Bengali language. To the best of our knowledge, this is the first research endeavour that explores an end-to-end strategy in Bengali OCR.
We investigate the performance of the end-to-end strategy on an established Bengali dataset, BanglaWriting.
As a feature extractor, we use four different baselines (Xception, NASNet, MobileNet, and DenseNet) and conclude that deeper architectures with residuals perform better for Bengali handwritten OCR.
The rest of the paper is organised as follows: Section II highlights the work conducted in the domain of optical character recognition, specifically for the Bengali language. Section III explains the proposed architecture and the steps taken to build the system. Section IV describes the experiments undertaken to evaluate the proposed method. Finally, Section V concludes the paper.
II Related Work
Many complete OCR systems exist for Bengali scripts, as the topic has been researched since the 1980s. Researchers have made many notable contributions in this field, some of which are worth mentioning. B. B. Chaudhuri and U. Pal proposed “OCR in Bangla: an Indo-Bangladeshi language” and also suggested a complete printed Bengali OCR system, including the feature extraction process for recognition.
J. U. Mahmud et al. proposed another complete OCR system that recognises isolated and continuous printed multi-font Bengali characters, achieving a 98% recognition rate on isolated characters and a 96% recognition rate on continuous characters.
A. Chowdury et al. proposed a better approach for optical character recognition of Bengali characters using neural networks. They also describe efficient ways of performing line and word detection, zoning, character separation, and character recognition.
Hasnat M. A. et al. proposed a domain-specific OCR for Bengali script, using a Hidden Markov Model (HMM) for character classification, and added a dedicated error-correcting module to handle errors that occur at the preprocessing level. Finally, once the word is formed, they apply a dictionary and defined rules to correct the probable errors.
All the aforementioned efforts share a significant limitation: they work with printed or typewritten scripts only and are not capable of recognising handwritten scripts. To overcome this limitation, many researchers have worked with Bengali handwritten scripts.
Pramanik R. and Bag S. proposed a novel shape decomposition-based segmentation technique to decompose compound characters into prominent shape components. They claim that this shape decomposition technique reduces the classification complexity, since there are fewer classes to recognise, and at the same time improves the recognition accuracy. Further, they used a chain code histogram feature set with a multi-layer perceptron (MLP) classifier trained with backpropagation.
M. Al Rabbani Alif et al. proposed a modified ResNet-18 architecture to recognise Bengali handwritten characters. They restructured the ResNet-18 architecture by adding dropout layers that boost the classification performance. They applied their architecture to the BanglaLekha-Isolated dataset and the CMATERdb dataset and obtained an accuracy of % and %, respectively.
Rakshit P. et al. proposed a scheme for tri-level segmentation (line, word, and character) of Bengali handwritten scripts. They achieved an average of % accuracy on line segmentation, % accuracy on word segmentation, and % accuracy on character segmentation on a dataset of Bengali handwritten text documents.
Hasan F., Shuvo S. N. et al. proposed a new methodology to recognise characters in continuous Bengali handwriting using a CNN. They take continuous Bengali handwritten text images as input, segment the input text into its constituent words, and then segment each word into individual characters. They used the EkushNet dataset model, which includes 50 basic characters, 10 character modifiers, 52 frequently used conjunct characters, and 10 digits, and were able to segment % of words from the text and % of characters from the words.
A few studies in the literature only segment handwritten Bengali words. S. Basu et al. proposed a fuzzy technique for the segmentation of handwritten Bengali word images. First, they identify the Matra (i.e., the longest straight line that connects multiple characters to make a Bengali word) in the target word image using a fuzzy feature. Then, some parts of the Matra are identified as segmentation points using three fuzzy features. They used only 210 samples of handwritten Bengali words in their experiment, and they claim % average accuracy.
Pramanik R. and Bag S. proposed a method for recognising handwritten Bengali and Devanagari words that detects and corrects the skew present in words, estimates the headline, and then segments the words into meaningful pseudo-characters. To the best of our knowledge, this is the only research work that recognises Bengali words from images. They extract three different statistical features, combine them, and apply a CNN-based transfer learning architecture. After that, they combine the identified pseudo-characters to form the full word. They claimed % accuracy in recognising Bengali words from images with their proposed segmentation methodology. For the experiment, they used 2000 Bengali word images from the CMATERdb dataset versions 1.1.1 and 1.5.1, the ICDAR 2013 Segmentation dataset, and the PHDIndic_11 dataset. However, 2000 samples are insufficient to build and benchmark an OCR system for handwritten Bengali words.
All of these proposed architectures depend on outdated strategies for Bengali OCR. Hence, this paper proposes an end-to-end approach that recognises handwritten Bengali words from word images. Further, the experiments conducted in similar research work use few words (at most 2000) in the evaluation phase. In contrast, we use 16975 words from the BanglaWriting dataset in our experiments. To build the system, we use different CNN and RNN architectures. Finally, we present a benchmark of the evaluated results.
This method recognises handwritten Bengali words from word-level images. First, we extract features from the images, and then we use a loss function to train the model and calculate the loss and error. The complete methodology is divided into the following steps: (a) data collection and preprocessing, (b) feature extraction, and (c) loss and error calculation. Fig. 1 gives a visual overview of this workflow.
III-A Data preprocessing
III-A1 Reshape and normalization
Before fitting the word-level image data to the architecture, we preprocess the data to satisfy some conditions. First, each word-level image is reshaped to 50 by 200 pixels. Second, we set the maximum number of characters in a word to 10. Hence, words with more than 10 characters are excluded from both the text and the image database. The image data are then normalized to the range [-1, 1], as data normalization ensures a comparable data distribution across all input parameters. Each input image is normalized as,
Here, the terms of the equation denote the single-channel word image matrix, the number of rows, and the number of columns of the word image matrix.
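The normalization equation itself is not reproduced here, so the following is only a minimal sketch of one common way to map pixel values to [-1, 1]. It assumes 8-bit grayscale input in [0, 255] and a linear scaling `x / 127.5 - 1`; both the scaling and the function name `normalize_word_image` are assumptions for illustration, not necessarily the formula used in this work.

```python
import numpy as np


def normalize_word_image(image: np.ndarray) -> np.ndarray:
    """Linearly map 8-bit pixel values from [0, 255] to [-1, 1].

    Assumption: the input is a single-channel uint8 image; the paper's
    exact normalization formula may differ from this linear scaling.
    """
    return image.astype(np.float32) / 127.5 - 1.0


# Hypothetical 50 x 200 word image filled with random 8-bit intensities.
rng = np.random.default_rng(0)
word_image = rng.integers(0, 256, size=(50, 200), dtype=np.uint8)
normalized = normalize_word_image(word_image)
```

With this scaling, a black pixel (0) maps to -1 and a white pixel (255) maps to 1, so every image fed to the network lies in the same numeric range.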
Data augmentation is a popular technique to expand the variety of training data and to avoid overfitting while training models. Therefore, we use several augmentation techniques to create a diverse dataset. We use the open-source Albumentations library, which implements various image augmentation strategies. Fig. 2 shows the augmentations applied to one image. The implemented augmentation strategies are described below.
Horizontal cutout erases part of the image by adding random horizontal black boxes.
Vertical cutout erases part of the image by adding random vertical black boxes.
Gaussian noise adds random dots to create noise in the image.
Shift scale rotation shifts, scales, and rotates the image within given limits.
Optical distortion distorts pixel patterns of handwritten images.
Grid distortion elastically distorts written patterns, causing written lines to shrink and bend.
Affine transformation places a regular grid of points on the image and randomly moves the neighbourhood of these points around.
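To make the first three strategies concrete, the sketch below re-implements them with plain NumPy. It is a hand-written illustration, not the Albumentations implementation used in this work; the function names, the band size `max_size`, the noise level `sigma`, and the choice of a single band per image are all assumptions.

```python
import numpy as np


def cutout(image, axis, max_size=8, rng=None):
    """Black out one random band of rows (axis=0) or columns (axis=1)."""
    if rng is None:
        rng = np.random.default_rng()
    out = image.copy()
    length = out.shape[axis]
    size = int(rng.integers(1, min(max_size, length) + 1))
    start = int(rng.integers(0, length - size + 1))
    if axis == 0:
        out[start:start + size, :] = 0  # horizontal black box
    else:
        out[:, start:start + size] = 0  # vertical black box
    return out


def add_gaussian_noise(image, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise and clip back to the 8-bit range."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
# Hypothetical blank white 50 x 200 word image (before normalization).
word_image = np.full((50, 200), 255, dtype=np.uint8)
h_cut = cutout(word_image, axis=0, rng=rng)
v_cut = cutout(word_image, axis=1, rng=rng)
noisy = add_gaussian_noise(word_image, rng=rng)
```

Each function returns a new array and leaves the input untouched, so several augmented variants can be generated from the same source image.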
III-B Features Extraction
Feature extraction is the process of finding the information in an image that is necessary to identify the word. In our architecture, the feature extraction process is split into two parts: a baseline model and a stack of bidirectional RNN layers. The input layer receives the word image in a fixed shape, which gives the best results in our architecture. The baseline model then directly receives the input images and extracts high-dimensional features. To create the baseline model, we use well-known CNN architectures rather than building our own, which minimises the workload while retaining their feature extraction capability. We modify each pre-trained convolutional model by removing some layers so that it yields 25 left-to-right frames for the end-to-end method. The convolutional layers take the input images, extract useful features, and give a lower-dimensional output. To build the end-to-end system, we fix the operational output shape (72 is the number of unique characters). We experiment with four different convolutional architectures and evaluate their performance in the results analysis section. The features obtained from the baseline model are passed to the bidirectional RNN model, in which two bidirectional RNN layers predict the word from the image by analysing the sequence of features. Since a plain RNN suffers from the vanishing gradient problem, we use two modified RNN variants, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), to avoid it. We experiment with both variants and evaluate their accuracy in the results analysis section. The output of the bidirectional RNN layers passes through a fully connected dense layer with a softmax activation function, which provides the final output. The final output shape must match the shape required by the CTC loss function. Fig. 3 illustrates a block diagram of the full architecture.
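The shape bookkeeping described above can be sketched with NumPy alone. The example below shows how a CNN feature map could be turned into 25 left-to-right frames and then into a per-frame probability distribution. It is a simplified illustration under stated assumptions: the recurrent layers are omitted, the feature map size (6 rows, 32 channels) and the random weights are hypothetical, and the extra blank class is the usual CTC convention rather than a value stated in the text.

```python
import numpy as np


def feature_map_to_frames(feature_map):
    """Turn a (height, width, channels) map into `width` left-to-right frames."""
    height, width, channels = feature_map.shape
    # Put the width axis first so each frame covers one vertical image slice.
    return feature_map.transpose(1, 0, 2).reshape(width, height * channels)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
num_frames, num_symbols = 25, 72   # values taken from the text
num_classes = num_symbols + 1      # assumption: one extra CTC blank class

# Hypothetical CNN output: 6 rows, 25 columns (frames), 32 channels.
feature_map = rng.normal(size=(6, num_frames, 32))
frames = feature_map_to_frames(feature_map)   # shape (25, 192)

# Stand-in for the dense softmax layer; the RNN stack is skipped here.
weights = rng.normal(scale=0.1, size=(frames.shape[1], num_classes))
probs = softmax(frames @ weights)             # shape (25, num_classes)
```

Each row of `probs` is a distribution over symbols for one image slice, which is the form a CTC loss consumes.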
III-C Loss and error calculation
We use the Connectionist Temporal Classification (CTC) loss function to train our model and to calculate the loss and error throughout training. The CTC loss function does not need aligned data. Instead, it works by summing the probabilities of all possible alignments between the input frames and the target text. As our handwritten data is not aligned, the CTC loss is a natural fit. Moreover, the CTC loss alone is sufficient to train the whole model. Therefore, we use the CTC loss function for our end-to-end architecture.
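To illustrate what summing over all alignments means, the sketch below computes the CTC likelihood of a label sequence in two ways: with the standard forward (dynamic programming) recursion and by brute-force enumeration of every frame-level path. It is a small NumPy illustration, not the TensorFlow/Keras implementation used for training; it works in probability space rather than log space, so it is only suitable for short sequences, and class index 0 as the blank is an assumption.

```python
import itertools

import numpy as np


def ctc_probability(probs, label, blank=0):
    """P(label | probs) via the CTC forward recursion; probs has shape (T, K)."""
    extended = [blank]
    for symbol in label:
        extended += [symbol, blank]          # blank, l1, blank, l2, ..., blank
    T, S = probs.shape[0], len(extended)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, extended[0]]
    if S > 1:
        alpha[0, 1] = probs[0, extended[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            # Skipping a blank is allowed only between two different symbols.
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, extended[s]]
    return alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)


def collapse(path, blank=0):
    """Merge repeated symbols, then drop blanks (the CTC path-to-label map)."""
    merged = [symbol for symbol, _ in itertools.groupby(path)]
    return [symbol for symbol in merged if symbol != blank]


def brute_force_probability(probs, label, blank=0):
    """Sum the probability of every frame-level path that collapses to label."""
    T, K = probs.shape
    total = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == list(label):
            total += float(np.prod(probs[np.arange(T), list(path)]))
    return total


rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=4)   # 4 frames, 3 classes (0 = blank)
label = [1, 2]
likelihood = ctc_probability(probs, label)
loss = -np.log(likelihood)                  # CTC loss for this one sample
```

The two functions agree on small inputs, but the brute-force version grows exponentially with the number of frames, which is why the forward recursion is used in practice.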
IV Experimental analysis
IV-A The Bengali handwritten dataset
To evaluate the proposed architecture, the BanglaWriting dataset is used. The dataset contains single-page handwriting samples from 260 different people. Every page includes a Unicode transcription of the writing and a bounding box around each word. The dataset contains 21,234 words. However, we took 16975 words to run the experiment, resulting in 73 unique Bengali characters. Fig. 4 shows example images.
IV-B Experimental setup
The proposed architecture is implemented using TensorFlow, Keras, Matplotlib, NumPy, and Python. Albumentations is used to augment the dataset. The CTC loss function is used to measure the loss and error of the architecture. The dataset is divided into train, validation, and test subsets, which contain %, %, and % of the data, respectively. Each architecture is trained with a batch size of 16, a maximum of 1000 epochs, and a learning rate of 0.001. The Adam optimiser is used to train the full architecture.
IV-C Evaluation metrics
Two evaluation metrics are used to compare and evaluate the performance of our architecture. They are presented as follows:
Character Error Rate: The Character Error Rate (CER) indicates the proportion of characters that the OCR system predicts erroneously. To calculate the CER, we use the edit distance algorithm as follows,
\[ \mathrm{CER} = \frac{S + I + D}{N} \]
Here, $S$ means substitutions, $I$ means insertions, $D$ means deletions, and $N$ is the number of characters in the actual word.
Substitutions occur when a character is replaced within a word. Insertions occur when an extra character that was not in the actual word is added to the word. Deletions occur when a character that was present in the actual word is removed from the word.
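The CER computation can be sketched in a few lines of Python. The example uses the classic dynamic-programming edit distance, which counts the minimum number of substitutions, insertions, and deletions needed to turn the predicted word into the actual word. Normalizing by the total number of reference characters is an assumption, as are the function names; note also that Python strings are compared per Unicode code point, so Bengali vowel signs and conjunct components count as separate characters here.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, insertions, and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference character missing
                current[j - 1] + 1,      # insertion: extra predicted character
                previous[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        previous = current
    return previous[-1]


def character_error_rate(references, hypotheses):
    """Total edit distance divided by the total number of reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return edits / total_chars


# One substituted character in a four-character word.
example_cer = character_error_rate(["abcd"], ["abed"])
```

A CER of 0 means every predicted word matches its reference exactly; values above 1 are possible when the prediction contains many extra characters.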
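Word Error Rate: The Word Error Rate (WER) indicates the proportion of handwritten words that the OCR system does not recognise properly. The WER is calculated by comparing the predicted words with the testing data, formalized as,
\[ \mathrm{WER} = \frac{\text{number of incorrectly recognised words}}{\text{total number of words}} \]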
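Since recognition here operates on isolated word images, the WER can be read as the fraction of words whose prediction is not an exact match. The short sketch below follows that reading; the exact-match criterion and the function name `word_error_rate` are assumptions for illustration.

```python
def word_error_rate(references, hypotheses):
    """Fraction of words whose prediction differs from the reference."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    wrong = sum(1 for ref, hyp in zip(references, hypotheses) if ref != hyp)
    return wrong / len(references)


# Two of the four hypothetical predictions differ from their references.
example_wer = word_error_rate(["a", "b", "c", "d"], ["a", "x", "c", "y"])
```

Under this definition a single wrong character makes the whole word count as an error, which is why a model can have a low CER and still a high WER.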
Floating-point operations (FLOPs) are used to measure the computational cost of the model. The metric counts the number of arithmetic operations needed to run the deep learning model. Lower FLOPs indicate lower time complexity of the model.
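As a rough illustration of how such a count arises, the sketch below estimates the FLOPs of a single convolutional layer and of its depthwise separable counterpart (the building block of MobileNet and Xception). It follows the common convention of two FLOPs per multiply-accumulate and ignores biases and activations; the layer sizes are hypothetical, and this is not the tool used to produce the reported numbers.

```python
def conv2d_flops(out_h, out_w, in_channels, out_channels, kernel_h, kernel_w):
    """Approximate FLOPs of a standard convolution (2 FLOPs per multiply-add)."""
    macs = out_h * out_w * out_channels * in_channels * kernel_h * kernel_w
    return 2 * macs


def depthwise_separable_flops(out_h, out_w, in_channels, out_channels,
                              kernel_h, kernel_w):
    """Depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise_macs = out_h * out_w * in_channels * kernel_h * kernel_w
    pointwise_macs = out_h * out_w * in_channels * out_channels
    return 2 * (depthwise_macs + pointwise_macs)


# Hypothetical layer: 25 x 100 output, 32 -> 64 channels, 3 x 3 kernel.
standard = conv2d_flops(25, 100, 32, 64, 3, 3)
separable = depthwise_separable_flops(25, 100, 32, 64, 3, 3)
```

For this hypothetical layer the separable version needs roughly an eighth of the operations, which illustrates why architectures built from such blocks tend to report lower FLOPs.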
IV-D Results analysis
To build the end-to-end OCR system, we experiment with four well-known architectures. Some modifications are made to these architectures so that they fit the current implementation. The modifications are discussed below:
Well-known deep CNN architectures are used as the baseline of the model to extract useful information from images. In this paper, we focus on the end-to-end architecture more than on the baseline model. Hence, we use four different convolutional models pre-trained on the ImageNet dataset: MobileNet, DenseNet121, Xception, and NASNetMobile. For the implementation, we use the Keras framework. We remove some layers of these architectures to reach a fixed operational output. Figures 5, 6, and 7 illustrate the modifications of the MobileNet, DenseNet121, and Xception architectures, respectively. Further, the removed layers are discussed below.
MobileNet is a well-known convolutional architecture pre-trained on the ImageNet dataset. It has a total of 29 deep convolutional layers. We use the first 11 deep convolutional layers of MobileNet.
DenseNet121: DenseNet121 is a convolutional architecture in which each layer is connected to the deeper layers. From DenseNet121, we use the layers up to dense block 2.
Xception is another convolutional architecture. It has three flows: Entry flow, Middle flow, and Exit flow. In our architecture, we use only the Entry flow without the last max-pooling layer.
NASNetMobile: NASNetMobile is a convolutional architecture that can classify an image into 1000 categories. In our baseline, we use the layers of NASNetMobile up to the 71st activation layer (named 'activation_71' in the Keras framework).
Table I compares all of these baseline architectures on Bengali handwritten OCR. The comparison reports the FLOPs, loss, CER, and WER of all the architectures, calculated on the train, validation, and test sets. It shows that the end-to-end architecture achieves the best result using DenseNet121 with GRU recurrent layers. In DenseNet121, each dense convolutional layer is connected by skip connections to the previous and later dense convolutional layers. As the restructured DenseNet121 network remains substantially deep, the architecture behaves much like the original implementation. Hence, DenseNet121 achieves excellent performance. In contrast, most layers of Xception have been removed, resulting in heavy performance degradation. MobileNet and NASNetMobile give good results in terms of CER. However, the WER values for both baselines are high. As both architectures include fewer residuals, they tend to lose some common character appearances. Therefore, although the models' CER values are low, the repetition of similar character errors results in a higher WER. Overall, DenseNet121 achieves a considerably better CER and WER than the other architectures.
This paper presents an end-to-end OCR system that recognises Bengali handwritten words from images. The proposed OCR system is based on an end-to-end architecture and is evaluated on a rich Bengali handwriting dataset called BanglaWriting. The baseline is implemented and evaluated with four different pre-trained CNN architectures (i.e., MobileNet, Xception, DenseNet121, and NASNetMobile). Further, we use two different types of bidirectional RNN (LSTM and GRU). DenseNet121 with GRU achieves the best CER and WER. However, to improve this system in the future, a neural network designed specifically for this task should be investigated as the baseline model, rather than a pre-trained network. Automatic word segmentation is also essential for building an end-to-end OCR system. This research work is a step towards more robust investigation and implementation of Bengali OCR systems.
- (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- (2017) Isolated Bangla handwritten character recognition with convolutional neural network. In 2017 20th International Conference of Computer and Information Technology (ICCIT), pp. 1–6.
- (2021) Blockchain for decentralized multi-drone to combat COVID-19. arXiv preprint arXiv:2102.00969.
- (2020) Blockchain for multi-robot collaboration to combat COVID-19 and future pandemics. arXiv preprint arXiv:2010.02137.
- (2007) A fuzzy technique for segmentation of handwritten Bangla word images. In 2007 International Conference on Computing: Theory and Applications (ICCTA’07), pp. 427–433.
- (2017) BanglaLekha-Isolated: a multi-purpose comprehensive dataset of handwritten Bangla isolated characters. Data in Brief 12, pp. 103–107.
- (2020) Albumentations: fast and flexible image augmentations. Information 11 (2), pp. 125.
- (1998) A complete printed Bangla OCR system. Pattern Recognition 31 (5), pp. 531–549.
- (2015) Keras.
- (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258.
- (2002) Optical character recognition of Bangla characters using neural network: a better approach. In 2nd ICEE.
- (2014) A benchmark image database of isolated Bangla handwritten compound characters. International Journal on Document Analysis and Recognition (IJDAR) 17 (4), pp. 413–431.
- (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376.
- (2020) Bangla continuous handwriting character and digit recognition using CNN. In Innovations in Computer Science and Engineering, pp. 555–563.
- (2008) A high performance domain specific OCR for Bangla script. In Novel Algorithms and Techniques in Telecommunications, Automation and Industrial Electronics, pp. 174–178.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
- (2007) Matplotlib: a 2D graphics environment. Computing in Science & Engineering 9 (3), pp. 90–95.
- (2003) A complete OCR system for continuous Bengali characters. In TENCON 2003. Conference on Convergent Technologies for Asia-Pacific Region, Vol. 4, pp. 1372–1376.
- (2021) BanglaWriting: a multi-purpose offline Bangla handwriting dataset. Data in Brief 34, pp. 106633.
- (2018) PHDIndic_11: page-level handwritten document image dataset of 11 official Indic scripts for script identification. Multimedia Tools and Applications 77 (2), pp. 1643–1678.
- (2021) FabricNet: a fiber recognition architecture using ensemble ConvNets. IEEE Access 9, pp. 13224–13236.
- (2020) Exploring optimal control of epidemic spread using reinforcement learning. Scientific Reports 10 (1), pp. 1–19.
- (1994) OCR in Bangla: an Indo-Bangladeshi language. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 3-Conference C: Signal Processing (Cat. No. 94CH3440-5), Vol. 2, pp. 269–273.
- (2018) Shape decomposition-based handwritten compound character recognition for Bangla OCR. Journal of Visual Communication and Image Representation 50, pp. 123–134.
- (2020) Segmentation-based recognition system for handwritten Bangla and Devanagari words using conventional classification and transfer learning. IET Image Processing 14 (5), pp. 959–972.
- (2018) EkushNet: using convolutional neural network for Bangla handwritten recognition. Procedia Computer Science 143, pp. 603–610.
- (2018) Line, word, and character segmentation from Bangla handwritten text: a precursor toward Bangla HOCR. In Advanced Computing and Systems for Security, pp. 109–120.
- (2012) CMATERdb1: a database of unconstrained handwritten Bangla and Bangla–English mixed script document image. International Journal on Document Analysis and Recognition (IJDAR) 15 (1), pp. 71–83.
- (2018) Benchmark databases of handwritten Bangla-Roman and Devanagari-Roman mixed-script document images. Multimedia Tools and Applications 77 (7), pp. 8441–8473.
- (2013) ICDAR 2013 handwriting segmentation contest. In 2013 12th International Conference on Document Analysis and Recognition, pp. 1402–1406.
- (2011) The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13 (2), pp. 22–30.
- (1991) Python.
- (2021) List of languages by number of native speakers. Wikipedia, the free encyclopedia.
- (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.