Character segmentation plays an important role in the optical character recognition (OCR) pipeline. One major cause of poor recognition accuracy in OCR systems is character segmentation error. Some previous studies [1, 2, 5, 6, 13, 16, 17, 19, 23] achieve high performance on monolingual text but rely on feature engineering specific to a single character style. Other studies [4, 10, 24, 25] handle multilingual cases but introduce complex processing pipelines. In practice, it is difficult to manually design a set of features suitable for multilingual scenes, so a mixture of languages remains a challenge for existing character segmentation methods. The Chinese/English mixed case is especially difficult because touching characters coexist with disconnected Chinese structures, as shown in Figure 1. To someone unfamiliar with both languages, it is far from obvious that a Chinese character with a disconnected structure (e.g. those in Figure 1) should not be split apart, whereas a pair of touching neighboring English characters (e.g. “DL” or “AI”) should be. The traditional projection-based method falsely breaks up an intact Chinese character with a disconnected structure. A more advanced method with a connected-region merging phase tends to falsely take “DL” as a single character. To segment correctly, a model should implicitly or explicitly remember all valid characters in both languages. Moreover, to deal with the variety of possible font types and sizes, the model should automatically learn the features needed to recognize a valid character, since it is cumbersome to design features for every font.
The ability of deep neural networks to learn features automatically from raw data has significantly advanced research in many fields of computer vision, semantic segmentation among them. Multilingual character segmentation can be reframed as a two-class semantic segmentation problem. Specifically, given a text line image, we classify each horizontal pixel position (pixel column) into one of two categories: splitting point or not. With the problem redefined in this way, we can use the deep architectures that have proved successful in recent semantic segmentation work. Among them we choose fully convolutional networks (FCN).
In this paper, we reframe OCR character segmentation as semantic segmentation and propose an FCN architecture to solve it. We train our FCN model on synthesized samples with simulated random disturbance and show that it is able to
significantly outperform previous methods on Chinese/English mixed printed document images;
generalize well from simulated disturbance to real-world disturbance introduced by photographing;
generalize well across different text content styles;
generalize well across different font styles in most cases;
nicely handle disconnected structure and touching characters.
2 Related Work
Previous Approaches The projection-based method is among the simplest approaches to OCR segmentation. It calculates the average grey value of each pixel column and then splits every blank region in the middle, which makes it vulnerable to disconnected structures and touching characters. Improved methods have since been proposed, but they are specific to a single language [1, 2, 5, 6, 13, 16, 17, 19, 23]. Other studies exploit complex processing pipelines and hand-crafted rules to tackle multilingual cases [4, 10, 24, 25]. There is also work on handwritten character segmentation [26, 22, 9, 3]. Compared with printed characters, handwritten characters often require nonlinear splitting paths rather than vertical splitting lines, which are not necessary for regular font types in normal printed documents.
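As a minimal illustration of the projection-based baseline described above, the following Python sketch computes the mean grey value of each pixel column and places one splitting point in the middle of every blank run. The function name, the list-of-rows image representation and the `blank_threshold` value are assumptions made for illustration; they are not taken from any of the cited methods.

```python
def projection_split_points(image, blank_threshold=250.0):
    """Return the middle column of every run of blank columns.

    image: list of rows, each a list of grey values (0 = ink, 255 = paper).
    blank_threshold: hypothetical cutoff; a column whose mean grey value
    is at or above it is treated as blank.
    """
    height = len(image)
    width = len(image[0])
    # Average grey value of every pixel column.
    column_means = [sum(row[x] for row in image) / height for x in range(width)]

    split_points = []
    run_start = None
    # The trailing -inf sentinel is never blank, so it closes a blank run
    # that reaches the right edge of the image.
    for x, mean in enumerate(column_means + [float("-inf")]):
        is_blank = mean >= blank_threshold
        if is_blank and run_start is None:
            run_start = x
        elif not is_blank and run_start is not None:
            # The run covers columns run_start .. x - 1; split in its middle.
            split_points.append((run_start + x - 1) // 2)
            run_start = None
    return split_points


# Columns 1-2 and 5 contain ink; columns 0, 3-4 and 6 are blank.
demo_row = [255, 0, 0, 255, 255, 0, 255]
demo_points = projection_split_points([demo_row, demo_row])
```

Because every blank run produces a split, a character whose parts are separated by a blank column is cut in two, while touching characters that share no blank column are never separated.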
Semantic Segmentation Semantic segmentation is a sub-field of computer vision. Compared with recognition, it progresses from coarse to fine inference by making a prediction directly at every pixel [12, 15]. It therefore requires the output size of the model to match the original input size. However, the ordinary convolutional layers used in recognition only maintain or reduce the size of feature maps, which motivates the deconvolutional layer.
Deconvolutional Layers Deconvolution is also called up-convolution [8]. It is typically used to expand the size of feature maps in FCN architectures.
Fully Convolutional Networks FCN is prevalent in research on semantic segmentation and object detection. The key feature that distinguishes FCN from an ordinary CNN is that the output size of an FCN is easy to control via deconvolution. FCN is therefore also widely used in tasks where both input and output are images. For example, Simo-Serra et al. [20] use FCN to simplify sketch drawings.
Figure 2: (a) Several training samples and the corresponding output ground truths. Notice that the model outputs a vector of length $W$ (the image width) rather than a matrix of shape $H \times W$ (image height by width). (b) For each character in the image, the left and right margins are given as splitting points. Each splitting point is visualized as a vertical blue line.
3 Proposed Approach
We first define the training task in Section 3.1. Section 3.2 presents the detailed FCN architecture. Section 3.3 describes the post-processing phase that uses the trained model to crop segments. The process of synthesizing training data is described in Section 3.4. To deal with the class imbalance problem, we use a dynamically weighted binary cross entropy loss, which is defined in Section 3.5.
3.1 Training Task Definition
In semantic segmentation form, our model classifies each horizontal pixel position into two classes: splitting point as the positive class and non-splitting point as the negative class. Formally, given an image of height $H$ and width $W$ as input, an FCN outputs a probability vector $p$ of length $W$, where

$$p_i = P(\text{column } i \text{ is a splitting point}), \quad i = 1, \dots, W. \tag{1}$$

For each character in the image, the left and right margins are given as splitting points. For example, if a character extends from the $l$-th column to the $r$-th column, then the ground truth vector $y$ satisfies

$$y_l = y_r = 1,$$

and $y_i = 0$ at every column that is not a character margin.

See Figure 2 for several input images and corresponding output ground truths. In this paper $H$ and $W$ are fixed, with $W = 2048$ (see Section 3.5).
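The ground truth construction above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `margins_to_mask` and the 0-based column indexing are our assumptions, not part of the original implementation.

```python
def margins_to_mask(char_ranges, width):
    """Build the binary ground truth vector for one text line image.

    char_ranges: list of (left, right) column indices (0-based, inclusive)
                 giving the horizontal extent of each character.
    width:       number of pixel columns in the image.
    """
    mask = [0] * width
    for left, right in char_ranges:
        if not (0 <= left <= right < width):
            raise ValueError("character range outside the image")
        # Only the two margins of a character are positive.
        mask[left] = 1
        mask[right] = 1
    return mask


# Two characters: one spanning columns 1..3, one a single column wide.
demo_mask = margins_to_mask([(1, 3), (5, 5)], width=8)
```

Each character contributes at most two positive entries, so the vector is dominated by zeros; this is the class imbalance addressed in Section 3.5.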
In a typical FCN architecture, down-convolution blocks and up-convolution blocks reduce and expand the size of feature maps, respectively. Each down-convolution block is composed of a convolutional layer, a batch normalization layer [7], a max-pooling layer and an activation layer. Each up-convolution block is composed of a deconvolutional layer, a batch normalization layer and an activation layer. In semantic segmentation research, FCN is used to restore both the width and the height of the original input images [12, 15].
Our FCN architecture is almost the same as typical ones, except that only the width of the input images needs to be restored. As defined in Section 3.1, Eq. (1), it simply outputs a vector of length $W$ (the input width), which is equivalent to an image of shape $1 \times W$. To this end, the deconvolutional layers in our FCN expand the width of feature maps but maintain their height. See Figure 3 for details.
All activation layers except the last one are ReLU layers [14]. The last one is a sigmoid layer that produces the output probability vector.
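To make the width-only expansion concrete, the sketch below applies the standard output-size formula of a transposed convolution (no dilation, no output padding) separately to height and width. The kernel, stride and padding values are hypothetical choices that keep the height and double the width; the actual layer settings are those of Figure 3 and are not reproduced here.

```python
def transposed_conv_size(size_in, kernel, stride, padding):
    """Output length of a transposed convolution along one axis
    (no dilation, no output padding)."""
    return (size_in - 1) * stride - 2 * padding + kernel


def up_block_shape(height, width):
    """Feature map shape after one hypothetical up-convolution block.

    Assumed settings (illustration only):
      height axis: kernel 1, stride 1, padding 0 -> height unchanged
      width axis:  kernel 4, stride 2, padding 1 -> width doubled
    """
    new_height = transposed_conv_size(height, kernel=1, stride=1, padding=0)
    new_width = transposed_conv_size(width, kernel=4, stride=2, padding=1)
    return new_height, new_width


# Three such blocks take a 1 x 256 feature map back to 1 x 2048.
shape = (1, 256)
for _ in range(3):
    shape = up_block_shape(*shape)
```

With these assumed settings the height stays at 1 while the width grows by a factor of two per block, which is the behavior the text describes for our deconvolutional layers.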
3.3 Post-processing for Cropping
As described in Section 3.1, during training only two splitting points are given as positive ground truth for each character. During prediction, however, points in the neighborhood of a true positive point are usually classified as positive as well. This is inevitable because neural networks cannot fit the data exactly. In fact, those surrounding points could also be valid splitting points.
Another problem arises when we actually crop out characters for downstream recognition: a segment bounded by two adjacent splitting points could contain a character or just the blank space between characters. The blank segments should be discarded.
We therefore propose a simple post-processing procedure to deal with these issues. First, the probability vector output by the FCN is converted to a binary vector according to a threshold (e.g. 0.5). Second, for each contiguous positive segment, the center point is selected as the splitting point. Third, each pair of adjacent splitting points forms a candidate segment. Finally, we discard the blank segments and output a list of bounding line segments. The complete pipeline is shown in Figure 4.
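The four post-processing steps can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not say how blank segments are detected, so here a segment is treated as blank when no column strictly between its two splitting points contains ink, and `ink_per_column` (the number of ink pixels in each column) is a hypothetical input introduced for that purpose.

```python
def split_points_from_probs(probs, threshold=0.5):
    """Steps 1-2: binarize, then take the center of each positive run."""
    points = []
    run_start = None
    # The trailing -inf sentinel is never positive, so it closes a run
    # that reaches the last column.
    for i, p in enumerate(list(probs) + [float("-inf")]):
        positive = p >= threshold
        if positive and run_start is None:
            run_start = i
        elif not positive and run_start is not None:
            points.append((run_start + i - 1) // 2)
            run_start = None
    return points


def crop_segments(probs, ink_per_column, threshold=0.5):
    """Steps 3-4: pair adjacent splitting points, drop blank segments."""
    points = split_points_from_probs(probs, threshold)
    segments = []
    for left, right in zip(points, points[1:]):
        # Assumed blank test: look only strictly inside the segment,
        # because the splitting points may sit on character margins.
        if any(ink_per_column[left + 1:right]):
            segments.append((left, right))
    return segments


demo_probs = [0.1, 0.9, 0.8, 0.7, 0.1, 0.0, 0.9, 0.2, 0.6, 0.7, 0.1, 0.95]
demo_ink = [0, 0, 1, 3, 2, 3, 1, 0, 1, 2, 2, 1]
demo_segments = crop_segments(demo_probs, demo_ink)
```

In the demo, the run of positives at columns 1 to 3 collapses to the single splitting point 2, and the candidate segment between points 6 and 8 is discarded because its interior holds no ink.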
3.4 Synthesizing Training Data
We focus on camera photographs of Chinese/English mixed printed documents with various fonts. Compared with scanning, camera photographing introduces much more noise, blurring, rotation and distortion, and thus makes the problem more challenging. To approximate these real-world disturbances, we first plot clean text onto a blank image of size $H \times W$ (height by width, as in Section 3.1), then successively apply four simulated random disturbances: rotation, erosion, dilation and Gaussian blurring. Finally we binarize the grey-scale images according to a threshold. We keep track of the left and right margins of each character throughout this process and finally convert them into the corresponding binary mask vector as ground truth. Several training input images and output vectors are visualized in Figure 2.
3.5 Dynamic Weighted Binary Cross Entropy
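Our model performs binary classification at each of the $W = 2048$ horizontal positions. Therefore, given the FCN’s output probability vector $p$ and the ground truth vector $y$, we can simply define the binary cross entropy loss function as

$$L(p, y) = -\frac{1}{W} \sum_{i=1}^{W} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right].$$

However, the ratio of positive to negative ground truths is not balanced. Most of the ground truths are negative, as shown in Figure 2. In our experiments, unbalanced classes slow down model convergence or even leave the model stuck in a local optimum. When stuck, the model predicts negative at every horizontal position. To tackle this issue, we define a weighted binary cross entropy loss function

$$L_w(p, y) = -\frac{1}{W} \sum_{i=1}^{W} \left[ w_{pos}\, y_i \log p_i + w_{neg}\, (1 - y_i) \log (1 - p_i) \right].$$

We initialize $w_{pos}$ to 0.9 and $w_{neg}$ to 0.1. After each iteration we use a heuristic rule to dynamically adjust the weights according to the average positive accuracy $a_{pos}$ and the average negative accuracy $a_{neg}$ of the last mini-batch, where

$$a_{pos} = \frac{\#\{i : y_i = 1 \text{ and position } i \text{ is predicted positive}\}}{\#\{i : y_i = 1\}}, \qquad a_{neg} = \frac{\#\{i : y_i = 0 \text{ and position } i \text{ is predicted negative}\}}{\#\{i : y_i = 0\}}.$$

If $a_{pos} < a_{neg}$ then we increase $w_{pos}$ and decrease $w_{neg}$, otherwise we increase $w_{neg}$ and decrease $w_{pos}$. This strategy balances the model performance on positive and negative classes throughout training and speeds up convergence. See Algorithm 1 for details.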
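A minimal Python sketch of a weighted binary cross entropy with accuracy-driven weight adjustment is given below. It is not a reproduction of Algorithm 1: the averaging over positions, the update direction (raise the weight of whichever class currently has the lower accuracy), the step size `step`, the clamping bounds `low` and `high`, and the 0.5 decision threshold are all assumptions chosen for illustration.

```python
import math


def weighted_bce(probs, targets, w_pos, w_neg, eps=1e-7):
    """Weighted binary cross entropy averaged over all positions."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def class_accuracies(probs, targets, threshold=0.5):
    """Accuracy on positive positions and on negative positions."""
    pos_hits = pos_total = neg_hits = neg_total = 0
    for p, y in zip(probs, targets):
        predicted = 1 if p >= threshold else 0
        if y == 1:
            pos_total += 1
            pos_hits += predicted
        else:
            neg_total += 1
            neg_hits += 1 - predicted
    # Assumption: a class absent from the batch counts as fully accurate.
    acc_pos = pos_hits / pos_total if pos_total else 1.0
    acc_neg = neg_hits / neg_total if neg_total else 1.0
    return acc_pos, acc_neg


def adjust_weights(w_pos, w_neg, acc_pos, acc_neg,
                   step=0.01, low=0.05, high=0.95):
    """Shift weight toward the class with the lower accuracy.

    step, low and high are hypothetical values, not from the paper.
    """
    if acc_pos < acc_neg:
        w_pos, w_neg = w_pos + step, w_neg - step
    else:
        w_pos, w_neg = w_pos - step, w_neg + step
    clamp = lambda w: min(max(w, low), high)
    return clamp(w_pos), clamp(w_neg)


# Start from the initial weights stated in the text.
w_pos, w_neg = 0.9, 0.1
acc_pos, acc_neg = class_accuracies([0.9, 0.2, 0.1, 0.8], [1, 1, 0, 0])
w_pos, w_neg = adjust_weights(w_pos, w_neg, acc_pos=0.2, acc_neg=0.99)
```

In a training loop, `class_accuracies` would be evaluated on the last mini-batch and `adjust_weights` called once per iteration before the next loss computation.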
In this section, we describe the datasets for experiments, including those built by photographing printed documents and those synthesized as described in Section 3.4.
4.1.1 Photographed Dataset
The first dataset is built by photographing, as follows. First, text content is randomly extracted from the Baidu Baike corpus, printed in various font types, and photographed with an ordinary phone camera. Second, we apply a series of traditional OCR techniques (denoising, binarization, line segmentation, etc.) to collect a set of text line images. Finally, we hand-label the bounding line segment annotation for each character.
Because each text line image typically contains several dozen characters, annotating even one sample takes a long time. Thus only 50 text line image samples are collected. Nevertheless, they contain 2710 characters in total, which is enough for reliable evaluation. In the following sections, we refer to this dataset as “Photo-Normal”. Several samples are shown in Figure 5.
4.1.2 Synthesized Datasets
The quantity and diversity of the Photo-Normal dataset are somewhat limited. Moreover, it can only evaluate our model’s ability to generalize from simulated disturbance to real-world disturbance. To evaluate generalization further, we synthesize a series of datasets from the combinations of two text content styles and 36 font types (specified in the next paragraph). Each combination is split into a training part and an evaluation part, producing 144 ($2 \times 36 \times 2$) datasets in total. Each training set contains 3000 samples and each evaluation set contains 30 samples. The total number of characters is more than 10 million.
As for text content styles, the first style, called “normal”, simply refers to normal content from the Baidu Baike corpus, and the second style, called “chaotic”, is obtained by randomly shuffling the characters of normal text. As for font styles, see Table 3 for all 36 font types used in our experiments.
In the following sections, we refer to the training dataset with normal text content and font SIMYOU as “Train-Normal-SIMYOU”, and so on.
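The dataset and sample counts used in the experiments follow directly from these numbers. The short check below reproduces the counts quoted in Sections 4.4.1 to 4.4.3; the variable names are ours and serve only as illustration.

```python
CONTENT_STYLES = 2        # "normal" and "chaotic"
FONT_TYPES = 36
SPLITS = 2                # training part and evaluation part
TRAIN_SAMPLES_PER_SET = 3000
EVAL_SAMPLES_PER_SET = 30

# One dataset per (content style, font type, split) combination.
num_datasets = CONTENT_STYLES * FONT_TYPES * SPLITS

# Train-Normal-All / Train-Chaotic-All: one content style, all fonts.
train_single_style_all = FONT_TYPES * TRAIN_SAMPLES_PER_SET
# Train-All-All: both content styles, all fonts.
train_all_all = CONTENT_STYLES * train_single_style_all
# Eval-Normal-All / Eval-Chaotic-All: one content style, all fonts.
eval_single_style_all = FONT_TYPES * EVAL_SAMPLES_PER_SET
# Train-Normal-exclude-XXX: one content style, all fonts but one.
train_exclude_one_font = (FONT_TYPES - 1) * TRAIN_SAMPLES_PER_SET
```

The values evaluate to 144 datasets, 108000 and 216000 training samples, 1080 evaluation samples, and 105000 training samples when one font type is held out.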
4.2 Evaluation Metric
We evaluate segmentation accuracy by matching predictions and ground truths. Given a text line image, FCN and post-processing procedure output a list of bounding line segments. They are aligned with the ground truth, which is a list of bounding line segments. For each predicted segment and each true segment , we denote the number of ’s horizontal pixels covered by as and the number of ’s horizontal pixels not covered as . Then we say matches if and only if
where are thresholds.
Taking is equivalent to requiring exactly matched with . However, exact matching is not necessary in practice due to some blank space between characters. Therefore we take , and in our experiments for all the methods we compare, which is fair.
Given the number of matched pairs , we define the segmentation accuracy as
We use mini-batch stochastic gradient descent for 50000 iterations with batch size 8 and momentum coefficient 0.9. At each iteration a mini-batch of samples is randomly selected from the training samples. The learning rate is initialized to 0.0001 and divided by 10 at the 20000-th and 40000-th iterations. Training takes approximately 50 minutes on a single GPU (NVIDIA GeForce GTX TITAN X).
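The step-wise learning rate schedule described above can be written as a small function. This is a sketch, with the function name and the convention that the reduced rate takes effect from the milestone iteration onward being our assumptions.

```python
def learning_rate(iteration, base_lr=1e-4, milestones=(20000, 40000), factor=10.0):
    """Learning rate in effect at a given iteration.

    The rate starts at base_lr and is divided by `factor` at each milestone.
    Assumption: the reduced rate applies from the milestone iteration itself.
    """
    lr = base_lr
    for milestone in milestones:
        if iteration >= milestone:
            lr /= factor
    return lr


# Rates at the start, after the first drop and after the second drop.
schedule = [learning_rate(i) for i in (0, 20000, 40000)]
```

Under this schedule the rate is 0.0001 for the first 20000 iterations, 0.00001 for the next 20000, and 0.000001 for the final 10000.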
4.4 Quantitative and Qualitative Evaluation
In this section, we quantitatively evaluate whether our model can generalize
from simulated disturbance to real-world disturbance,
between normal text content and chaotic text content,
among different font types.
We also qualitatively evaluate its performance on the hardest part of the Chinese/English mixed case: the coexistence of disconnected components and touching characters.
Generalization over font sizes is trivial and already covered throughout these experiments, so it is not evaluated separately.
4.4.1 Generalization from Simulation to Real-World
Our FCN is trained on synthesized data because it is difficult to collect a large number of real-world samples with segment annotations. However, simulated disturbance in synthesized samples is certainly not identical to real-world disturbance. The FCN must generalize well beyond simulation to deal with photographed printed document images.
To verify this, we train three FCN instances on three datasets and evaluate them on Photo-Normal dataset. Training sets are
Train-Normal-All: training samples of normal text,
Train-Chaotic-All: training samples of chaotic text,
Train-All-All: union of both above.
Sample counts are 108000, 108000 and 216000, respectively. Each of them contains all the 36 font types.
In this experiment we compare our approach with four baselines: the traditional projection-based method, the region-merging based method designed for Chinese, the connected-component based method designed for English, and Tesseract [21], an open source OCR engine still in active development (https://github.com/tesseract-ocr/tesseract). In the following sections they are referred to as PROJ, CN, EN and Tesseract, respectively.
The results are shown in Table 1. Our FCN instances significantly outperform the baselines. Among the first three baselines, PROJ is the simplest model but outperforms CN and EN, because it is not designed specifically for a single language. As a modern OCR engine targeting various languages, Tesseract achieves decent accuracy but is still outperformed by a large margin. Among the three FCN instances, the one trained on Train-Normal-All achieves the best result, because its text content style is the most similar to that of the Photo-Normal dataset.
All FCN instances achieve over 98% accuracy. Thus we conclude that they generalize well from simulated disturbance to real-world disturbance.
4.4.2 Generalization across Text Content Styles
An FCN has the advantage of using a wide receptive field on the input image when predicting at a horizontal pixel position. This advantage can also be a disadvantage, because it introduces the risk that the FCN overfits a certain style of text content. For example, if character A is always surrounded by B and C in the training text, the FCN may fit this pattern. When the test text has A surrounded by D and E, the FCN will probably make a mistake. We want to know how serious this problem is.
In this section, five datasets are used for training and evaluation:
Train-Normal-All: training samples of normal text,
Train-Chaotic-All: training samples of chaotic text,
Train-All-All: union of above,
Eval-Normal-All: evaluation samples of normal text,
Eval-Chaotic-All: evaluation samples of chaotic text.
Sample counts are 108000, 108000, 216000, 1080 and 1080, respectively.
The results are shown in Table 2. The best performance on Eval-Normal-All and on Eval-Chaotic-All is achieved by training on Train-Normal-All and Train-Chaotic-All, respectively. The first two rows show that our model generalizes well from chaotic style to normal style and only slightly worse from normal style to chaotic style.
This experiment suggests that in practice, if the text content style we want to finally work on can be accessed, it is optimal to train on the same content style. If not, training on chaotic text content still works well.
4.4.3 Generalization across Font Types
Real-world documents contain various font types. In practice we can include as many font types as possible in the training sets to improve generalization. However, some particular fonts of interest may still be missing, which requires the model to generalize across font types.
To evaluate this generalization ability, three groups of datasets are used in this section:
Train-Normal-All: training samples of normal text,
Train-Normal-exclude-XXX: training samples of normal text, containing font types except XXX,
Eval-Normal-only-XXX: evaluation samples of normal text, containing only one font type XXX.
Sample counts of each dataset in the three groups are 108000, 105000 and 1080, respectively.
The results are shown in Table 3. For each font type, we train the FCN on a dataset that does not include this font type and on a dataset that does, corresponding to the second and third columns. The second column shows that the FCN generalizes well to unseen font types in most cases. The third column shows that including the corresponding font type during training further improves accuracy.
Nevertheless, there are several bad cases highlighted in the table: STCAIYUN, STLITI and STXINGKA. Their font styles are illustrated in Figure 6.
The first font type, STCAIYUN, is completely different from the others because of its hollow structure. However, when it is included in the training set, segmentation accuracy recovers. In this case, the FCN generalizes badly to a particular font type but can restore decent accuracy once that font type is included in training.
However, for the second and third font types, accuracy cannot be restored even when they are included in the training set. As shown in Figure 7, most segmentation errors arise from English characters. This is because both STLITI and STXINGKA render English characters in an italic style. To properly segment italic characters, the model should predict oblique lines rather than vertical splitting lines, which is impossible in our FCN architecture.
This experiment shows that our FCN generalizes well in most cases, except for fonts with completely different styles or italic styles. The first issue can be fixed by including such special font types in the training sets. As for the second issue, we discuss a possible solution in Section 5.
4.4.4 Handling Difficult Cases
The main challenges of Chinese/English mixed character segmentation are twofold: first, character widths vary within and across languages; second, disconnected structures and touching characters coexist.
Without the first challenge, we could calculate the width of each connected component in a text line image and take the mode as a unified character width, as is done in traditional OCR techniques. Without the second challenge, we could either tune the threshold in projection-based methods to tackle touching characters, or use a region-merging phase to tackle disconnected structures.
Nevertheless, our FCN architecture handles these difficulties well by automatically utilizing useful features. Typical samples are shown in Figure 8.
5 Conclusion and Future Work
In this paper we tackle Chinese/English character segmentation for printed document images. By reframing it as a two-class semantic segmentation problem, we take advantage of the successful deep neural architecture called fully convolutional networks (FCN) in the field of semantic segmentation. Trained on synthesized samples with simulated random disturbance, FCN can accurately perform binary classification at each horizontal position on text line images to decide whether this position should be a splitting point or not. Our approach significantly outperforms traditional methods on segmentation accuracy. Experiments show that it is able to generalize from simulated disturbance to real-world disturbance, generalize between normal and chaotic text content styles, generalize among various font types and properly handle the coexistence of disconnected structure and touching characters.
The experimental result in Section 4.4.3 shows that our approach performs badly on characters in italic fonts, because the FCN simply predicts vertical splitting lines rather than oblique ones. In addition, there exist even more difficult cases where two characters are so close that they can only be split by curved lines. A possible solution and step forward is to reframe character segmentation as an instance segmentation problem. Instance segmentation is also called simultaneous detection and segmentation [11]. In this task, instance-level information and a pixel-wise accurate mask are estimated for each object. Ideally, with instance segmentation every single character can be carved out exactly and cleanly. In the future we will work on this possible solution.
-  Y.-K. Chen and J.-F. Wang. Segmentation of single-or multiple-touching handwritten numeral string using background and foreground analysis. IEEE Transactions on pattern analysis and machine intelligence, 22(11):1304–1317, 2000.
-  A. Choudhary, R. Rishi, and S. Ahlawat. A new approach to detect and extract characters from off-line printed images and text. Procedia Computer Science, 17:434–440, 2013.
-  N. Dave. Segmentation methods for hand written character recognition. International Journal of Signal Processing, Image Processing and Pattern Recognition, 8:154–164, 2015.
-  U. Garain and B. B. Chaudhuri. Segmentation of touching characters in printed devnagari and bangla scripts using fuzzy multifactorial analysis. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 32(4):449–459, 2002.
-  A. R. Himamunanto and A. R. Widiarti. Javanese character image segmentation of document image of hamong tani. In Digital Heritage International Congress (DigitalHeritage), 2013, volume 1, pages 641–644. IEEE, 2013.
-  W. L. Hwang and F. Chang. Character extraction from documents using wavelet maxima. Image and Vision Computing, 16(5):307–315, 1998.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In D. Blei and F. Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 448–456. JMLR Workshop and Conference Proceedings, 2015.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. arXiv preprint arXiv:1603.08155, 2016.
-  A. Kaur, S. Rani, and P. Singh. Segmentation of isolated and touching characters in handwritten gurumukhi word using clustering approach. 2014.
-  S.-W. Lee, D.-J. Lee, and H.-S. Park. A new methodology for gray-scale character segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10):1045–1050, 1996.
-  S. Liu, X. Qi, J. Shi, H. Zhang, and J. Jia. Multi-scale patch aggregation (mpa) for simultaneous detection and segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
-  Y. Mei, X. Wang, and J. Wang. A chinese character segmentation algorithm for complicated printed documents. International Journal of Signal Processing, Image Processing and Pattern Recognition, 6(3):91–100, 2013.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In J. Fürnkranz and T. Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814. Omnipress, 2010.
-  H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
-  S. Nomura, K. Yamanaka, O. Katai, H. Kawakami, and T. Shiose. A novel adaptive morphological approach for degraded character image segmentation. Pattern Recognition, 38(11):1961–1975, 2005.
-  P. P. Roy, U. Pal, J. Lladós, and M. Delalandre. Multi-oriented touching text character segmentation in graphical documents using dynamic programming. Pattern Recognition, 45(5):1972–1983, 2012.
-  P. Sahare and S. B. Dhok. Review of text extraction algorithms for scene-text and document images. IETE Technical Review, pages 1–21, 2016.
-  D. Senapati, S. Rout, and M. Nayak. A novel approach to text line and word segmentation on odia printed documents. In Computing Communication & Networking Technologies (ICCCNT), 2012 Third International Conference on, pages 1–6. IEEE, 2012.
-  E. Simo-Serra, S. Iizuka, K. Sasaki, and H. Ishikawa. Learning to simplify: fully convolutional networks for rough sketch cleanup. ACM Transactions on Graphics (TOG), 35(4):121, 2016.
-  R. Smith and G. Inc. An overview of the tesseract ocr engine. In Proc. 9th IEEE Intl. Conf. on Document Analysis and Recognition (ICDAR), pages 629–633, 2007.
-  J. Tan, W.-X. Wang, M.-S. Feng, and X.-X. Zuo. A new approach based on ncut clustering algorithm for signature segmentation. AASRI Procedia, 1:14–20, 2012.
-  J. Tse, C. Jones, D. Curtis, and E. Yfantis. An ocr-independent character segmentation using shortest-path in grayscale document images. In Machine Learning and Applications, 2007. ICMLA 2007. Sixth International Conference on, pages 142–147. IEEE, 2007.
-  K. Wang, J. Jin, and Q. Wang. High performance chinese/english mixed ocr with character level language identification. In 2009 10th International Conference on Document Analysis and Recognition, pages 406–410. IEEE, 2009.
-  K. Wang, J.-M. Jin, W.-M. Pan, G.-S. Shi, and Q.-R. Wang. Mixed chinese/english document auto-processing based on the periodicity. In Machine Learning and Cybernetics, 2004. Proceedings of 2004 International Conference on, volume 6, pages 3616–3619 vol.6, Aug 2004.
-  S. Wshah, Z. Shi, and V. Govindaraju. Segmentation of arabic handwriting based on both contour and skeleton segmentation. In 2009 10th International Conference on Document Analysis and Recognition, pages 793–797. IEEE, 2009.