
Chinese Character Recognition with Radical-Structured Stroke Trees

11/24/2022
by Haiyang Yu, et al.
Fudan University

The rapid development of deep learning has brought remarkable progress to Chinese character recognition. However, it remains a great challenge when the characters encountered at test time follow a distribution different from that of the training dataset. Existing methods based on a single-level representation (character level, radical level, or stroke level) may be either too sensitive to distribution changes (e.g., induced by blurring, occlusion, and zero-shot problems) or too tolerant of one-to-many ambiguities. In this paper, we represent each Chinese character as a stroke tree organized according to its radical structures, so as to fully exploit the merits of both the radical and stroke levels. We propose a two-stage decomposition framework, where a Feature-to-Radical Decoder perceives radical structures and radical regions, and a Radical-to-Stroke Decoder further predicts the stroke sequences from the features of the radical regions. The generated radical structures and stroke sequences are encoded as a Radical-Structured Stroke Tree (RSST), which is fed to a Tree-to-Character Translator based on the proposed Weighted Edit Distance to match the closest candidate character in the RSST lexicon. Extensive experimental results demonstrate that the proposed method outperforms state-of-the-art single-level methods by increasing margins as the distribution difference becomes more severe in the blurring, occlusion, and zero-shot scenarios, which validates the robustness of the proposed method.

1 Introduction

Since Chinese is used by a huge population of 1.31 billion people, Chinese character recognition (CCR) has attracted extensive research interest Pal et al. (2012); Yang et al. (2015); Meng et al. (2001); Xiaobo et al. (2003); Zu et al. (2022); Chen et al. (2021b). Although CCR has been explored for more than thirty years, existing methods remain vulnerable to some challenging situations, especially when the distribution of characters for testing differs from that for training. Concretely, discrepancies in the categorical distribution (e.g., characters to be tested are absent from the training dataset, namely zero-shot problems) or the visual distribution (e.g., blurring or occlusions in real-world scene images) still bring difficulties to existing methods.

Existing CCR methods can be categorized into three approaches in terms of the recognized components: character-based, radical-based, and stroke-based. The character-based approach Xiao et al. (2017); Zhang et al. (2017); Wang and Du (2021) usually treats each character as a unique category and captures global features such as contours to obtain the character predictions. However, it is sensitive to categorical distribution changes, especially when the tested character does not appear in the training dataset, namely the character zero-shot problem (see the example “You” in Fig. 1). To tackle the character zero-shot problem, the radical-based approach Wang et al. (2018, 2019) decomposes characters at the radical level and represents each Chinese character as a tree of radicals organized according to its radical structures. Although it can alleviate the character zero-shot problem to some extent, another conundrum called the radical zero-shot problem may arise if some uncommon radicals are absent from the training dataset. Moreover, the radical-based approach is sensitive to the visual distribution changes induced by blurring and occlusions (see the example “An” in Fig. 1). Further, the stroke-based approach Chen et al. (2021a) decomposes characters into sequences of strokes (i.e., the atomic units of Chinese characters), which fundamentally sidesteps categorical distribution changes. However, this approach leads to an undesirable one-to-many problem, as one stroke sequence may correspond to more than one character (see the example “Dan” in Fig. 1). Therefore, none of the aforementioned single-level approaches can tackle all these challenges by itself. In this paper, we seek to answer the question: is there a better way to represent Chinese characters for tackling CCR?

To answer this question, we view each Chinese character as a Radical-Structured Stroke Tree (RSST) to fully exploit the merits of both radical-level and stroke-level representations. Specifically, we propose a framework consisting of a Feature-to-Radical Decoder (FRD) and a Radical-to-Stroke Decoder (RSD). The former predicts the radical structures of the input character image and perceives the radical regions; the latter takes the features of the radical regions as input to predict the corresponding stroke sequences. The intermediate outputs of the FRD and RSD are integrated into an RSST to represent each Chinese character. Considering that the predicted stroke tree may not precisely match one specific character, in the Tree-to-Character Translator (TCT) we design the Weighted Edit Distance as a metric that takes both radical structures and strokes into account. Extensive experimental results show that our method outperforms existing single-level methods by increasing margins as the distribution difference becomes more severe in the blurring, occlusion, and zero-shot scenarios, which validates the robustness of the proposed RSST representation. In summary, the contributions are as follows:

  • We represent each Chinese character as an RSST to exploit the merits of the radical and stroke levels for tackling CCR.

  • We propose a hierarchical decomposition framework, where a Feature-to-Radical Decoder predicts the radical structures and the locations of radicals, and a Radical-to-Stroke Decoder predicts the strokes of each radical.

  • To alleviate the one-to-many problem, we propose the Weighted Edit Distance taking both radical structures and strokes into account to search for the most reasonable candidate in the pre-defined RSST lexicon.

  • Our method outperforms the SOTA single-level methods when the distribution changes greatly during testing, which further validates the superiority of our method.

Figure 2: The preliminary knowledge of decomposing Chinese characters at the radical level and stroke level.
Figure 3: The overall architecture of the proposed character decomposition framework. The radical structures from the radical-level prediction and the strokes from the stroke-level prediction are combined into the RSST.

2 Preliminaries

2.1 Chinese Character Decomposition

According to the national standard GB18030-2005, there are 70,244 Chinese characters in total. Among them, 3,755 characters are commonly-used Level-1 characters. Each Chinese character can be decomposed into a radical sequence or a stroke sequence in a specific order. At the radical level, there are 12 radical structures (see Fig. 2(a)) and 514 radicals for the 3,755 Level-1 characters. Each character can be represented either as a radical-structured tree or as a radical sequence obtained through a depth-first search (see Fig. 2(b)). At the stroke level, each Chinese character is composed of five basic stroke categories: horizontal, vertical, left-falling, right-falling, and turning. Each category contains two or more stroke instances (see Fig. 2(c)).
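To make the two decomposition levels concrete, the toy structure below sketches how a left-right character such as “明” (composed of “日” and “月”) could be encoded as a radical-structured stroke tree; the dictionary layout and the English stroke labels are our own illustrative choices, not the paper's data format.

```python
# Illustrative sketch (not the authors' data format): a radical-structured
# stroke tree for the left-right character "明" = "日" + "月".
# Strokes use the five basic categories introduced in Sec. 2.1.
rsst_ming = {
    "structure": "left-right",  # one of the 12 radical structures
    "children": [
        {"strokes": ["vertical", "turning", "horizontal", "horizontal"]},      # "日"
        {"strokes": ["left-falling", "turning", "horizontal", "horizontal"]},  # "月"
    ],
}

def flatten(node):
    """Depth-first traversal recovers the flat stroke sequence used at the stroke level."""
    if "strokes" in node:
        return list(node["strokes"])
    return [s for child in node["children"] for s in flatten(child)]

print(flatten(rsst_ming))  # 8 strokes in writing order
```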

2.2 Chinese Character Recognition

According to the way characters are decomposed, existing CCR methods can be divided into three approaches, i.e., the character-based, radical-based, and stroke-based approaches.

Character-Based Approach

As convolutional neural networks Krizhevsky et al. (2012) thrived over the last decade, CCR has been pushed forward by a large margin. For example, MDCNN Cireşan and Meier (2015) designs a series of deep neural networks consisting of convolutional blocks and ensembles their results to produce the final prediction. Zhang et al. Zhang et al. (2017) view traditional features (e.g., Gabor and HoG maps) as priors and incorporate them into neural networks to learn more robust representations. In Xiao et al. (2019), the authors propose a template-instance loss, which regards printed fonts as prototypes Snell et al. (2017) to constrain features of the same class to be close in the latent space. Generally, the character-based approach is sensitive to categorical distribution changes and usually fails to recognize characters that have not appeared during training, namely the character zero-shot problem.

Radical-Based Approach In the past five years, several methods have tried to represent each Chinese character at the radical level to alleviate the limitations induced by categorical distribution differences Li et al. (2018); Zhou et al. (2016); Tang et al. (2017); Xu et al. (2019); Cheng et al. (2016); Zhong et al. (2016). For instance, inspired by image captioning, DenseRAN Wang et al. (2018) makes the first attempt to iteratively generate the radical sequence of a given character, which is then matched against a collection of pre-defined ideographic description sequences (IDS) using the edit distance to yield the final character prediction. HDE Cao et al. (2020) manually designs embeddings for each character using radical-composition knowledge and follows an embedding-matching strategy for prediction in the test stage. Although these methods alleviate the problems caused by categorical distribution differences, as the distribution discrepancy further increases, the radical zero-shot problem arises when some uncommon radicals are absent from the training dataset.

Stroke-Based Approach To fundamentally address the problems incurred by the categorical distribution discrepancy, Chen et al. Chen et al. (2021a) decompose each Chinese character into a stroke sequence. However, representing characters at the stroke level leads to a severe one-to-many problem, as one stroke sequence may correspond to many characters. Although this approach shows superiority in addressing the zero-shot problems, its performance drops markedly as the visual distribution discrepancy increases.

3 Methodology

The architecture of our method is shown in Fig. 3. Given an input character image, we utilize an encoder based on ResNet-34 He et al. (2016) to extract feature maps, which are sequentially sent to two decoders and a translator to yield the character-level prediction. Different from previous methods that focus on a single level, we represent each character as an RSST to fully exploit the merits of different levels.
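As a reading aid, the following minimal PyTorch-style sketch mirrors the pipeline described above. The module interfaces, tensor shapes, and return values are our assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class RSSTRecognizer(nn.Module):
    """Sketch of the two-stage decomposition pipeline (interfaces are assumptions)."""
    def __init__(self, frd: nn.Module, rsd: nn.Module):
        super().__init__()
        backbone = resnet34()
        # Keep only the convolutional stages; drop average pooling and the classifier.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.frd = frd  # Feature-to-Radical Decoder (Transformer-based)
        self.rsd = rsd  # Radical-to-Stroke Decoder (Transformer-based)

    def forward(self, images: torch.Tensor):
        feats = self.encoder(images)           # (B, C, H, W) feature maps
        radicals, attn_maps = self.frd(feats)  # radical structures + radical regions
        strokes = self.rsd(feats, attn_maps)   # stroke sequences per radical region
        # The radical structures and stroke sequences are later combined into an
        # RSST and translated to a character by the Tree-to-Character Translator.
        return radicals, strokes
```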

3.1 Feature-to-Radical Decoder

The Feature-to-Radical Decoder (FRD) aims to decompose each character at the radical level. Specifically, we build this decoder upon the Transformer Vaswani et al. (2017), which has recently proven effective in vision tasks Dosovitskiy et al. (2021). The feature maps generated by the encoder are first flattened into a sequence and then fed to the Transformer together with the right-shifted radical-level label, following the teacher-forcing manner used in Shi et al. (2018). Through the Transformer layers (see detailed configurations in the Supplementary Material), the input right-shifted sequence is transformed into the predicted radical sequence. In principle, we could simply calculate a cross-entropy loss to supervise the radical-level predictions. However, due to the severe class imbalance problem induced by the radical-level decomposition mentioned in Chen et al. (2021a), we observe that the FRD struggles to recognize low-frequency radicals. In this case, the generated attention maps cannot precisely correspond to the position of the radical at each time step, which brings difficulties to the downstream Radical-to-Stroke Decoder (see more discussions in the Supplementary Material). Therefore, we replace the explicit radicals in the original radical-level labels with their relative positions in the corresponding radical tree for implicit supervision (see “Radical GT” in Fig. 3). We denote the modified radical-level label as R and the predicted sequence as R̂. Finally, the FRD is supervised by the cross-entropy loss as follows:

$\mathcal{L}_{rad} = -\sum_{t} \log P(\hat{R}_t = R_t)$    (1)
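As a concrete illustration of the implicit supervision described above, the sketch below converts a radical-level label into its implicit counterpart by replacing each explicit radical with its order of appearance, while keeping the radical-structure tokens unchanged; the token formats are hypothetical and only meant to convey the idea.

```python
def to_implicit_label(radical_label, structure_tokens):
    """Replace explicit radicals with their order of appearance, keeping the
    radical-structure tokens unchanged (hypothetical token formats)."""
    implicit, order = [], 0
    for token in radical_label:
        if token in structure_tokens:
            implicit.append(token)             # structure tokens stay explicit
        else:
            order += 1
            implicit.append(f"<pos_{order}>")  # hypothetical position placeholder
    return implicit

# A left-right character with two radicals (tokens are illustrative).
print(to_implicit_label(["⿰", "日", "月"], {"⿰"}))
# -> ['⿰', '<pos_1>', '<pos_2>']
```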

3.2 Radical-to-Stroke Decoder

The Radical-to-Stroke Decoder (RSD) aims to further decompose the FRD’s intermediate outputs (i.e., the features of radical regions) into stroke sequences. The RSD is also built upon the Transformer, with the same architecture as that used in the FRD. Through the FRD, a sequence of attention maps is generated, one per radical-level time step, up to the maximum length of the radical sequence. We simply multiply the flattened feature maps by these attention maps to yield the radical-focused features. To further represent Chinese characters at the stroke level, we feed the right-shifted stroke-level label and the radical-focused features to the RSD to generate the stroke-level prediction Ŝ. We denote the ground truth as S and define S_t at the t-th time step in the following two ways: (1) if the time step corresponds to a radical structure, we concatenate the stroke sequences of all its leaf nodes in depth-first-search order as S_t; (2) if it corresponds to a radical, we simply set S_t to the stroke sequence of that radical (see “Stroke GT” in Fig. 3). In this way, the attention maps of radical structures can cover all the regions of their leaf nodes so as to yield more accurate radical-structure predictions. Finally, the RSD is supervised by the cross-entropy loss:

$\mathcal{L}_{str} = -\sum_{t} \log P(\hat{S}_t = S_t)$    (2)

Based on these preparations, we integrate the radical structures generated by the FRD and the stroke sequences of the leaf nodes generated by the RSD to construct an RSST for representing each Chinese character (see Fig. 3). The tree-to-character translation will be introduced in Sec. 3.4.
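A minimal sketch of the radical-focused feature computation described above, assuming the FRD attention maps are stored as one flattened map per decoding time step; the tensor shapes and names are our assumptions.

```python
import torch

def radical_focused_features(feat_maps: torch.Tensor,
                             attn_maps: torch.Tensor) -> torch.Tensor:
    """Pool encoder features with the FRD attention maps (shapes are assumptions).

    feat_maps: (B, C, H, W) encoder output.
    attn_maps: (B, T, H*W) one attention map per radical-level time step.
    Returns:   (B, T, C) one radical-focused feature vector per time step.
    """
    b, c, h, w = feat_maps.shape
    flat = feat_maps.view(b, c, h * w)                  # (B, C, H*W)
    return torch.bmm(attn_maps, flat.transpose(1, 2))   # (B, T, C)

# Usage with dummy tensors.
feats = torch.randn(2, 512, 4, 4)
attn = torch.softmax(torch.randn(2, 30, 16), dim=-1)
print(radical_focused_features(feats, attn).shape)      # torch.Size([2, 30, 512])
```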

Figure 4: The illustration of tree distance and stroke distance.

3.3 Overall Loss Function

The overall loss function is calculated as follows:

$\mathcal{L} = \mathcal{L}_{str} + \lambda \, \mathcal{L}_{rad}$    (3)

where λ is a hyperparameter balancing the two loss terms, which will be discussed in Sec. 4.3.

3.4 Tree-to-Character Translator

We design a Tree-to-Character Translator (TCT), which is only used during testing, to transform the generated tree into a specific character through rectification and matching (i.e., Distance Calculator and Similarity Calculator in Fig. 3).

Rectification Since the predicted tree may fail to match any tree in the pre-defined RSST lexicon, which contains the tree representations of all characters, it is necessary to rectify the prediction to the closest candidate. In the stroke-based method Chen et al. (2021a), the authors simply used the edit distance to rectify the predicted stroke sequence. However, the stroke distance alone is not a robust metric since it ignores the spatial cues (i.e., the radical structures). An illustrative example is shown in the Supplementary Material.

To mitigate the aforementioned problems, we propose a Weighted Edit Distance (WED) that takes the tree structure of the predicted sequence into account. Inspired by HDE Cao et al. (2020), we assign different weights to the radical structures and stroke sequences according to their levels in the tree. Practically, for an element at the l-th level, we set its weight to α^l (see Fig. 4), where α is empirically set to 0.5. To calculate the distance between the predicted tree T̂ and a candidate T_c in the RSST lexicon, we define the state transition equation for dynamic programming as follows:

$d_{i,j} = \min\bigl( d_{i-1,j} + w_i,\;\; d_{i,j-1} + v_j,\;\; d_{i-1,j-1} + w_i \cdot \tfrac{ED(\hat{t}_i,\, c_j)}{len(\hat{t}_i)} \bigr)$    (4)

where d_{i,j} denotes the distance between the first i elements of T̂ and the first j elements of T_c in the depth-first-search sequence, and w_i and v_j are the corresponding weights of the elements t̂_i and c_j, respectively. As the initial state, d_{0,0} is set to 0. ED and len denote the vanilla edit distance and the length of a sequence, respectively.

Therefore, the WED of the whole tree is calculated as follows (the details and examples are shown in Supplementary Material):

$D_{tree}(\hat{T}, T_c) = d_{|\hat{T}|,\, |T_c|}$    (5)

As shown in Fig. 4, we combine D_tree and the stroke-sequence edit distance D_stroke used in Chen et al. (2021a) to construct a more comprehensive distance metric D:

$D = D_{tree} + \beta \cdot D_{stroke}$    (6)

where β is empirically set to 1, and D_stroke is calculated as follows:

$D_{stroke} = ED\bigl( \oplus(\hat{T}_{s}),\; \oplus(T_{c,s}) \bigr)$    (7)

where ⊕ represents concatenation, and T̂_s and T_{c,s} represent the sub-sequences of T̂ and T_c that contain only strokes (see “Stroke Distance” in Fig. 4). Finally, the candidate tree with the minimum D is chosen as the rectified prediction. Please note that if the predicted tree T̂ precisely matches a candidate in the lexicon, we directly use T̂ as the rectified prediction.
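The sketch below implements a weighted edit distance of the kind described above, operating on a flattened depth-first list of (stroke_sequence, weight) pairs, where a structure node carries the concatenated stroke sequence of its leaf nodes (following Sec. 3.2). The data layout, the substitution cost, and its normalization are our assumptions rather than the authors' exact formulation.

```python
def edit_distance(a, b):
    """Vanilla edit distance between two sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]


def weighted_edit_distance(pred, cand):
    """Weighted edit distance between two flattened trees.

    pred, cand: lists of (stroke_sequence, weight) pairs in depth-first order,
    where a node at level l has weight alpha**l (alpha = 0.5 in the paper).
    The substitution cost (an assumption) scales the node weight by a
    normalized edit distance between the two stroke sequences.
    """
    m, n = len(pred), len(cand)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + pred[i - 1][1]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cand[j - 1][1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p_seq, p_w = pred[i - 1]
            c_seq, c_w = cand[j - 1]
            sub = p_w * edit_distance(p_seq, c_seq) / max(len(p_seq), 1)
            d[i][j] = min(d[i - 1][j] + p_w,      # delete predicted element
                          d[i][j - 1] + c_w,      # insert candidate element
                          d[i - 1][j - 1] + sub)  # substitute element
    return d[m][n]
```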

Matching Since the rectified tree may correspond to many characters (i.e., the one-to-many problem), we also utilize a Siamese architecture, as designed in Chen et al. (2021a), to match the rectified prediction with a specific character to further improve the performance. We collect a confusable set containing those tree sequences that can match more than one character (see Fig. 3). Compared with Chen et al. (2021a), our method contains a smaller confusable set since we take both radical structures and strokes into consideration for representing Chinese characters (details are shown in Supplementary Materials). Concretely, among 3,755 Level-1 characters, there are 280 confusable characters for the stroke-based approach and only 111 confusable characters for our radical-structured tree representation.

In the test phase, if the rectified tree matches exactly one character (i.e., it is not in the confusable set), that character is directly taken as the final prediction. Otherwise, we compare the feature maps of the input image with the features of the support samples sharing the same tree representation. Specifically, each support sample is fed into the same encoder to yield a list of feature maps, one per candidate character. We then calculate the cosine similarity between the input feature maps and the feature maps of each candidate, and choose the candidate with the maximum feature similarity as the final prediction.
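A rough sketch of this matching step; the tensor shapes and the flattening of the feature maps before computing cosine similarity are our assumptions.

```python
import torch
import torch.nn.functional as F

def match_confusable(query_feats: torch.Tensor,
                     support_feats: torch.Tensor) -> int:
    """Pick the support candidate whose encoder features are most similar.

    query_feats:   (C, H, W) feature maps of the test image.
    support_feats: (K, C, H, W) feature maps of K support samples that share
                   the same RSST representation.
    Returns the index of the best-matching candidate.
    """
    q = query_feats.flatten().unsqueeze(0)      # (1, C*H*W)
    s = support_feats.flatten(start_dim=1)      # (K, C*H*W)
    sims = F.cosine_similarity(q, s, dim=1)     # (K,)
    return int(torch.argmax(sims))
```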

Figure 5: Examples of the character “You” in each dataset.

4 Experiments

4.1 Datasets and Experimental Settings

Datasets We conduct the experiments with the following datasets (examples of each dataset are shown in Fig. 5).

  • HWDB1.0-1.1 Liu et al. (2013) comprises 2,678,424 handwritten Chinese character images written by 720 writers. It contains 3,881 character classes covering all the 3,755 Level-1 characters.

  • ICDAR2013 Yin et al. (2013) is collected from 60 writers (different from the writers of HWDB1.0-1.1 Liu et al. (2013)), containing 224,419 offline handwritten images with 3,755 classes.

  • Printed Artistic Characters Chen et al. (2021a) collects 105 printed artistic fonts to yield 394,275 images for 3,755 classes.

  • CTW Yuan et al. (2019) is collected from the street view. It has 760,107 samples for training and 52,765 samples for testing. These samples contain 3,650 classes.

  • Occluded and Blurred Characters are constructed by applying three degrees (Easy, Medium, and Hard) of occlusion and blur to ICDAR2013 Yin et al. (2013). The details are shown in the Supplementary Material.

  • Support Samples are generated by two widely-used fonts including Simsun and Simfang (not in the above-mentioned artistic fonts).

Details of Zero-Shot Settings We follow Chen et al. (2021a) to construct the datasets for the zero-shot settings. The concrete choices of the training source, the testing source, and the alphabet will be detailed in Sec. 4.3 and Sec. 4.4.

Character Zero-Shot Setting consists of several sub-settings controlled by the parameter m. Concretely, we collect the samples of the first m classes in the alphabet from the training source dataset for training, and the samples of the last k classes in the alphabet from the testing source dataset for testing.

Radical Zero-Shot Setting consists of several sub-settings controlled by the parameter n. According to the frequency of each radical among the 3,755 Level-1 characters, if every radical of a character appears more than n times, the samples of this character in the training source dataset are used for training; otherwise, the samples of this character in the testing source dataset are used for testing.
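A minimal sketch, under our assumptions about the data structures, of how the two splits could be built; the sample lists are hypothetical (image, character) pairs and char_to_radicals is a hypothetical lookup from a character to its radicals.

```python
from collections import Counter

def character_zero_shot_split(train_source, test_source, alphabet, m, k):
    """First m classes of the alphabet -> training pool; last k classes -> testing pool."""
    train_classes = set(alphabet[:m])
    test_classes = set(alphabet[-k:])
    train = [(img, ch) for img, ch in train_source if ch in train_classes]
    test = [(img, ch) for img, ch in test_source if ch in test_classes]
    return train, test

def radical_zero_shot_split(train_source, test_source, alphabet, char_to_radicals, n):
    """Characters whose radicals all appear more than n times go to training."""
    freq = Counter(r for ch in alphabet for r in char_to_radicals[ch])
    train_classes = {ch for ch in alphabet
                     if all(freq[r] > n for r in char_to_radicals[ch])}
    train = [(img, ch) for img, ch in train_source if ch in train_classes]
    test = [(img, ch) for img, ch in test_source if ch not in train_classes]
    return train, test
```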

m for Handwritten Character Zero-Shot
λ       500       1000      1500      2000      2755
0.01 5.67% 21.62% 32.99% 37.60% 33.93%
0.1 11.56% 21.83% 35.32% 39.22% 47.44%
1 10.99% 23.92% 30.97% 36.96% 42.56%
10 4.60% 12.63% 19.66% 24.16% 33.11%
Table 1: Choices of λ for balancing the stroke-level and radical-level losses.
Method        m for Handwritten Character Zero-Shot
              500       1000      1500      2000      2755
w/o 8.51% 17.53% 29.95% 33.52% 41.51%
w/o 10.90% 18.86% 28.20% 32.20% 39.92%
Ours 11.56% 21.83% 35.32% 39.22% 47.44%
Table 2: Contributions of D_tree and D_stroke to the overall distance metric D (each “w/o” row removes one of the two terms).
Handwritten                       m for Character Zero-Shot Setting                 n for Radical Zero-Shot Setting
                                  500      1000     1500     2000     2755         50      40      30      20      10
DenseRAN Wang et al. (2018) 1.70% 8.44% 14.71% 19.51% 30.68% 0.21% 0.29% 0.25% 0.42% 0.69%
HDE Cao et al. (2020) 4.90% 12.77% 19.25% 25.13% 33.49% 3.26% 4.29% 6.33% 7.64% 9.33%
Chen et al. Chen et al. (2021a) 5.60% 13.85% 22.88% 25.73% 37.91% 5.28% 6.87% 9.02% 14.67% 15.83%
Ours 11.56% 21.83% 35.32% 39.22% 47.44% 7.94% 11.56% 15.13% 15.92% 20.21%
Printed Artistic                  m for Character Zero-Shot Setting                 n for Radical Zero-Shot Setting
                                  500      1000     1500     2000     2755         50      40      30      20      10
DenseRAN Wang et al. (2018) 0.20% 2.26% 7.89% 10.86% 24.80% 0.07% 0.16% 0.25% 0.78% 1.15%
HDE Cao et al. (2020) 7.48% 21.13% 31.75% 40.43% 51.41% 4.85% 6.27% 10.02% 12.75% 15.25%
Chen et al. Chen et al. (2021a) 7.03% 26.22% 48.42% 54.86% 65.44% 11.66% 17.23% 20.62% 31.10% 35.81%
Ours 23.12% 42.21% 62.29% 66.86% 71.32% 13.90% 19.45% 26.59% 34.11% 38.15%
Scene                             m for Character Zero-Shot Setting                 n for Radical Zero-Shot Setting
                                  500      1000     1500     2000     3150         50      40      30      20      10
DenseRAN Wang et al. (2018) 0.15% 0.54% 1.60% 1.95% 5.39% 0% 0% 0% 0% 0.04%
HDE Cao et al. (2020) 0.82% 2.11% 3.11% 6.96% 7.75% 0.18% 0.27% 0.61% 0.63% 0.90%
Chen et al. Chen et al. (2021a) 1.54% 2.54% 4.32% 6.82% 8.61% 0.66% 0.75% 0.81% 0.94% 2.25%
Ours 1.41% 2.53% 4.59% 9.32% 13.02% 1.21% 1.29% 1.89% 2.90% 3.88%
Table 3: The experimental results of character zero-shot (left) and radical zero-shot (right) settings.

4.2 Implementation Details

We implement our method with PyTorch. The training and evaluation of all subsequent experiments are conducted on an NVIDIA RTX 2080Ti GPU with 11GB memory. The Adadelta optimizer Zeiler (2012) is utilized with the learning rate set to 0.1. The batch size is set to 8. Each input image is resized to 32 × 32 and normalized to [-1, 1]. We use only one Transformer layer in each of the FRD and RSD. Please note that no model ensemble or data augmentation strategies are used. We train the model with the proposed RSST representation for all settings. Then, in the standard settings (i.e., all settings except the zero-shot settings), the model is fine-tuned with character-level labels.
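The stated preprocessing and optimizer settings could be expressed roughly as follows in PyTorch; this is a sketch of the reported hyperparameters, not the released training script, and the single-channel normalization and the placeholder model are our assumptions.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Stated settings: inputs resized to 32x32 and normalized to [-1, 1],
# Adadelta optimizer with learning rate 0.1, batch size 8.
transform = T.Compose([
    T.Resize((32, 32)),
    T.ToTensor(),                        # maps pixels to [0, 1]
    T.Normalize(mean=[0.5], std=[0.5]),  # then to [-1, 1] (single-channel assumption)
])

model = nn.Linear(32 * 32, 10)  # placeholder; stands in for the RSST recognizer
optimizer = torch.optim.Adadelta(model.parameters(), lr=0.1)
batch_size = 8
```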

4.3 Choices of Parameters

We conduct the following experiments using HWDB1.0-1.1 Liu et al. (2013) in the character zero-shot setting. Specifically, we use HWDB1.0-1.1 as the training source and ICDAR2013 as the testing source. The alphabet is set to the 3,755 Level-1 characters and the number of testing classes k is set to 1,000. Please note that all experiments in Sec. 4.4 are based on the same selected hyperparameters.

Choice of λ To look into the weight λ in the overall loss, we explore λ over {0.01, 0.1, 1, 10}. As shown in Tab. 1, when λ = 0.1, the performance surpasses its counterparts in most cases. Interestingly, we observe that radical structures are easier to distinguish than strokes since the radical structures are visually apparent. Hence, we give the radical-structure loss the lower weight of 0.1 in the following experiments.

Choices of D_tree and D_stroke We investigate the effectiveness of D_tree and D_stroke in the overall distance metric D. As shown in Tab. 2, both terms play crucial roles in the matching process. Specifically, with both terms, our method outperforms the previous SOTA methods by almost 10% when m is set to 2,755. When D_tree or D_stroke is removed from the distance metric, the performance declines by a large margin; even so, the declined performance is still better than that of previous methods focusing on a single-level representation.

4.4 Experimental Results

We conduct experiments on the abovementioned datasets. Visualizations of experimental results are shown in Supplementary Materials.

Experiments on Handwritten Characters We use HWDB1.0-1.1 as the training source and ICDAR2013 as the testing source. We set the alphabet to the 3,755 Level-1 characters and the number of testing classes k to 1,000. As shown in the top part of Tab. 3, the proposed method outperforms the previous methods by a large margin in both the character zero-shot and radical zero-shot settings. More specifically, when m is set to 2,755, the performance of our method is 9.53% better than the stroke-based approach Chen et al. (2021a) and 16.76% better than the radical-based approach Wang et al. (2018). Although the various handwriting styles may result in visual distribution changes, our method can robustly overcome this visual discrepancy even if the characters or radicals are absent from the training dataset.

Method               Occluded Characters   Blurred Characters
CharRep 33% 50%
CharRep + DataAug 38% 69%
Ours + CharRep 44% 74%
Table 4: The experimental results on occluded and blurred character images collected from the CTW dataset.
Figure 6: Examples of the occluded and blurred character images selected from the CTW dataset.

Experiments on Printed Artistic Characters We use the whole printed artistic dataset as both the training and testing source for the zero-shot settings. We set the alphabet to the 3,755 Level-1 characters and the number of testing classes k to 1,000. The image quality of this dataset is relatively high (e.g., the strokes are usually easy to distinguish) compared with the other datasets. As the experimental results show, the proposed method outperforms the previous methods by a larger margin, ranging from 5.88% to 15.99% in the character zero-shot setting and from 2.22% to 5.97% in the radical zero-shot setting (see the middle rows of Tab. 3), which further validates the superiority of our RSST representation.

Method                            Occluded Characters            Blurred Characters             None
                                  Hard     Medium   Easy         Hard     Medium   Easy
DenseRAN Wang et al. (2018) 30.77% 39.16% 53.17% 20.94% 44.70% 58.15% 96.66%
HDE Cao et al. (2020) 18.25% 24.98% 37.69% 20.91% 36.64% 54.88% 97.14%
Chen et al. Chen et al. (2021a) 29.63% 35.99% 57.62% 20.17% 46.28% 59.22% 96.74%
CharRep 28.63% 37.93% 55.37% 21.18% 45.11% 59.28% 96.17%
Ours 29.84% 38.86% 60.12% 23.91% 47.36% 59.43% 96.05%
Ours + CharRep 41.59% 52.84% 70.45% 30.17% 52.70% 64.07% 97.42%
Table 5: The experimental results in the occluded and blurred character settings. “None” denotes that no blur or occlusion is applied to the testing dataset. “CharRep” denotes that the model is trained or fine-tuned with the character-level representation. More experimental results of other character-based methods are shown in the Supplementary Material.

Experiments on Scene Characters We use the whole CTW dataset as both the training and testing source for the zero-shot settings. We sort the 3,650 characters contained in CTW according to their positions in the national standard GB18030-2005 to construct the alphabet, and set the number of testing classes k to 500. Different from the handwritten and printed artistic datasets, most samples in CTW have low resolution and complicated backgrounds. As shown in the bottom part of Tab. 3, the proposed method outperforms the previous methods in most cases. Generally, the performance on scene characters still has room for improvement, since scene character images are often accompanied by complex backgrounds, blurring, and occlusions.

Experiments on Occluded and Blurred Characters We train on HWDB1.0-1.1 and test on ICDAR2013 with different degrees of blur and occlusion. As shown in Tab. 5, we observe that our model outperforms the other methods, which validates the strong ability of the proposed RSST representation. Moreover, after the model is pre-trained with the RSST representation, we fine-tune it with character-level labels to further take advantage of the merits of the character level. Specifically, we simply enlarge the output size of the FRD and set the maximum output length to 1 to directly generate the character-level predictions. We observe that the fine-tuned model is more robust in these challenging scenarios. Moreover, its performance is much better than that of the character-based approach (refer to “CharRep” in Tab. 5), benefiting from the multi-level knowledge. To evaluate the performance on occluded and blurred characters with real-world deterioration, we manually collect 100 blurred and 100 occluded samples from the CTW dataset (examples are shown in Fig. 6), which are derived from real-world scenes. As shown in Tab. 4, after the model is pre-trained with the proposed RSST representation, the performance improves from 33% to 44% (an 11% gain) on the blurred samples and from 50% to 74% (a 24% gain) on the occluded samples. In contrast, when using data augmentation, the improvement is only 5% on the blurred samples and 19% on the occluded samples.

5 Discussions

Failure Cases Some failure cases are shown in Fig. 7. We observe that connected strokes bring difficulties to our method. For the handwritten character “Tu”, the left-falling and right-falling strokes at the end of the sequence are merged together, which confuses our model and yields a wrong prediction. This situation also exists in the blurred character “Nao” and the artistic character “Tuo”. In particular, connected strokes are even more difficult to identify when blurred. Oblique characters also hamper the performance of our model (refer to the scene character “Le”). This is reasonable since the proposed RSD concentrates on the stroke level and is thus sensitive to the rotation angle of the character. For the occluded character “Chi”, our method mistakenly recognizes the character as “Jin” since the right-falling stroke at the end of the sequence is occluded.

Time Efficiency For a fair comparison, we set the batch size to 32 and use ResNet-34 as the encoder and a Transformer as the decoder to compare the time efficiency. Specifically, we take the average time over 100 iterations in the test stage. Through the experiments, the methods focusing on a single level cost less time (0.06s, 0.33s, and 0.65s for the character-based, radical-based, and stroke-based methods, respectively) than our method (1.25s). Since our method hierarchically decomposes characters at the radical and stroke levels to generate more robust tree representations, we sacrifice part of the time efficiency in pursuit of better recognition performance. Furthermore, the time efficiency can be further optimized, since the features of all characters in the confusable set can be computed in advance. It is worth mentioning that in the standard settings we only use the proposed RSST representation to pre-train the model and then fine-tune it at the character level, i.e., the time cost of our method in the standard settings is the same as that of the character-based methods.

Figure 7: Visualizations of some failure cases.

6 Conclusions

In this paper, we propose a new representation called the Radical-Structured Stroke Tree (RSST) to tackle CCR. A decomposition framework is put forward to generate the radical structures and the strokes of each radical, which are further organized into an RSST. Furthermore, we propose a Tree-to-Character Translator with the designed Weighted Edit Distance to match a specific tree in the RSST lexicon. The experimental results show that our method outperforms the SOTA single-level methods by increasing margins as the distribution difference becomes more severe in the blurring, occlusion, and zero-shot scenarios, validating the superiority of the proposed RSST representation.

References

  • Z. Cao, J. Lu, S. Cui, and C. Zhang (2020) Zero-shot handwritten chinese character recognition with hierarchical decomposition embedding. Pattern Recognition 107, pp. 107488. Cited by: Table 8, Appendix E, Appendix G, §2.2, §3.4, Table 3, Table 5.
  • J. Chen, B. Li, and X. Xue (2021a) Zero-shot chinese character recognition with stroke-level decomposition. In IJCAI, Cited by: Appendix B, Table 8, Appendix F, §1, §2.2, §3.1, §3.4, §3.4, §3.4, 3rd item, §4.1, §4.4, Table 3, Table 5.
  • J. Chen, H. Yu, J. Ma, M. Guan, X. Xu, X. Wang, S. Qu, B. Li, and X. Xue (2021b) Benchmarking chinese text recognition: datasets, baselines, and an empirical study. arXiv preprint arXiv:2112.15093. Cited by: §1.
  • C. Cheng, X. Zhang, X. Shao, and X. Zhou (2016) Handwritten chinese character recognition by joint classification and similarity ranking. In ICFHR, pp. 507–511. Cited by: §2.2.
  • D. Cireşan and U. Meier (2015) Multi-column deep neural networks for offline handwritten chinese character classification. In IJCNN, pp. 1–6. Cited by: §2.2.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2021) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §3.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: Table 8, §3.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, Vol. 25, pp. 1097–1105. Cited by: §2.2.
  • Z. Li, N. Teng, M. Jin, and H. Lu (2018) Building efficient cnn architecture for offline handwritten chinese character recognition. International Journal on Document Analysis and Recognition 21 (4), pp. 233–240. Cited by: §2.2.
  • C. Liu, F. Yin, D. Wang, and Q. Wang (2013) Online and offline handwritten chinese character recognition: benchmarking on new databases. Pattern Recognition 46 (1), pp. 155–162. Cited by: Appendix E, 1st item, 2nd item, §4.3.
  • H. M. Meng, W. Lo, B. Chen, and K. Tang (2001) Generating phonetic cognates to handle named entities in english-chinese cross-language spoken document retrieval. In ASRU Workshop, pp. 311–314. Cited by: §1.
  • S. Pal, U. Pal, and M. Blumenstein (2012) Off-line english and chinese signature identification using foreground and background features. In IJCNN, pp. 1–7. Cited by: §1.
  • B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai (2018) Aster: an attentional scene text recognizer with flexible rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9), pp. 2035–2048. Cited by: §3.1.
  • J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. In NeurIPS, Cited by: §2.2.
  • Y. Tang, B. Wu, L. Peng, and C. Liu (2017) Semi-supervised transfer learning for convolutional neural network based chinese character recognition. In ICDAR, Vol. 1, pp. 441–447. Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §3.1.
  • T. Wang, Z. Xie, Z. Li, L. Jin, and X. Chen (2019) Radical aggregation network for few-shot offline handwritten chinese character recognition. Pattern Recognition Letters 125, pp. 821–827. Cited by: Table 8, §1.
  • W. Wang, J. Zhang, J. Du, Z. Wang, and Y. Zhu (2018) Denseran for offline handwritten chinese character recognition. In ICFHR, pp. 104–109. Cited by: Table 8, §1, §2.2, §4.4, Table 3, Table 5.
  • Z. Wang and J. Du (2021) Joint architecture and knowledge distillation in cnn for chinese text recognition. Pattern Recognition 111, pp. 107722. Cited by: §1.
  • X. Xiao, L. Jin, Y. Yang, W. Yang, J. Sun, and T. Chang (2017) Building fast and compact convolutional neural networks for offline handwritten chinese character recognition. Pattern Recognition 72, pp. 72–81. Cited by: §1.
  • Y. Xiao, D. Meng, C. Lu, and C. Tang (2019) Template-instance loss for offline handwritten chinese character recognition. In ICDAR, pp. 315–322. Cited by: Table 8, Appendix E, §2.2.
  • L. Xiaobo, L. Xiaojing, and H. Wei (2003) Vehicle license plate character recognition. In ICNNSP, Vol. 2, pp. 1066–1069. Cited by: §1.
  • Q. Xu, X. Bai, and W. Liu (2019) Multiple comparative attention network for offline handwritten chinese character recognition. In ICDAR, pp. 595–600. Cited by: §2.2.
  • L. Yang, P. Wang, H. Li, Z. Li, and Y. Zhang (2020) A holistic representation guided attention network for scene text recognition. Neurocomputing 414, pp. 67–75. Cited by: Appendix A.
  • W. Yang, L. Jin, and M. Liu (2015) Chinese character-level writer identification using path signature feature, dropstroke and deep cnn. In ICDAR, pp. 546–550. Cited by: §1.
  • F. Yin, Q. Wang, X. Zhang, and C. Liu (2013) ICDAR 2013 chinese handwriting recognition competition. In ICDAR, pp. 1464–1470. Cited by: Appendix D, Table 8, Appendix E, 2nd item, 5th item.
  • T. Yuan, Z. Zhu, K. Xu, C. Li, T. Mu, and S. Hu (2019) A large chinese text dataset in the wild. JCST 34 (3), pp. 509–521. Cited by: Appendix E, 4th item.
  • M. D. Zeiler (2012) Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §4.2.
  • X. Zhang, Y. Bengio, and C. Liu (2017) Online and offline handwritten chinese character recognition: a comprehensive study and new benchmark. Pattern Recognition 61, pp. 348–360. Cited by: §1, §2.2.
  • Z. Zhong, X. Zhang, F. Yin, and C. Liu (2016) Handwritten chinese character recognition with spatial transformer and deep residual networks. In ICPR, pp. 3440–3445. Cited by: §2.2.
  • Z. Zhong, L. Jin, and Z. Xie (2015) High performance offline handwritten chinese character recognition using googlenet and directional feature maps. In ICDAR, pp. 846–850. Cited by: Table 8.
  • M. Zhou, X. Zhang, F. Yin, and C. Liu (2016) Discriminative quadratic feature learning for handwritten chinese character recognition. Pattern Recognition 49, pp. 7–18. Cited by: §2.2.
  • X. Zu, H. Yu, B. Li, and X. Xue (2022) Chinese character recognition with augmented character profile matching. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, pp. 6094–6102. External Links: ISBN 9781450392037, Link Cited by: §1.

Appendix A Configurations for Transformer Layer

In this section, we introduce the configurations of the Transformer layer used in both the Feature-to-Radical Decoder and Radical-to-Stroke Decoder. The Transformer layer consists of three modules, including the masked multi-head attention module (Masked MHA), the multi-head attention module (MHA), and the feed-forward module, the architecture of which is shown in Fig. 8. Following Yang et al. (2020), we only use one Transformer layer for each decoder, and some hyperparameters of this layer are shown in Tab. 6.
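Assuming the layer follows the standard Transformer decoder design (as Fig. 8 suggests), the configuration in Tab. 6 could be instantiated roughly as follows; the feed-forward width is our assumption, since it is not listed in the table.

```python
import torch.nn as nn

# One decoder block, 4 heads, embedding size 1024 (Tab. 6); the feed-forward
# dimension is an assumption, as it is not reported in the paper.
decoder_layer = nn.TransformerDecoderLayer(d_model=1024, nhead=4,
                                           dim_feedforward=2048,
                                           batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)
```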

Figure 8: The architecture of the Transformer layer. It follows the basic design of the original Transformer decoder.
Hyperparameter Value
The number of decoder blocks 1
Number of heads 4
Dimensionality of positional encoding 1024
Dimensionality of embedding 1024
Table 6: Hyperparameters and the corresponding values in the Transformer layer.

Appendix B Comparison between Edit Distance and Weighted Edit Distance

As shown in Fig. 9, Chen et al. Chen et al. (2021a) simply choose the first candidate “Xu” as the final prediction when the plain edit distance is employed as the metric. When the radical structures are taken into account by the proposed Weighted Edit Distance, the more reasonable character “Dan” is chosen instead.

Figure 9: Our metric can robustly rectify the predicted sequence as it takes both the radical structures and strokes into account.

Appendix C Supervision for Feature-to-Radical Decoder

The Feature-to-Radical Decoder (FRD) aims to generate the radical structures and the position of each radical (i.e., via the generated attention maps). Due to the severe class imbalance problem induced by the radical-level decomposition, the FRD struggles to recognize low-frequency radicals and fails to perceive their positions if we use the explicit radicals for supervision. Therefore, we replace the explicit radicals with their orders of appearance in the radical sequence for implicit supervision. We conduct experiments to explore the impact of explicit radical supervision versus implicit order supervision. As shown in Tab. 7, when using explicit radicals for supervision, the performance declines by a large margin, which stems from the drifted attention maps of the radicals (see Fig. 10).

Figure 10: Comparison between implicit supervision and explicit supervision. When using the implicit orders for supervision, the attention maps can better correspond to the position of radical-structure or radical at each time step.
Supervision             m for Handwritten Character Zero-Shot
                        500       1000      1500      2000      2755
Explicit Supervision 5.43% 12.35% 19.56% 27.43% 32.00%
Implicit Supervision 11.74% 22.91% 36.33% 40.27% 48.00%
Table 7: Performance comparison between explicit and implicit supervision.

Appendix D Preparation for Occluded and Blurred Character Datasets

In this section, we introduce the construction of the occluded and blurred character datasets based on ICDAR2013 Yin et al. (2013). For the occluded character datasets, we apply different degrees (Easy, Medium, and Hard) of occlusion to the samples. Specifically, we mask the character image with a rectangular block of area s pixels, where s ranges over {64, 144, 256} for Easy, Medium, and Hard, respectively. For the blurred character datasets, we utilize three types of blur operations: Median Blur, Gaussian Blur, and Motion Blur (an example of each blur operation is shown in Fig. 11). Concretely, we generate the Easy blurred dataset by applying Median Blur, the Medium one by applying Median Blur and Gaussian Blur, and the Hard one by applying all three blur operations. Some examples of each occluded and blurred character dataset are shown in Fig. 12.
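A rough OpenCV sketch of how such corruptions could be generated; the square block shape, its random placement and color, and the blur kernel sizes are our assumptions beyond the stated block areas and blur types.

```python
import numpy as np
import cv2

BLOCK_AREA = {"easy": 64, "medium": 144, "hard": 256}  # occluded area in pixels, as stated

def occlude(img: np.ndarray, level: str, rng: np.random.Generator) -> np.ndarray:
    """Mask a random square block whose area matches the chosen difficulty."""
    side = int(np.sqrt(BLOCK_AREA[level]))
    h, w = img.shape[:2]
    y = rng.integers(0, max(h - side, 1))
    x = rng.integers(0, max(w - side, 1))
    out = img.copy()
    out[y:y + side, x:x + side] = 0  # black block (color choice is an assumption)
    return out

def blur(img: np.ndarray, level: str) -> np.ndarray:
    """Easy: median; Medium: median + Gaussian; Hard: median + Gaussian + motion."""
    out = cv2.medianBlur(img, 3)
    if level in ("medium", "hard"):
        out = cv2.GaussianBlur(out, (3, 3), 0)
    if level == "hard":
        kernel = np.zeros((5, 5), dtype=np.float32)
        kernel[2, :] = 1.0 / 5.0          # horizontal motion-blur kernel
        out = cv2.filter2D(out, -1, kernel)
    return out
```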

Figure 11: The examples of the three blur operations, including Median Blur, Gaussian Blur, and Motion Blur.
Figure 12: Some examples of occluded and blurred datasets. “Easy”, “Medium”, and “Hard” represent the different degrees of the occluded or blurred noise.

Appendix E More Experimental Results of Non-Zero-Shot Settings

We conduct more experiments in the non-zero-shot settings, where all categories in the testing dataset appear in the training dataset. For the non-zero-shot setting of handwritten characters, we use HWDB1.0-1.1 Liu et al. (2013) as the training dataset and ICDAR2013 Yin et al. (2013) as the testing dataset. For the non-zero-shot setting of scene characters, we use the characters collected from the training and testing splits of CTW Yuan et al. (2019) as the training and testing datasets, respectively. In the non-zero-shot settings, we adopt the same training strategy as in the occluded and blurred settings, where the model is first pre-trained with the radical-structured stroke tree representation and then fine-tuned with the character-level representation. As shown in Tab. 8, our method achieves comparable performance in the non-zero-shot settings of both handwritten and scene characters. Please note that most of the previous methods utilize additional strategies to achieve better performance (e.g., HDE Cao et al. (2020) uses data augmentation, and Template+Instance Xiao et al. (2019) adds printed characters as prior knowledge).

Method ICDAR2013 CTW
Human Performance Yin et al. (2013) 96.13% -
HCCR-GoogLeNet Zhong et al. (2015) 96.35% -
ResNet152 He et al. (2016) - 80.94%
DenseRAN Wang et al. (2018) 96.66% 85.56%
FewshotRAN Wang et al. (2019) 96.97% 86.78%
HDE Cao et al. (2020) 97.14% 89.25%
Chen et al. Chen et al. (2021a) 96.74% 85.90%
Template+Instance Xiao et al. (2019) 97.45% -
CharRep 96.17% 75.79%
Ours 96.05% 80.25%
Ours + CharRep 97.42% 86.28%
Table 8: More experimental results in the non-zero-shot settings of handwritten and scene characters. “CharRep” denotes that we use the character-level representation to train or fine-tune our model. Our method achieves comparable performance in the non-zero-shot settings.

Appendix F Size of Confusable Character Set

As shown in Fig. 13, compared with Chen et al. (2021a), our method contains a smaller confusable set since we take both radical structures and strokes into consideration for representing Chinese characters. Among 3,755 Level-1 characters, there are 280 confusable characters for the stroke-based approach and only 111 confusable characters for our radical-structured tree representation.

Figure 13: Compared with the stroke-based representation, the proposed RSST representation can mitigate the one-to-many problem.

Appendix G Weighted Edit Distance for Radical-Structured Stroke Trees

In this section, we introduce the design of the proposed Weighted Edit Distance. Inspired by HDE Cao et al. (2020), given the predicted radical-structured stroke tree, we assign a hierarchically attenuated weight to each node in the tree. Specifically, the weight of a node in the l-th layer is α^l, where α is empirically set to 0.5. For the stroke nodes (i.e., the leaf nodes), we evenly distribute the weight of the node over its strokes. We define d_{i,j} as the distance between the first i elements of the predicted tree T̂ and the first j elements of the candidate tree T_c in the depth-first-search sequence. When calculating d_{i,j}, the smallest of the following three cases is selected as the final value: (1) the sum of d_{i-1,j} and the weight of the i-th element of T̂; (2) the sum of d_{i,j-1} and the weight of the j-th element of T_c; (3) the sum of d_{i-1,j-1} and the weighted edit distance between the i-th element of T̂ and the j-th element of T_c, which is computed by:

$\phi(\hat{t}_i, c_j) = w_i \cdot \tfrac{ED(\hat{t}_i,\, c_j)}{len(\hat{t}_i)}$    (8)

where w_i is the weight of the i-th element of T̂, and ED and len denote the vanilla edit distance and the length of a sequence, respectively. Therefore, we define the state transition equation for dynamic programming as follows:

$d_{i,j} = \min\bigl( d_{i-1,j} + w_i,\;\; d_{i,j-1} + v_j,\;\; d_{i-1,j-1} + \phi(\hat{t}_i, c_j) \bigr)$    (9)

where v_j is the weight of the j-th element of T_c. An example of calculating the Weighted Edit Distance is shown in Fig. 14.

Figure 14: An example of calculating the Weighted Edit Distance between the prediction and the candidate.