A High-Performance CNN Method for Offline Handwritten Chinese Character Recognition and Visualization

12/30/2018, by Pavlo Melnyk, et al.

Recent research has introduced fast, compact and efficient convolutional neural networks (CNNs) for offline handwritten Chinese character recognition (HCCR). However, much of it does not address the problem of network interpretability. We propose a new deep CNN architecture with high recognition performance which is capable of learning deep features for visualization. A special characteristic of our model is the bottleneck layers, which enable us to retain its expressiveness while reducing the number of multiply-accumulate operations and the required storage. We introduce a modification of global weighted average pooling (GWAP), namely global weighted output average pooling (GWOAP). This paper demonstrates how both allow us to calculate class activation maps (CAMs) in order to indicate the most relevant input character image regions used by our CNN to identify a certain class. Evaluating on the ICDAR-2013 offline HCCR competition dataset, we show that our model enables a relative 0.83% error reduction at the same computational cost compared to the current state-of-the-art single-network method trained only on handwritten data. Our solution outperforms even recent residual learning approaches.


1 Introduction

With the rapid development of deep learning technologies, many pattern recognition tasks have seen considerable improvements. These tasks vary significantly, from object detection and image generation to article spinning and poetry generation. Text recognition is another good example of a task in which deep learning algorithms learn discriminative representations.

Text recognition at the character level can be divided into printed and handwritten character recognition. Applications of handwritten character recognition include automatic recognition of medical forms, processing of other administrative documents, postal mail sorting automation, and bank check identification. Handwritten character recognition may further be either offline or online.

In this regard, the problem of offline handwritten Chinese character recognition (HCCR) (Liu et al. 2013), which has been studied for more than half a century, is of particular interest. Earlier successful methods for offline HCCR, such as modified quadratic discriminant functions (MQDF) (Kimura et al. 1987), were significantly outperformed by convolutional neural network (CNN) approaches. Notably, hybrid methods such as those utilizing adversarial feature learning (Zhang et al. 2018) or an attention-based recurrent neural network (RNN) for iterative refinement of the predictions (Yang et al. 2017) now appear to be the next effective substitute for traditional CNN solutions (Cireşan et al. 2012; Yin et al. 2013; Zhong et al. 2015; Zhang et al. 2017).

In our study, by contrast, we build a method based on a pure CNN architecture with high recognition performance, while keeping its size and computational cost in mind. Although various data augmentation, feature handcrafting, and spatial-transformation techniques have been successfully utilized for offline HCCR, we refrain from using them in order to focus our work mainly on finding optimal hyperparameters for training only on the raw handwritten input data.

One of the main shortcomings of CNNs is their limited interpretability (Qin et al. 2018). This question is especially interesting to us in the context of such a large-scale classification problem as offline HCCR. In this domain, both low-level visual features, such as small strokes, and their high-level structural concatenations are important for making correct predictions (Yang et al. 2017).

In order to address this issue, we adopt class activation maps (CAMs) (Zhou et al. 2016). We demonstrate how they improve network interpretability by visualizing the most relevant character parts learned by the network. Unlike the visualization of network layer outputs, as done in the context of offline HCCR by Zhang (2015), exploiting CAMs allows us to understand the process from beginning to end.

The main contributions of our work are summarized as follows: 1) we propose a CNN model for offline HCCR, which achieves state-of-the-art accuracy for single-network methods trained only on handwritten data; 2) we employ modified versions of global average pooling (GAP), namely global weighted average pooling (GWAP) and the newly introduced global weighted output average pooling (GWOAP), to obtain high performance and accomplish the visualization.

The rest of the paper is organized as follows: Sect. 2 reviews related research; Sect. 3 describes the proposed architecture and its effectiveness, introduces GWOAP, our modification of GWAP, and details how CAMs can be computed when a network is equipped with either; Sect. 4 shares the implementation details and the results of our experiments, including a comparison with other methods; Sect. 5 summarizes our work.

2 Related Work

2.1 Offline HCCR

The reasons why HCCR is a non-trivial problem can be mainly formulated as follows:

1) writing variations;

2) large vocabulary - the number of character classes ranges from 6763 to 70244 in the GB2312-80 and GB18030-2005 standards, respectively;

3) similarities between Chinese characters.

Nowadays, advances in deep learning enable researchers to successfully utilize CNNs in the HCCR domain (Zhong et al. 2015; Cheng et al. 2016; Zhong et al. 2016; Xiao et al. 2017; Li et al. 2018), greatly outperforming MQDF methods (Kimura et al. 1987; Lu et al. 2015). The first such CNN was applied to this problem by Cireşan et al. (2012). Their single multi-column deep neural network (MCDNN) achieves a 94.47% accuracy.

Later works were evaluated on the ICDAR-2013 competition (Yin et al. 2013) dataset containing 3755 character classes which corresponds to the key official character set GB2312-80 level-1.

There is a very noticeable trend in the offline HCCR competition: the better deep CNNs perform, the more aspects researchers consider for their models. For instance, the Fujitsu research team created a CNN-based method and won the ICDAR-2013 competition with an accuracy of 94.77% (Yin et al. 2013), while requiring as much as 2460MB of storage.

The first model to outperform human-level performance was introduced by Zhong et al. (2015); it incorporates traditional directional feature maps. Their single HCCR-Gabor-GoogLeNet and ensemble HCCR-Ensemble-GoogLeNet-10 models achieve recognition accuracies of 96.35% and 96.74%, with sizes of 27.77MB and 270.0MB, respectively.

Cheng et al. (2016) showed how the combination of character classification and similarity ranking supervisory signals increases inter-class variation and reduces intra-class variation. Their single deep CNN achieves a 97.07% accuracy while taking 36.80MB of storage. An ensemble of four such networks performs better, at 97.64%.

Zhong et al. (2016) introduced a network composed of two parts: a spatial transformer network for rectifying the input image and a deep residual network for predicting the label distribution of the rectified image, which resulted in an accuracy of 97.37% with 92.30MB of storage required.

Zhang et al. (2017) used the traditional normalization-cooperated direction-decomposed feature map (DirectMap) along with deep CNNs to obtain an accuracy of 96.95%, further improved to 97.37% by introducing an adaptation layer aimed at reducing the mismatch between training and test samples on a particular source layer. Both models have a size of 23.50MB. It takes 1.997ms to calculate DirectMap and 296.894ms to perform a forward pass of the deep CNN when processing a character image on a CPU.

Yang et al. (2017) proposed a hybrid approach that uses residual blocks and iterative refinement of the model predictions by means of an attention-based RNN. They achieved an accuracy of 97.37%, outperforming previous methods that used raw input data.

A fast and compact CNN was developed by Xiao et al. (2017), with a speed of 9.7ms per character on a CPU, only 2.3MB of required storage, and an accuracy of 97.09%. This was enabled by employing global supervised low-rank expansion (GSLRE) and adaptive drop-weight (ADW) techniques. In their experiments, one of the baseline models, with a size of 48.7MB, yielded a state-of-the-art accuracy of 97.59% among single-network methods trained only on handwritten data.

Another network well balanced between speed, size, and performance was recently introduced by Li et al. (2018). Their cascaded single-CNN model takes only 6.93ms to classify a character image on a CPU and achieves an accuracy of 97.11% while requiring only 3.3MB of storage. They accomplished this by utilizing fire modules and the proposed novel GWAP concept along with quantization.

One of the newest methods reported by Zhang et al. (2018) introduced adversarial feature learning (AFL), which significantly outperforms traditional deep CNN approaches by exploiting writer-independent semantic features with the prior knowledge of standard printed characters, resulting in a 98.29% test set accuracy and an 18.2MB model size.

2.2 Class Activation Maps

Zhou et al. (2016) presented a method of generating CAMs, showing how GAP, proposed by Lin et al. (2013), enables a CNN trained for object recognition to perform object localization. Such a technique indicates the regions of an input image that are most important for classification. The main idea lies in basic knowledge of CNN structure: as we move deeper, the height and width of the feature maps shrink, while the number of channels increases.

GAP, used instead of the traditional fully connected layer at the end of the network, produces the spatial average of every channel of the preceding convolution layer output. The weighted sum of these values is then used to generate the final output, i.e., to perform a logistic regression. Remarkably, it is easily interpretable: one can think of a feature going into the logistic regression as a value indicating whether or not something important for classification appears in the image.

Similarly, a CAM is a weighted sum of the GAP input (the last convolution layer output), i.e., if we were to look at the image before the spatial averaging, we would know where exactly a distinctive region was. It is worth mentioning that we consider only one class when producing a CAM: the predicted class.

Let the output of the last convolution layer be a 3-D tensor $F \in \mathbb{R}^{h \times w \times c}$, the output of the GAP be a vector $f \in \mathbb{R}^{c}$, and the logistic regression weight matrix be $W \in \mathbb{R}^{c \times n}$. In order to calculate the activation map, all one needs to do is weigh the importance of each feature of F by multiplying them by the corresponding elements of the column of $W$ that connects to the predicted class output:

$$\mathrm{CAM}_{\hat{c}}(i, j) = \sum_{k=1}^{c} w_{k,\hat{c}}\, F_{i,j,k} \qquad (1)$$

where:

$h$, $w$, $c$ - height, width and number of channels of the feature map F
$n$ - total number of classes
$\hat{c}$ - the predicted class
One can notice that (1) is a dot-product between the $\hat{c}$-th weight vector of the matrix $W$ and the last conv-layer output feature map F at each spatial location. We can simply zoom the obtained CAM to the size of the input image and thus identify the image regions most relevant to the given category.
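As a rough illustration, the weighting in (1) amounts to a single contraction over the channel axis. The snippet below is a minimal sketch with names of our own choosing; F stands for the last conv-layer output and W for the classification weight matrix, with toy shapes in place of real network activations.

```python
import numpy as np

def class_activation_map(F, W, c_hat):
    """CAM per (1): weight each channel of the last conv-layer output F
    (shape h x w x c) by the column of W (shape c x n) that connects to
    the predicted class c_hat, then sum over the channel axis."""
    return F @ W[:, c_hat]                 # -> (h, w) activation map

# Toy usage with random values standing in for real activations and weights
F = np.random.rand(7, 7, 512)              # last conv-layer output
W = np.random.rand(512, 1000)              # classification weight matrix
cam = class_activation_map(F, W, c_hat=42)
print(cam.shape)                           # (7, 7); zoom to the input size for display
```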

Notably, this strategy for visualization is different from the one exploited by Yang et al. (2017). In their multi-scale residual block cascade, they introduced shortcut connections that aggregate the activations of "lower" and "higher" layers with different heights and widths but the same number of channels, and defined the aggregation operation as the union of feature vectors. The learned visual representation obtained in this way is fed into the iterative refinement module to improve the classification performance.

3 Method Description

3.1 Proposed Architecture

Figure 1: Baseline architecture of the proposed network

Inspired by the performance gain enabled by GWAP, the modification of GAP proposed by Li et al. (2018), we employ it and compare its performance with GAP and with our own modification, GWOAP. This lets us see how well a single deep CNN without residual connections can perform on the offline HCCR task, and visualize the most distinctive regions of an input character image. The corresponding networks are further referred to as Model A, Model B, and Model C, as shown in Table 1. GWAP and the proposed GWOAP are described in Sect. 3.3.

Similar to the state-of-the-art method (Xiao et al. 2017), we resize input images to 96 x 96, as smaller samples result in poorer accuracy, while bigger ones incur a higher computational cost. Every conv-layer in our network has kernels of a 3 x 3 size and a stride of 1, and retains the spatial dimensionality of its input by performing the "same" mode of padding. Batch Normalization (BN) (Ioffe and Szegedy 2015) has been a default choice for many researchers over the past few years and has been proven effective for offline HCCR (Xiao et al. 2017). Therefore, we equip every conv-layer with a BN-layer followed by a rectified linear unit (ReLU). Importantly, we do not use biases in the conv-layers, because they are redundant given the shift (location) parameters in the BN-layers.

The proposed baseline architecture (Model A) is presented in Fig. 1. It consists of 15 layers if counting only convolutional and fully-connected ones. We use average pooling layers with 3 x 3 windows and a stride of 2. The easiest way to describe our model is in terms of convolutional blocks: groups of three convolutional layers with a bottleneck in the middle. The hyperparameters are shared between the three conv-layers, except for the number of kernels in the bottleneck. We discuss the effectiveness of such conv-blocks in the next subsection.

The first two conv-layers in our model are followed by a pooling layer, and then by 4 conv-blocks separated by pooling layers. The final conv-block produces a feature map of a 6 x 6 x 448 size. It is fed into GAP, which outputs a vector of length 448, which in turn is connected to a 3755-way softmax output, where the number of units corresponds to the number of character classes considered in this work.

Remarkably, smaller sizes of the last conv-layer output result in more blurry CAMs, as we need to upsample the maps to the size of the input. Through experiments, we observe that 6 x 6 output feature maps and 96 x 96 input images represent a well-balanced trade-off between the model performance and the visual clarity of the obtained CAMs.

Layer Name | Configuration (Model A / Model B / Model C) | Output Shape
Input | 96 x 96 grayscale image | 96 x 96 x 1
Conv1 | 3 x 3 conv. 64, BN, ReLU | 96 x 96 x 64
Conv2 | 3 x 3 conv. 64, BN, ReLU | 96 x 96 x 64
AvgPool | 3 x 3 avg-pool, stride 2 | 48 x 48 x 64
Conv-Block1 | 3 x 3 conv. 96, BN, ReLU | 48 x 48 x 96
 | 3 x 3 conv. 64, BN, ReLU | 48 x 48 x 64
 | 3 x 3 conv. 96, BN, ReLU | 48 x 48 x 96
AvgPool | 3 x 3 avg-pool, stride 2 | 24 x 24 x 96
Conv-Block2 | 3 x 3 conv. 128, BN, ReLU | 24 x 24 x 128
 | 3 x 3 conv. 96, BN, ReLU | 24 x 24 x 96
 | 3 x 3 conv. 128, BN, ReLU | 24 x 24 x 128
AvgPool | 3 x 3 avg-pool, stride 2 | 12 x 12 x 128
Conv-Block3 | 3 x 3 conv. 256, BN, ReLU | 12 x 12 x 256
 | 3 x 3 conv. 192, BN, ReLU | 12 x 12 x 192
 | 3 x 3 conv. 256, BN, ReLU | 12 x 12 x 256
AvgPool | 3 x 3 avg-pool, stride 2 | 6 x 6 x 256
Conv-Block4 | 3 x 3 conv. 448, BN, ReLU | 6 x 6 x 448
 | 3 x 3 conv. 256, BN, ReLU | 6 x 6 x 256
 | 3 x 3 conv. 448, BN, ReLU | 6 x 6 x 448
GAP / GWOAP / GWAP | global avg-pool (A) / global weighted output avg-pool (B) / global weighted avg-pool (C), over spatial dims. | 448
Output | 3755-way Softmax | 3755
Table 1: The three networks proposed for offline HCCR
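For concreteness, the baseline architecture in Table 1 can be sketched in Keras, the library used for our implementation (Sect. 4.2). This is a minimal illustration only: function and layer names are ours, and training-time details such as initialization and regularization are omitted.

```python
from tensorflow.keras import layers, models

def conv_bn_relu(x, filters):
    # 3 x 3 conv, "same" padding, no bias (redundant given the BN shift term)
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def conv_block(x, filters, bottleneck):
    # three conv-layers sharing hyperparameters, with a bottleneck in the middle
    x = conv_bn_relu(x, filters)
    x = conv_bn_relu(x, bottleneck)
    return conv_bn_relu(x, filters)

def build_model_a(num_classes=3755):
    inputs = layers.Input(shape=(96, 96, 1))
    x = conv_bn_relu(inputs, 64)                # Conv1
    x = conv_bn_relu(x, 64)                     # Conv2
    for filters, bottleneck in [(96, 64), (128, 96), (256, 192), (448, 256)]:
        x = layers.AveragePooling2D(3, strides=2, padding="same")(x)
        x = conv_block(x, filters, bottleneck)
    x = layers.GlobalAveragePooling2D()(x)      # GAP -> 448-dim vector
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_model_a()
model.summary()   # roughly 6.5M parameters (cf. Table 2)
```

Models B and C are obtained by replacing the GlobalAveragePooling2D layer with GWOAP or GWAP, respectively (see Sect. 3.3).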

3.2 Effectiveness of the Convolutional Block Bottleneck

The effectiveness of bottleneck layers in the proposed model for offline HCCR is proven empirically: they allow the network to retain its expressiveness while reducing the number of multiply-accumulate operations and the required storage.

Considering a single conv-layer, let $h$ and $w$ be the size of the input feature map, $c$ be the number of channels per input feature map, $k$ be the size of the kernel of the conv-layer, and $n$, $p$, $s$ be the number of kernels, the size of zero-padding, and the stride, respectively. Then the number of multiply-accumulations (MAC) for the conv-layer can be calculated as follows:

$$\mathrm{MAC}_{\mathrm{conv}} = \left(\frac{h - k + 2p}{s} + 1\right)\left(\frac{w - k + 2p}{s} + 1\right) k^{2}\, c\, n \qquad (2)$$

Assuming that all conv-layers retain the dimensionality of their input ($s = 1$, $p = (k - 1)/2$), the total number of MAC in a conv-block of three such layers with $n$ kernels each, applied to an $h \times w \times c$ input, can be found in accordance with (2) as:

$$\mathrm{MAC}_{\mathrm{block}} = h w k^{2}\left(c n + n^{2} + n^{2}\right) = h w k^{2} n (c + 2n) \qquad (3)$$

Let the middle layer of the conv-block be a bottleneck with $m < n$ kernels that outputs an $h \times w \times m$ volume. Then the number of MAC in such a conv-block can be calculated as:

$$\mathrm{MAC}_{\mathrm{bottleneck}} = h w k^{2}\left(c n + n m + m n\right) = h w k^{2} n (c + 2m) \qquad (4)$$

Thus, the reduction in computation as well as in storage can be found as:

$$\frac{\mathrm{MAC}_{\mathrm{block}}}{\mathrm{MAC}_{\mathrm{bottleneck}}} = \frac{c + 2n}{c + 2m} \qquad (5)$$

On the other hand, utilizing bottlenecks can be viewed as a compression-decompression operation, which is a form of regularization in itself.
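To make the savings concrete, the sketch below evaluates (2)-(5) under the same assumptions (stride 1, "same" padding, square kernels) for a conv-block with the dimensions of Conv-Block4 in Table 1; the helper names are ours.

```python
def conv_mac(h, w, c, k, n):
    # MAC of a single conv-layer whose output keeps the h x w input size
    return h * w * k * k * c * n

def block_mac(h, w, c, k, n, m):
    # three-layer conv-block: n kernels in the outer layers, m in the middle;
    # m = n corresponds to a block without a bottleneck, as in (3)
    return (conv_mac(h, w, c, k, n)        # c -> n
            + conv_mac(h, w, n, k, m)      # n -> m (bottleneck)
            + conv_mac(h, w, m, k, n))     # m -> n

# Conv-Block4 of Table 1: 6 x 6 x 256 input, k = 3, n = 448, m = 256
plain = block_mac(6, 6, 256, 3, 448, 448)        # no bottleneck
bottleneck = block_mac(6, 6, 256, 3, 448, 256)   # with bottleneck
print(plain / bottleneck)   # (c + 2n) / (c + 2m) = 1152 / 768 = 1.5
```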

3.3 Obtaining Class Activation Maps with GWAP and GWOAP

GWAP (Li et al. 2018) can be simply expressed as:

$$f_k = \frac{1}{h w} \sum_{i=1}^{h} \sum_{j=1}^{w} \alpha_{i,j,k}\, F_{i,j,k} \qquad (6)$$

The proposed modification of GAP is defined as:

$$f_k = \frac{\beta_k}{h w} \sum_{i=1}^{h} \sum_{j=1}^{w} F_{i,j,k} \qquad (7)$$

where:

F - input feature map
$\alpha$ - 3-D trainable kernel of GWAP
$\beta$ - 1-D trainable kernel of GWOAP

The difference between the two modifications is the number of parameters: GWOAP scales the output of the spatial summation rather than its input, which is more in the "convolutional" manner, i.e., the learnable scaling parameters are shared channel-wise rather than defined shape-wise. It can also be seen as a form of regularization. The process of obtaining CAMs for a network equipped with either GWAP or GWOAP differs from that with GAP (1) only by one additional operation (8): the output of the last conv-layer F is scaled by either $\alpha$ or the upsampled $\tilde{\beta}$:

(8)
(9)

where:

upsampled 3-D version of
for performing a valid
3-D scaled feature map
CAM obtained 2-D class activation map
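The two pooling variants can be sketched as custom Keras layers. This is a minimal illustration assuming the averaging form of (6) and (7), with the kernels initialized to ones (Sect. 4.2) so that both layers start out equivalent to GAP; the class names are ours.

```python
import tensorflow as tf
from tensorflow.keras import layers

class GWAP(layers.Layer):
    """Global weighted average pooling (6): a 3-D trainable kernel scales
    the input of the spatial averaging."""
    def build(self, input_shape):                          # (batch, h, w, c)
        self.alpha = self.add_weight(name="alpha", shape=input_shape[1:],
                                     initializer="ones")
    def call(self, F):
        return tf.reduce_mean(self.alpha * F, axis=[1, 2])  # -> (batch, c)

class GWOAP(layers.Layer):
    """Global weighted output average pooling (7): a 1-D trainable kernel
    scales the output of the spatial averaging, shared channel-wise."""
    def build(self, input_shape):
        self.beta = self.add_weight(name="beta", shape=(input_shape[-1],),
                                    initializer="ones")
    def call(self, F):
        return self.beta * tf.reduce_mean(F, axis=[1, 2])    # -> (batch, c)
```

Either layer can replace the GlobalAveragePooling2D layer in the baseline sketch above to obtain Model C or Model B, respectively.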

We compare how well both modifications of GAP perform in the offline HCCR competition and discuss the CAMs produced by means of the proposed models for different character images in Sect. 4.4.

4 Experiments

In this section, we share the implementation details and demonstrate the effectiveness of the proposed models not only in terms of recognition performance, but also from the visualization perspective.

4.1 Datasets

In order to train the proposed networks, we use the CASIA-HWDB1.0 and CASIA-HWDB1.1 datasets (Liu et al. 2011), collected by the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), and written by 420 and 300 persons, respectively.

The overall training dataset contains 2,678,424 samples belonging to 3755 different character classes. We evaluate our models on the most common benchmark for offline HCCR – the ICDAR-2013 competition dataset (Yin et al. 2013), containing 224,419 samples written by 60 persons.

It is worth mentioning that we do not use the test set as validation data for finding hyperparameters. A validation set of 60,000 samples is randomly selected from the training data. After finding optimal settings for our models, we merge the validation and training sets and conduct further experiments.

We use raw images: the only data preprocessing we perform is normalization to a 96 x 96 size and inversion of the pixel intensity.
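A minimal preprocessing sketch, assuming 8-bit grayscale inputs; the resizing interpolation and the exact intensity scaling are implementation choices assumed here rather than specified above.

```python
import numpy as np
import cv2  # any image library with bilinear resizing works equally well

def preprocess(image):
    """Resize a grayscale character image to 96 x 96 and invert its
    pixel intensity."""
    resized = cv2.resize(image, (96, 96)).astype(np.float32)
    return 255.0 - resized    # inversion for 8-bit intensities
```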

4.2 Training Strategy

First, we shuffle the training data. The parameters of all conv-layers are set with He-Normal initialization (He et al. 2015). The classification layer weights are randomly initialized by drawing from a Gaussian distribution with a standard deviation of 0.001, while the bias term is initially set to 0. The parameters of GWAP and GWOAP are initialized with 1.

We use stochastic gradient descent (SGD) with a momentum term of 0.9 for training, which has been a common choice for CNNs proposed for this competition over the past few years. The mini-batch size is set to 256, and the maximum number of epochs is 40.

Exploiting batch normalization allows us to choose a higher learning rate. Initially, we set the learning rate to 0.1 and train our models for one epoch. Then we decrease it by a factor of 10 and keep decreasing it after every epoch in which the training accuracy stops improving.

To deal with overfitting, for all models in our work we use L2-regularization with a multiplier of 0.001 and dropout (Srivastava et al. 2014) before the softmax layer, with the probability of dropping set to 0.5. We do not use any data augmentation method for generating distorted images during training.
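The training setup above can be sketched as follows. This is an approximation with names of our own: build_model_a refers to the earlier architecture sketch, the data arrays are dummy placeholders, and the learning-rate schedule is mimicked with a ReduceLROnPlateau callback that divides the rate by 10 whenever the training accuracy stops improving.

```python
import numpy as np
from tensorflow.keras import optimizers, callbacks

model = build_model_a()   # any of the three proposed models (see earlier sketch)
model.compile(optimizer=optimizers.SGD(learning_rate=0.1, momentum=0.9),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# drop the learning rate by a factor of 10 whenever training accuracy plateaus
reduce_lr = callbacks.ReduceLROnPlateau(monitor="accuracy", factor=0.1,
                                        patience=1, verbose=1)

# dummy placeholders standing in for the CASIA-HWDB1.0-1.1 training samples
x_train = np.random.rand(512, 96, 96, 1).astype("float32")
y_train = np.random.randint(0, 3755, size=512)

model.fit(x_train, y_train, batch_size=256, epochs=40, callbacks=[reduce_lr])
```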

We implement the proposed CNNs using the Keras deep learning library (Chollet et al. 2015) with the TensorFlow (Abadi et al. 2016) backend, and conduct all experiments on an NVIDIA GeForce GTX 1080 Ti GPU with 11GB of memory. A single experiment takes 2 days on average.

4.3 Results

The performances of the proposed models are demonstrated in Table 2.

Model | Parameters | Accuracy | Accuracy Drop
A | 6,507,691 | 97.38% | 0.00045%
B | 6,508,139 | 97.55% | 0.00045%
C | 6,523,819 | 97.61% | 0.00045%
Table 2: Comparison of the proposed networks' performance. The last column shows the absolute difference between a trained model's accuracy and its accuracy when the input bias term of the softmax layer is reset to 0

Obtaining CAMs with Model A is straightforward, since it is equipped with GAP, and can be done in accordance with (1). For Model B and Model C, we use the additional operation (8). Similar to the original source (Zhou et al. 2016), we ignore the bias of the softmax layer, as it has little to no impact on the classification accuracy, as shown in the right-most column of Table 2. After that, we upsample the obtained maps to the input image size and plot them together with the input to visualize its most relevant regions. The CAMs produced by each model are shown in Fig. 2. The first two rows display one of the most confusable pairs of handwritten Chinese characters, "已" (yi) and "己" (ji). The last two rows show the most distinctive parts of "鲍" (bao) and "吗" (ma).

Method | Size (MB) | Accuracy | Ensemble | Raw Data | Ref.
Human Level Performance | n/a | 96.13% | n/a | n/a | Yin et al. (2013)
HCCR-Gabor-GoogLeNet | 27.7 | 96.35% | no | yes | Zhong et al. (2015)
HCCR-GoogLeNet-Ensemble-10 | 270.0 | 96.74% | yes (10) | yes | Zhong et al. (2015)
Residual-34 | 92.2 | 97.36% | no | yes | Zhong et al. (2016)
STN-Residual-34 | 92.3 | 97.37% | no | yes | Zhong et al. (2016)
DCNN-Similarity ranking | 36.2 | 97.07% | no | yes | Cheng et al. (2016)
Ensemble DCNN-Similarity ranking | 144.8 | 97.64% | yes (4) | yes | Cheng et al. (2016)
DirectMap + ConvNet | 23.5 | 96.95% | no | no | Zhang et al. (2017)
DirectMap + ConvNet + Ensemble-3 | 70.5 | 97.12% | yes (3) | no | Zhang et al. (2017)
DirectMap + ConvNet + Adaptation | 23.5 | 97.37% | no | no | Zhang et al. (2017)
M-RBC + IR | n/a | 97.37% | no | yes | Yang et al. (2017)
HCCR-CNN9Layer + GSLRE 4X + ADW | 2.3 | 97.09% | no | yes | Xiao et al. (2017)
HCCR-CNN12Layer + GSLRE 4X + ADW | 3.0 | 97.40% | no | yes | Xiao et al. (2017)
HCCR-CNN12Layer | 48.7 | 97.59% | no | yes | Xiao et al. (2017)
Cascaded Model (Quantization) | 3.3 | 97.11% | no | yes | Li et al. (2018)
Cascaded Model | 20.4 | 97.14% | no | yes | Li et al. (2018)
AFL | 18.2 | 98.29% | no | yes | Zhang et al. (2018)
Model A | 24.8 | 97.38% | no | yes | ours
Model B | 24.8 | 97.55% | no | yes | ours
Melnyk-Net (Model C) | 24.9 | 97.61% | no | yes | ours
Table 3: Comparison of the ICDAR-2013 offline HCCR competition methods. “Raw Data” shows whether manually preprocessed or raw input images are used. All displayed methods except for Human Level Performance include CASIA HWDB1.0-1.1 in training datasets

4.4 Comparison of the Proposed Models

Model C, utilizing GWAP, outperforms the other two. Compared to the baseline model (Model A), its number of parameters is larger by less than 0.25%, while it yields a relative performance gain of 0.24%. Although GWOAP has 36 times fewer parameters than GWAP, Model B is outperformed by Model C only by a rather small margin. Nevertheless, as the results suggest, the more attention parameters we use for global spatial averaging at the end of the network, the better it performs on unseen data.

Notably, not only do Model B and Model C yield the best classification performance, but using them also gives a comparatively more accurate mechanism for visualization, as demonstrated in Fig. 2. However, the assessment of CAMs in HCCR is rather subjective, since, unlike in the object localization task, there is no numerical measure for it.

Figure 2: Class activation maps for correctly classified character images and the respective softmax outputs. The first three columns correspond to class activation maps produced using deep features of Model A, Model B, and Model C, respectively. The last column contains original input images from the ICDAR-2013 competition dataset

4.5 Comparison with Other Methods

The comparison of the ICDAR-2013 offline HCCR competition methods is shown in Table 3.

Two of the proposed models, namely Model B and Model C, show classification performance competitive with the state-of-the-art single-CNN method (Xiao et al. 2017), while having the same computational cost of 1.2 GFLOPs (multiply-accumulate operations) and being almost twice as small. Unlike that method, we do not use a fully connected layer before the classification layer, because it would prevent us from computing CAMs.

Remarkably, even our 15-layer baseline network (Model A) outperforms some recent residual learning approaches (Zhong et al. 2016; Yang et al. 2017), even though the model proposed by Zhong et al. (2016) uses spatial transformation of the input images at the network level. It is also worth mentioning that, similar to our work, the method described by Yang et al. (2017) allows the visualization of distinctive regions of input character images. However, it utilizes a multi-scale residual block cascade that learns a hierarchy of visual features from the input for iterative refinement of the predictions.

As for the cascaded model (Li et al. 2018), which uses GWAP, it is built with efficiency criteria in mind, balancing accuracy, speed, and the number of parameters, and is not aimed at performing visualization as we do.

Unlike Cheng et al. (2016), we do not use data augmentation to generate more samples. All our networks outperform their single deep CNN by large margins while requiring 31.5% less storage. However, their ensemble of four such CNNs achieves a relatively 0.03% higher accuracy while being almost 6 times larger than the proposed Model C. It is included in our comparison to show a complete picture of the competition.

The AFL method (Zhang et al. 2018), which achieves the state-of-the-art result in the offline HCCR competition, is hard to compare to our work, since their model involves a discriminator guiding the feature extractor to learn prior knowledge of standard printed characters. In contrast, we use only handwritten data for training. Additionally, the feature extractor in their network is followed by a fully connected layer, which does not allow utilizing the visualization method exploited in our work.

5 Conclusion

In this paper, we propose a high-performance CNN architecture for offline HCCR. Model C, which performs the best among the three proposed networks, is called Melnyk-Net. To the best of our knowledge, it yields state-of-the-art accuracy for single-network methods trained only on handwritten data. Compared to the previous state-of-the-art model, Melnyk-Net is 0.02% more accurate, while having the same computational cost and requiring almost half the storage. We accomplish this by exploiting convolutional layers with bottlenecks and a variation of the global averaging operation. Importantly, Melnyk-Net, being 15 layers deep and having no residual connections, outperforms recent ResNet-based methods. Moreover, we show how utilizing GAP and its modifications, including the proposed GWOAP, enables us to calculate CAMs in order to visualize the most distinctive regions of an input character image. This improves network interpretability and can be a good tool for classification error analysis in such a large-scale recognition problem as offline HCCR. In future work, we plan to use novel compression methods to reduce the model size.

Acknowledgements.
This work is supported by the National Natural Science Foundation of China under Grant No. 61472123 and the Hunan Provincial Natural Science Foundation under Grant No. 2018JJ2064. We would like to express our gratitude to the China Scholarship Council for giving the first author the opportunity to obtain a Master's degree at Hunan University under a Chinese Government Scholarship.

References

  • Abadi et al. (2016) Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. (2016) Tensorflow: a system for large-scale machine learning. In: OSDI, vol 16, pp 265–283

  • Cheng et al. (2016) Cheng C, Zhang XY, Shao XH, Zhou XD (2016) Handwritten Chinese character recognition by joint classification and similarity ranking. In: Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on, IEEE, pp 507–511
  • Chollet et al. (2015) Chollet F, et al. (2015) Keras. https://keras.io
  • Cireşan et al. (2012) Cireşan D, Meier U, Schmidhuber J (2012) Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745
  • He et al. (2015) He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE international conference on computer vision, pp 1026–1034

  • Ioffe and Szegedy (2015) Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
  • Kimura et al. (1987) Kimura F, Takashina K, Tsuruoka S, Miyake Y (1987) Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (1):149–153
  • Li et al. (2018) Li Z, Teng N, Jin M, Lu H (2018) Building efficient CNN architecture for offline handwritten Chinese character recognition. International Journal on Document Analysis and Recognition (IJDAR) 21(4):233–240
  • Lin et al. (2013) Lin M, Chen Q, Yan S (2013) Network in network. arXiv preprint arXiv:1312.4400
  • Liu et al. (2011) Liu CL, Yin F, Wang DH, Wang QF (2011) CASIA online and offline Chinese handwriting databases. In: Document Analysis and Recognition (ICDAR), 2011 International Conference on, IEEE, pp 37–41
  • Liu et al. (2013) Liu CL, Yin F, Wang DH, Wang QF (2013) Online and offline handwritten Chinese character recognition: benchmarking on new databases. Pattern Recognition 46(1):155–162
  • Lu et al. (2015) Lu S, Wei X, Lu Y (2015) Cost-sensitive MQDF classifier for handwritten Chinese address recognition. In: Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, IEEE, pp 76–80
  • Qin et al. (2018) Qin Z, Yu F, Liu C, Chen X (2018) How convolutional neural networks see the world—A survey of convolutional neural network visualization methods. Mathematical Foundations of Computing 1(2):149–180
  • Srivastava et al. (2014) Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958
  • Xiao et al. (2017) Xiao X, Jin L, Yang Y, Yang W, Sun J, Chang T (2017) Building fast and compact convolutional neural networks for offline handwritten Chinese character recognition. Pattern Recognition 72:72–81
  • Yang et al. (2017) Yang X, He D, Zhou Z, Kifer D, Giles CL (2017) Improving Offline Handwritten Chinese Character Recognition by Iterative Refinement. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), IEEE, pp 5–10
  • Yin et al. (2013) Yin F, Wang QF, Zhang XY, Liu CL (2013) ICDAR 2013 Chinese handwriting recognition competition. In: Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, IEEE, pp 1464–1470
  • Zhang et al. (2017) Zhang XY, Bengio Y, Liu CL (2017) Online and offline handwritten Chinese character recognition: A comprehensive study and new benchmark. Pattern Recognition 61:348–360
  • Zhang (2015) Zhang Y (2015) Deep convolutional network for handwritten Chinese character recognition. Computer Science Department, Stanford University
  • Zhang et al. (2018) Zhang Y, Liang S, Nie S, Liu W, Peng S (2018) Robust offline handwritten character recognition through exploring writer-independent features under the guidance of printed data. Pattern Recognition Letters 106:20–26
  • Zhong et al. (2015) Zhong Z, Jin L, Xie Z (2015) High performance offline handwritten Chinese character recognition using googlenet and directional feature maps. In: Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, IEEE, pp 846–850
  • Zhong et al. (2016) Zhong Z, Zhang XY, Yin F, Liu CL (2016) Handwritten Chinese character recognition with spatial transformer and deep residual networks. In: Pattern Recognition (ICPR), 2016 23rd International Conference on, IEEE, pp 3440–3445
  • Zhou et al. (2016) Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2016) Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2921–2929