CNN for Offline Handwriting Chinese Character Recognition
Our Multi-Column Deep Neural Networks achieve best known recognition rates on Chinese characters from the ICDAR 2011 and 2013 offline handwriting competitions, approaching human performance.
Deep neural networks (DNN) embody the current state of the art in stationary pattern recognition, outperforming other methods on image classification, object detection, and image segmentation [1, 7]. Through output averaging, several independently trained DNN can form a Multi-Column DNN (MCDNN) with error rates 20-40% below those of a single DNN.
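The output-averaging step behind an MCDNN can be sketched as follows. This is an illustrative toy, not the authors' implementation: the column scores below are invented numbers standing in for per-class network outputs.

```python
# Toy sketch of MCDNN output averaging: average the per-class scores
# of several independently trained columns, then pick the best class.

def mcdnn_predict(column_outputs):
    """Average per-class scores across columns and return the argmax class."""
    n_cols = len(column_outputs)
    n_classes = len(column_outputs[0])
    avg = [sum(col[c] for col in column_outputs) / n_cols
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Three hypothetical columns scoring 4 classes; two of them favour class 2.
cols = [[0.1, 0.2, 0.6, 0.1],
        [0.1, 0.5, 0.3, 0.1],
        [0.0, 0.2, 0.7, 0.1]]
print(mcdnn_predict(cols))  # prints 2
```

Averaging smooths out the individual columns' uncorrelated mistakes, which is where the reported 20-40% error reduction comes from.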
In 2012, our MCDNN were the first to achieve human-competitive performance on the famous MNIST handwritten digit recognition task. Chinese handwriting, however, is much harder, as there are not 10 classes (one for each digit) but 3755.
We use several MCDNN architectures to classify handwritten Chinese characters from the dataset used at the ICDAR 2011 and 2013 competitions. All training was done prior to the competition deadline. An executable was submitted to the organizers; their test set was released only after the 2013 competition, allowing us to further verify our MCDNN.
Details can be found in the competition reports [8, 9]. The data consists of plain images (offline, no temporal information) of isolated Chinese characters (already segmented out from text). The test set was identical for both the 2011 and 2013 competitions. It contains 224419 characters written by 60 persons.
Dataset HWDB 1.1 contains characters written by 240 persons for actual training and by 60 for validation: 897758 and 223991 characters, respectively. Note that there are far more classes (3755) than samples per class (240+60).
Although Chinese has tens of thousands of different character classes, HWDB 1.1 contains only the 3755 most frequent ones. Chinese characters are graphically more complex than the 26 Latin letters, which calls for bigger images. Our experience with handwritten digits and Latin letters [3, 6] tells us that a pixel rectangular image can show enough detail for good recognition. After visually inspecting several Chinese characters rescaled to various sizes, we decided on using pixel images. Scaling is done uniformly; the biggest dimension of each character determines the scaling factor. We also place the scaled characters in the middle of pixel images, to leave room for the various deformations applied during training. Before resizing, we maximize input image contrast to get values from 0 to 255.
Our training and testing framework  is designed for already preprocessed data, that is, neither training nor testing involves preprocessing. Instead, dedicated Matlab programs are used to preprocess data whenever necessary. For Chinese characters, preprocessing is limited to rescaling the images to a fixed size, plus simple contrast maximization.
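The preprocessing described above (contrast maximization first, then uniform scaling by the biggest dimension and centering on a larger canvas) can be sketched as below. This is a plain-Python illustration, not the authors' Matlab or C++/OpenCV code; `CANVAS` and `MARGIN` are placeholder values, since the exact pixel dimensions are not stated here, and the scaling uses simple nearest-neighbour interpolation for self-containment.

```python
# Illustrative sketch of the preprocessing pipeline, with assumed sizes.

CANVAS = 48  # hypothetical canvas side, for illustration only
MARGIN = 4   # hypothetical border left free for training-time deformations

def maximize_contrast(img):
    """Linearly stretch gray values to span the full 0..255 range."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        return [[0 for _ in row] for row in img]
    return [[round(255 * (v - lo) / (hi - lo)) for v in row] for row in img]

def scale_and_center(img, canvas=CANVAS, margin=MARGIN):
    """Scale uniformly so the biggest dimension fits the usable area
    (nearest neighbour), then paste into the middle of the canvas."""
    h, w = len(img), len(img[0])
    side = canvas - 2 * margin            # usable area inside the border
    f = side / max(h, w)                  # uniform scaling factor
    nh, nw = max(1, round(h * f)), max(1, round(w * f))
    scaled = [[img[min(h - 1, int(y / f))][min(w - 1, int(x / f))]
               for x in range(nw)] for y in range(nh)]
    out = [[0] * canvas for _ in range(canvas)]
    oy, ox = (canvas - nh) // 2, (canvas - nw) // 2
    for y in range(nh):
        for x in range(nw):
            out[oy + y][ox + x] = scaled[y][x]
    return out
```

Note the order: contrast is maximized before resizing, which matches the framework's pipeline; as discussed below, the submitted executable reversed these two steps.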
ICDAR requires executables, hence we rewrote the preprocessing routines in C++, using the OpenCV library instead of writing a new scaling function. As we learned the hard way, however, the Matlab and OpenCV scaling routines do not produce exactly the same results (Fig. 1), despite using the same interpolation method. The characters in Fig. 1 look alike, but the ones preprocessed with OpenCV are much grainier. Our executable also reversed the order of scaling and contrast maximization. As a consequence, our framework was trained and validated with one preprocessing routine, while the submitted executable used a different one. Since the feedback from the organizers matched our expectations, the problem was noticed only once the test data was released after the competition.
When we applied identical preprocessing to both training and test sets, the test error dropped to 4.21%, down from our original competition result of 5.58%. All our DNN and MCDNN had their error rates reduced, by up to 2% (absolute).
Since we had also submitted an executable with the same flawed preprocessing to the 2011 competition, which used the same test data, we rechecked the 2011 result and again obtained a 2.04% (absolute) lower error rate: 5.78% instead of 7.82%.
Despite the flawed preprocessing we won the 2011 competition. But we lost the 2013 competition by 0.35%, coming in 2nd at 5.58% vs. 5.23%. With correct preprocessing, however, we get a 1.01% absolute error rate reduction (a massive 19.3% relative reduction) over the team that ranked first.
We train eight networks (Table 1) on HWDB 1.1. All networks have 11 layers, counting input and output layers. The number of maps per layer varies from 100 to 450. We also try two different sizes for the first fully connected layer. The last layer always has 3755 neurons, one per class. The last four nets are trained on the HWDB 1.1 training set, i.e., the characters written by 240 persons. The first four nets are trained on the characters written by all 300 persons associated with the training and validation datasets.
We build nine MCDNN (Table 2) from the eight previously trained nets. Four of them are basic DNN with only one column. We submitted these simple DNN to the competition, too, because we were interested in their performance; initially we could not access the test set to check them ourselves, but now we can list them for completeness. Before the deadline, we had to select two models as official competition candidates. Based on the validation results, we chose MCDNN 2 and 8. They are also the best on the competition test set.
MCDNN always improve significantly over single DNN. The best MCDNN has 4.215% error, much lower than the best DNN error of 5.528%. This is an absolute reduction of 1.313% and a relative reduction of 23.75%, in line with our observations on other datasets.
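The absolute and relative reductions follow directly from the two error rates quoted above; a quick worked check:

```python
# Worked check of the reported error reductions (figures from the text).
best_dnn = 5.528    # best single-column (DNN) error, in percent
best_mcdnn = 4.215  # best multi-column (MCDNN) error, in percent

absolute = best_dnn - best_mcdnn          # 1.313 percentage points
relative = 100 * absolute / best_dnn      # 23.75 percent of the DNN error

print(f"{absolute:.3f}% absolute, {relative:.2f}% relative")
# prints: 1.313% absolute, 23.75% relative
```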
The competition organizers experimentally measured a human error rate of 3.87%. Our best MCDNN comes close, with 4.21% error. When its top ten predictions are considered, this MCDNN also achieves a record-breaking error rate of 0.291%, which will be important for more complex context-driven systems using linguistic models.
Despite its size, the best MCDNN can classify 45 characters per second on a single NVIDIA GTX 580. Running on all four cores of an Intel Core i5 2400 3.1GHz, the same MCDNN is 14.29 times slower, requiring 315ms per character. Further speedups can be obtained by optimizing the code for this particular problem or by using more GPUs and/or CPUs.
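The quoted GPU and CPU figures are mutually consistent up to rounding, as a small back-of-the-envelope check shows (all numbers from the text):

```python
# Consistency check of the reported classification speeds.
gpu_chars_per_s = 45     # single NVIDIA GTX 580
cpu_ms_per_char = 315    # four cores of an Intel Core i5 2400, 3.1 GHz

gpu_ms_per_char = 1000 / gpu_chars_per_s      # about 22.2 ms per character
slowdown = cpu_ms_per_char / gpu_ms_per_char  # about 14.2x, matching the
                                              # quoted 14.29 up to rounding
```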
Although there are 3755 classes of handwritten Chinese characters, our MCDNN can classify them with almost human performance. They are nearly one fifth better than the best previous artificial method. Recognition speed on GPUs is high, and scales linearly with their number. A thorough error analysis by native speakers/writers (none of us speaks Chinese) could help to show if there is still room for improvement, or if the remaining errors are just due to illegible characters. Even without additional context-driven linguistic models (which will further reduce errors), our method is ready for practical applications.
This work was partially supported by the Supervised Deep / Recurrent Nets SNF grant, Project Code 140399.
Proceedings of International Joint Conference on Artificial Intelligence, pages 1237–1242, 2011.
Proceedings of Computer Vision and Pattern Recognition, pages 3642–3649, 2012.