Chinese characters are among the oldest writing systems in the world and are still widely used in many Asian countries such as China, Japan, and Korea. Although handwritten Chinese characters are easily recognized by most humans, recognizing them remains challenging for computers due to their complex shape structures, the large number of categories, and great variations in writing style. Therefore, automatic online handwritten Chinese character recognition (OLHCCR) has been widely studied in the past decades for the development of human intelligence.
Traditional methods for OLHCCR generally require designing effective hand-crafted features, upon which classification is performed via traditional machine learning algorithms such as the modified quadratic discriminant function (MQDF) [16]. With the great success of deep learning techniques, deep neural networks have gradually come to dominate the field of OLHCCR [28, 30, 31], outperforming traditional methods by a large margin.
The latest and most popular choice for OLHCCR is to adopt the convolutional neural network (CNN) on feature images [28, 30], which converts the problem of character recognition into image classification. To adopt the 2D-CNN for OLHCCR, the most intuitive way is to render handwriting trajectories as static images. To further incorporate the temporal information of handwriting, domain-specific knowledge is typically utilized for extracting directional feature images, which improves the recognition accuracy. However, we argue that representing handwriting as images is not an ideal solution. Specifically, different from natural images with varied colours and rich content, character images are typically black-and-white and binary-valued, providing response values only for strokes while leaving large blank areas as background. Therefore, such character images typically contain significantly more redundant information than natural images. Moreover, transforming 1D trajectories into 2D images increases the data dimension, and thus it may result in huge computational consumption and a heavy demand for extra parameters.
Recent advances [31, 4] show that OLHCCR can also be effectively addressed by applying the recurrent neural network (RNN) or 1D-CNN on temporal trajectories, avoiding the extraction of image-like representations. However, the RNN remains less amenable to parallelization and suffers from low computational speed due to its recurrent computation mechanism, while the 1D-CNN requires stacking extremely deep layers to learn long-term dependencies of long sequences due to the locality of convolutions. Moreover, since both networks only focus on exploiting the temporal information of sequences, they are unsuitable for scenarios in which temporal information is lacking or disturbed.
Instead of viewing characters as either images or trajectories, here we propose to represent characters as geometric graphs, naturally retaining both spatial structures and temporal orders. Accordingly, we propose a novel spatial graph convolutional network (SGCN) to effectively classify those unstructured graphs, as shown in Fig. 1. Particularly, we demonstrate that the graph representation has its unique advantages: (1) compared with images, graphs provide a similarly natural visual representation, yet keep the essential information in a more compact form. This may lead to much lower computational complexity and fewer model parameters when performing convolutions; (2) compared with trajectories, graphs can store the temporal orders of sequences within the directed edges, while further explicitly revealing the geometric structures. As a result, when the temporal information is lacking, graphs are more discriminative than trajectories and thus can be applied to stricter scenarios.
Our contributions are summarized as follows:
We propose a compact and efficient geometric graph representation for online handwritten characters, which naturally retains the spatial structures and temporal orders by viewing characters as graphs.
We propose a novel spatial graph convolutional network (SGCN) for OLHCCR for the first time. The proposed method is largely different from the latest popular methods including the 2D-CNN on feature images, and the LSTM or 1D-CNN on temporal trajectories.
The proposed graph-based architecture is fully end-to-end, requiring no human effort once graphs are constructed. Moreover, this is the first time the effectiveness of the SGCN has been verified for large-category pattern recognition (nearly four thousand different classes).
Experiments are conducted on benchmark datasets (including IAHCC-UCAS2016, ICDAR-2013, and UNIPEN), which demonstrate that the SGCN is very competitive with the state-of-the-art methods for OLHCCR.
2 Related Work
Online Handwritten Chinese Character Recognition
OLHCCR has attracted great interest over the past decades, and tremendous achievements have been witnessed in recent years. Specifically, [16] proposed a standard machine learning-based framework for OLHCCR, which first extracts hand-crafted features from the normalized characters and then performs classification with the modified quadratic discriminant function (MQDF). As an alternative to MQDF, [21] recently introduced sparse representation based classification for air-written Chinese characters.
With the great impact of deep learning, the CNN has been gradually applied to OLHCCR [28], which regards character recognition as image classification. Furthermore, [30] proposed to incorporate traditional domain-specific knowledge for extracting feature images, thus further improving the recognition performance of the CNN. However, this method faces two problems: (1) the complex domain-specific knowledge required for extracting feature images and (2) the demand for heavy computation and a large number of parameters after increasing the data dimension.
Instead of transforming online trajectories into image-like representations, [31] proposed to directly apply the recurrent neural network (RNN) on temporal trajectories, avoiding the complex domain-specific knowledge for extracting feature images. Unfortunately, the RNN remains less amenable to parallelization and suffers from low computational speed for long sequences. Instead, [4] proposed to directly classify temporal trajectories with the 1D-CNN, which empirically runs faster than the RNN for long sequences. However, to exploit long-term dependencies of sequences, the 1D-CNN requires stacking extremely deep layers due to the locality of convolutions. Moreover, both the RNN and 1D-CNN only focus on sequential learning, and thus they are unsuitable for scenarios in which temporal information is lacking (such as offline HCCR).
Rather than viewing characters as static images or temporal trajectories, here we propose to represent characters as geometric graphs. Accordingly, a spatial graph convolutional network (SGCN) is proposed for character recognition for the first time, addressing OLHCCR from a novel perspective.
Convolution on Graphs
Although the traditional CNN has achieved great success in many domains, it is limited to processing images with regular grid structures. Instead, recent advances have generalized convolutions to irregular unstructured graphs, and they can be divided into two broad categories: the spectral methods [14, 2, 29] and the spatial methods [17, 3, 26]. The spectral methods typically apply the Fourier transform on graphs, converting the graph convolution into multiplication in the spectral domain. However, the spectral methods require the input graphs to have identical structures; worse still, the spectral convolution only encodes the connectivity of nodes but ignores their geometric information. On the contrary, the spatial methods perform convolutions by aggregating the local neighbourhoods of graph nodes in the spatial domain via a weighted sum, where the edge weights are dynamically calculated depending on the local geometry. Therefore, our work follows the spirit of spatial graph convolutions to fully exploit the spatial geometric properties of character graphs. Moreover, we are among the first to verify the effectiveness of spatial graph convolution for large-category pattern classification (which involves nearly four thousand classes).
Our goal is to represent online handwritten characters as sparse geometric graphs and then perform convolutions on the constructed graphs for the final classification. The overview of the spatial graph convolutional network (SGCN) for character recognition is illustrated in Fig. 1, and the details of each part are described in the following sections.
3.1 Graph Construction
Handwriting typically contains a rich diversity of writing styles, and it also suffers from large variations in spatial size and location. Therefore, it is important to normalize handwritten characters to reduce those variations before extracting reliable features. Similar to [31], each character is first normalized into a standard $xy$-coordinate system with its shape unchanged. Moreover, each handwriting stroke is further re-sampled at the same interval.
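The normalization and re-sampling steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `normalize` and `resample`, the unit target box, and the sampling interval `step` are all hypothetical, since the text does not give these values.

```python
import math

def normalize(strokes, size=1.0):
    """Shift and uniformly scale all strokes into a [0, size] box (aspect ratio kept)."""
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    min_x, min_y = min(xs), min(ys)
    # One scale for both axes, so the character shape is unchanged.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    scale = size / span
    return [[((x - min_x) * scale, (y - min_y) * scale) for x, y in s] for s in strokes]

def resample(stroke, step):
    """Re-sample one stroke so consecutive points are `step` apart along the path."""
    if len(stroke) < 2:
        return list(stroke)
    out = [stroke[0]]
    need = step  # path length still to travel before the next sample
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0
        while seg - pos >= need:
            pos += need
            t = pos / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            need = step
        need -= seg - pos
    return out
```

Using a single scale factor for both axes is what keeps the shape unchanged; scaling each axis independently would distort the character.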
After the normalization, we obtain a point sequence of length $N$ with absolute coordinates $\{(x_i, y_i)\}_{i=1}^{N}$. As shown in Fig. 2, we propose to construct a directed geometric graph $G = (V, E, P)$ from the given point sequence of length $N$, where $V$ denotes the node-set, $E$ denotes the edge-set, and $P$ denotes the coordinate-set of all nodes. In the graph, each point corresponds to a node entity $v_i$, and the node-set $V$ includes all the sampling points in the given sequence, where $|V| = N$. Accordingly, the absolute coordinate of each node $v_i$ is stored in $P$, where $p_i = (x_i, y_i)$. Moreover, the edge-set $E$ is constructed depending on the node connectivity, where $e_{ij} \in E$ denotes that $v_i$ points to $v_j$. Additionally, $|E| \ll |V|^2$ in a sparse graph. Lastly, the remaining issue is how to define effective features for each node $v_i$, and we will discuss this next.
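As a rough sketch of the construction above, the following Python function turns a list of strokes into a coordinate list and a directed edge list. It assumes, for illustration only, that edges link each point to its successor within the same stroke; whether consecutive strokes are also linked is not stated in the text, and the name `build_graph` is hypothetical.

```python
def build_graph(strokes):
    """Return (coords, edges) for a list of strokes, each a list of (x, y) points."""
    coords, edges = [], []
    for stroke in strokes:
        start = len(coords)
        coords.extend(stroke)
        # Directed edge from every point to its successor within the same stroke.
        edges.extend((i, i + 1) for i in range(start, len(coords) - 1))
    return coords, edges

# Two strokes with 3 and 2 points give 5 nodes and 3 directed edges.
coords, edges = build_graph([[(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1)]])
print(len(coords), edges)  # 5 [(0, 1), (1, 2), (3, 4)]
```

The edge count grows linearly with the number of points, which is what makes the graph sparse.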
3.2 Feature Extraction Module
Once the directed graph $G$ is constructed from the trajectory, the spatial structure of the character is stored in the position-set $P$ and the temporal information is embedded in the set of directed edges $E$. Intuitively, we define the features $f_i$ of node $v_i$ to retain both spatial and temporal information, i.e.,
$$f_i = [\,p_i,\ \Delta p_i,\ d_i\,],$$
where $p_i$ denotes the absolute coordinate, $\Delta p_i$ denotes the writing offset, and $d_i$ denotes the writing direction of node $v_i$. Particularly, we regard $p_i$ as the spatial features and $(\Delta p_i, d_i)$ as the temporal features.
It should be noted that all those features can be computed from the directed graph with differentiable operations, and thus we can integrate the feature extraction into the back-propagation procedure rather than the pre-processing stage. After that, we add self-connections for all the nodes and further transform the directed graph into the undirected one. This eventually leads the graph convolution to fully incorporate information from the node and all its neighbourhoods.
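The feature extraction and the graph symmetrization described above might look as follows in plain Python. This is only a sketch under assumptions: the writing offset is taken as the vector to the node's successor, the writing direction as that vector normalized to unit length, and nodes without a successor get zeros. The exact definitions, and both function names, are assumptions rather than details given in the text.

```python
import math

def node_features(coords, edges):
    """Per node: [x, y, dx, dy, ux, uy] from its outgoing edge (zeros if it has none)."""
    succ = {i: j for i, j in edges}  # assumes at most one outgoing edge per node
    feats = []
    for i, (x, y) in enumerate(coords):
        dx = dy = ux = uy = 0.0
        if i in succ:
            xj, yj = coords[succ[i]]
            dx, dy = xj - x, yj - y              # writing offset
            norm = math.hypot(dx, dy)
            if norm > 0:
                ux, uy = dx / norm, dy / norm    # writing direction (unit vector)
        feats.append([x, y, dx, dy, ux, uy])
    return feats

def symmetrize_with_self_loops(num_nodes, edges):
    """Make the edge set undirected and add a self-connection for every node."""
    out = set()
    for i, j in edges:
        out.add((i, j))
        out.add((j, i))
    out.update((i, i) for i in range(num_nodes))
    return sorted(out)
```

Computing the features before symmetrizing matters: once the edges are undirected, the writing order can no longer be read from the edge set alone.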
3.3 Spatial Graph Convolution
Formally, the convolution at the position $p$ of a 2D grid image (or a single feature map) $f$ can be defined as
$$(f * g)(p) = \sum_{q \in \mathcal{R}(p)} f(q)\, g(\Delta q),$$
where $\mathcal{R}(p)$ denotes the local region centred around the position $p$, $\Delta q = q - p$ denotes the coordinate offset to the center position $p$, and $g$ denotes the convolutional kernel. As shown in Fig. 3 (left), when $f$ is represented as a regular 2D grid structure (e.g. an image), the convolution kernel $g$ is easy to implement with a learnable $k \times k$ matrix, where $k$ is well known as the kernel size.
However, it is not straightforward to directly apply the convolution to unstructured graphs, since the structures of local regions at different positions in a graph can be largely different (e.g. Fig. 3 (middle)). Considering that the convolution aggregates information from the local neighbourhoods, the convolution at the node $v_i$ of graph $G$ can be defined as
$$(f * g)(v_i) = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} f_j \, g(u_{ij}),$$
where $f_j$ denotes the features of the node $v_j$, $\mathcal{N}(i)$ denotes the neighbourhood nodes of $v_i$, and $u_{ij}$ denotes the coordinate offsets from $v_i$ to $v_j$. Particularly, the term $1/|\mathcal{N}(i)|$ normalizes the cardinality of the corresponding subset. In graphs, the coordinate offsets $u_{ij}$ can take any possible values at different local regions; therefore, the remaining issue is how to define the spatial convolution kernel $g$ for handling the irregular local structures.
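The neighbourhood aggregation just described can be sketched for a single feature channel as below. The function name `graph_conv` and the convention that an edge `(i, j)` makes node `j` a neighbour of node `i` are illustrative assumptions; `kernel` is any callable mapping a 2D coordinate offset to a scalar weight.

```python
def graph_conv(feats, coords, edges, kernel):
    """Single-channel spatial graph convolution: mean over neighbours of f_j * g(u_ij)."""
    sums = [0.0] * len(feats)
    counts = [0] * len(feats)
    for i, j in edges:  # edge (i, j): node j is a neighbour of node i
        u = (coords[j][0] - coords[i][0], coords[j][1] - coords[i][1])
        sums[i] += feats[j] * kernel(u)
        counts[i] += 1
    # Divide by the neighbourhood size; isolated nodes produce 0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# A three-node chain with self-loops and a constant kernel reduces to a neighbourhood mean.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
edges = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
print(graph_conv([1.0, 2.0, 3.0], coords, edges, lambda u: 1.0))  # [1.5, 2.0, 2.5]
```

The cost is one kernel evaluation per edge, so it scales with the number of edges rather than with a fixed grid size.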
Intuitively, the basic idea is to fit a continuous function (i.e. a curved surface) depending on the distribution of $u_{ij}$ to approximate the convolution kernel $g$. Fortunately, many spatial convolution kernels have recently been proposed [17, 3, 26], and among them, [3] is demonstrated to perform the best in many scenarios. Hence, the spline graph convolution kernel (as shown in Fig. 3 (right)) is adopted in our work, which can be formally defined as
$$g(u) = \sum_{p \in \mathcal{P}} w_p \, B_p(u),$$
where $\mathcal{P}$ is the Cartesian product of the B-spline bases, $B_p$ is the B-spline function, and $w_p$ are the trainable parameters that control the height of the B-spline surface. More details of the B-spline convolutional kernel can be found in [3].
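To make the spline kernel more concrete, here is a small sketch of a continuous kernel built from degree-1 (linear) B-spline bases on a 2D offset. It assumes the offset has already been rescaled into the unit square and that the kernel is a `k × k` grid of trainable heights with `k >= 2`; the degree, the rescaling, and the function names are assumptions for illustration, not details confirmed by the text.

```python
def linear_basis(u, k):
    """Weights of the k degree-1 B-spline bases at u in [0, 1] (at most two are non-zero)."""
    pos = u * (k - 1)
    lo = min(int(pos), k - 2)   # index of the left control point
    frac = pos - lo
    w = [0.0] * k
    w[lo] = 1.0 - frac
    w[lo + 1] = frac
    return w

def spline_kernel(u, weights):
    """g(u) = sum over control points of weight * product of per-dimension bases."""
    k = len(weights)
    bx = linear_basis(u[0], k)
    by = linear_basis(u[1], k)
    return sum(weights[a][b] * bx[a] * by[b] for a in range(k) for b in range(k))

# With kernel size 3, the heights are interpolated bilinearly between control points.
heights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(spline_kernel((0.25, 0.5), heights))  # 3.5
```

Because only a few bases are non-zero for any given offset, evaluating the kernel touches a constant number of weights regardless of where the neighbour lies.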
3.4 Spatial Transform Network
Since the geometric structure of a graph corresponds to the absolute coordinates of all the points, the node features as well as the graph convolutions are not invariant to certain geometric transformations of graphs (e.g. rotation, scaling, and translation). Inspired by [19], the spatial transform network (STN) is utilized to align the input graph into a canonical coordinate system before feature extraction.
Specifically, we first encode the geometric structure $P$ into a high-dimensional feature vector, and a transformation matrix $T$ is then predicted based on the embedded vector. In our task, we further constrain $T$ to be the similarity transform for simplicity and stability, i.e.,
$$T = \begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \end{bmatrix},$$
where $\theta$ denotes the rotation angle, $s$ denotes the scale, and $(t_x, t_y)$ denotes the translation of the $xy$-coordinates, all of which are calculated from the embedded vector. Therefore, the aligned geometric structure is computed by applying $T$ to the coordinates in $P$. Furthermore, this idea can be extended to the node features, where a predicted 1-dimensional vector is reshaped into a square feature transform matrix, which is not constrained here. Ideally, this STN aligns the intermediate features into a latent canonical space and thus helps learn geometrically invariant features, essentially benefiting the final classification.
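Applying a similarity transform to the node coordinates can be sketched as follows. In the method described above, the rotation angle, scale, and translation are predicted by a network; here they are simply passed in as arguments, and the function name `similarity_transform` is hypothetical.

```python
import math

def similarity_transform(points, theta, scale, tx, ty):
    """Rotate by theta, scale uniformly, then translate every (x, y) point."""
    c = scale * math.cos(theta)
    s = scale * math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

# Rotating (1, 0) by 90 degrees with scale 2, then shifting by (1, 1).
x, y = similarity_transform([(1.0, 0.0)], math.pi / 2, 2.0, 1.0, 1.0)[0]
print(round(x, 6), round(y, 6))  # 1.0 3.0
```

Restricting the transform to four parameters rules out shear and non-uniform scaling, which would change the character's shape rather than just its pose.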
3.5 Hierarchical Residual Structure
Traditional graph neural networks (GNNs) are inherently flattened and cannot learn the hierarchical representation of graphs [2, 15], since GNNs (i) lack the effective and efficient pooling for coarsening unstructured graphs and (ii) also suffer from the gradient vanishing and over-smoothing problems when stacking more layers. As a result, this limitation makes GNNs especially problematic for graph classification (which associates a label with an entire graph). To address this problem, a hierarchical residual structure is adopted for our SGCN, which is capable of fully incorporating the local neighbourhood information and exploiting the global shape properties. To this end, the following two key ideas are introduced for our SGCN:
Cluster-based Pooling [2] derives a clustering on all the graph nodes and then aggregates the nodes of the same cluster with newly computed coordinates, which eventually results in a coarsened graph.
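One simple way to realize such a pooling step is sketched below. It is an assumption-laden illustration: nodes are clustered by a regular grid of side `cell`, cluster coordinates are the mean of the member coordinates, features are max-pooled, and edges are remapped between clusters. The text does not say which clustering or aggregation the authors use, so the grid clustering and the name `grid_pool` are stand-ins.

```python
def grid_pool(coords, feats, edges, cell):
    """Cluster nodes by grid cell; average coordinates, max-pool features, remap edges."""
    index = {}    # grid cell -> cluster id
    assign = []   # node id -> cluster id
    for x, y in coords:
        key = (int(x // cell), int(y // cell))
        assign.append(index.setdefault(key, len(index)))
    members = [[] for _ in index]
    for node, cluster in enumerate(assign):
        members[cluster].append(node)
    new_coords = [(sum(coords[i][0] for i in m) / len(m),
                   sum(coords[i][1] for i in m) / len(m)) for m in members]
    new_feats = [max(feats[i] for i in m) for m in members]
    # Keep one edge per pair of distinct clusters; intra-cluster edges vanish.
    new_edges = sorted({(assign[i], assign[j]) for i, j in edges if assign[i] != assign[j]})
    return new_coords, new_feats, new_edges
```

For example, pooling three nodes at x = 0.1, 0.3, and 0.9 with `cell=0.5` merges the first two into one node and leaves a two-node graph with a single edge.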
As shown in Fig. 4(a), the proposed SGCN follows the standard hierarchical residual structure of conventional deep CNNs, while Fig. 4(b) details the spatial transform network (STN) and Fig. 4(c) details the residual graph convolutional block (Rs-GCB). Additionally, each convolutional layer is followed by batch normalization [10] and PReLU [7] for better convergence, and dropout [24] is also utilized for good generalization. Finally, the SGCN is trained end-to-end by minimizing the normalized cross-entropy [25].
3.6 Complexity Analysis of Graph Convolution
Here we compare the computational complexity of convolutions on images and graphs for the task of OLHCCR. Generally, both images and graphs can be represented as graphs with their nodes and edges, i.e., $G = (V, E)$. Then, the complexity of a single-channel convolution on a graph should be $O(k|V|)$, where $k$ denotes the average number of edges for each node, i.e., $k = |E|/|V|$. Specifically, for an image with height $H$ and width $W$, we get $|V| = HW$ and $k = 9$ (i.e. eight neighbours and the node itself); for a sparse character graph, we can safely assume that $k \approx 3$ (i.e. the former point and the subsequent point in temporal order and the node itself), since most of the nodes are not intersection points. As a result, the convolution complexity comparison between the image and graph should be $O(9HW)$ versus $O(3|V_g|)$, where $V_g$ denotes the node-set of the character graph. Empirically, to achieve comparable accuracy for OLHCCR, the resolution of an input image is typically fixed, while a character graph only contains nearly 100 nodes on average. On this condition, since $HW$ far exceeds 100 at any practical image resolution, the convolution on images is much more expensive than that on character graphs.
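As a worked illustration of this comparison, take an image of height $H$ and width $W$ with nine edges per pixel, and a character graph with node-set $V_g$ and about three edges per node. The $32 \times 32$ resolution below is an assumed example value chosen only for the arithmetic, not a figure taken from the text; the 100-node graph size is the average stated above.

```latex
\frac{\text{image cost}}{\text{graph cost}}
  = \frac{9\,HW}{3\,|V_g|}
  = \frac{3\,HW}{|V_g|},
\qquad
H = W = 32,\ |V_g| = 100:\quad
\frac{9 \cdot 1024}{3 \cdot 100} = \frac{9216}{300} \approx 30.7 .
```

Under these assumed numbers a single-channel convolution on the image would cost roughly thirty times more than on the graph, and the gap widens quadratically with the image side length.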
Online handwritten Chinese character datasets are as follows:
IAHCC-UCAS2016 is a public in-air handwritten Chinese character dataset, where each character is written in midair within a single stroke. The dataset contains 431,825 samples in total covering 3755 Chinese characters (level-1 set of GB2312-80), where each class contains 115 different samples. Similar to previous works, 92 samples per class are chosen as the training set, and the remaining 23 as the test set.
ICDAR-2013 [28] is the most popular dataset of online handwritten Chinese characters, collected by the CASIA institution. For the OLHCCR task, the sub-datasets CASIA-OLHWDB 1.0 & 1.1 are used as the training set, which contains 2,693,183 samples; and the ICDAR-2013 competition sub-dataset is used as the test set, which contains 224,590 samples of 3755 character classes (level-1 set of GB2312-80).
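The IAHCC-UCAS2016 split sizes follow directly from the per-class counts given above, as this short check shows. The constant names are illustrative; the three input numbers come from the dataset description.

```python
NUM_CLASSES = 3755        # level-1 set of GB2312-80
SAMPLES_PER_CLASS = 115
TRAIN_PER_CLASS = 92

total = NUM_CLASSES * SAMPLES_PER_CLASS
train = NUM_CLASSES * TRAIN_PER_CLASS
test = total - train

print(total, train, test)  # 431825 345460 86365
```

The total matches the 431,825 samples reported for the dataset, and the split is 80% training to 20% test per class.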
4.2 Implementation Details
The whole architecture is implemented entirely on the PyTorch [18] deep learning platform. In experiments, the detailed configuration of our spatial graph convolutional network for OLHCCR is shown in Fig. 4, where the kernel size of each spline convolution is set to 3. Moreover, the dropout probability is set to 0.2 for each “Rs-GCB” module during training. The whole network is optimized via the ADAM algorithm [13] with a batch size of 128. Furthermore, the initial learning rate is set to 0.002 and then decayed by a factor of 0.1 when the performance stops improving. Finally, the training process is terminated when the model converges. All the experiments are conducted on a Dell workstation with an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 32 GB RAM, and two NVIDIA Quadro P5000 GPUs.
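The learning-rate schedule described above (start at 0.002, multiply by 0.1 when performance stops improving) can be sketched without any framework as below. The class name `PlateauDecay` and the `patience` parameter are hypothetical, since the text does not say how many stagnant epochs trigger a decay. PyTorch ships a comparable scheduler, `torch.optim.lr_scheduler.ReduceLROnPlateau`, but whether the authors used it is not stated.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=0.002, factor=0.1, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # hypothetical value, not given in the text
        self.best = float("-inf")
        self.stale = 0

    def step(self, accuracy):
        """Record one epoch's validation accuracy and return the learning rate to use next."""
        if accuracy > self.best:
            self.best = accuracy
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr
```

With `patience=2`, feeding accuracies 0.5, 0.6, 0.6, 0.6 keeps the rate at 0.002 for the first three epochs and drops it to 0.0002 after the fourth.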
4.3 Results with Varying Depths & Widths
To investigate the effectiveness of the proposed method for OLHCCR, we compare the performance of SGCNs with varying depths and widths, as shown in Table 1. We mainly search over network configurations on the small dataset (i.e. IAHCC-UCAS2016), hoping that a well-performing SGCN architecture can be successfully transferred to the large dataset (i.e. ICDAR-2013). Specifically, during the comparison, different SGCNs follow the same structure as shown in Fig. 4 but with different numbers of convolutional layers and channels in each “Rs-GCB” module. For example, “SGCN-#3” in Table 1 corresponds to the SGCN in Fig. 4, and “SGCN-#4” expands the width of “SGCN-#3” by a factor of 1.5. As shown in Table 1, either increasing the network depth or expanding its width can improve the recognition accuracy; however, both strategies demand many more parameters and much more training time. Moreover, this accuracy improvement tends to become marginal when the model is sufficiently deep. In general, we prefer “SGCN-#3” and “SGCN-#4” for OLHCCR as the trade-off among accuracy, storage, and training time.
4.4 Ablation Study
To fully analyze the effectiveness of each part of the SGCN, we conduct a detailed ablation study of “SGCN-#3” on the IAHCC-UCAS2016 dataset, as shown in Table 2.
| Method | Digits (1a) | Upper case (1b) | Lower case (1c) |
| HMM + DTW | 97.1 | 92.8 | 90.7 |
Node Feature Analysis
We first analyze the effectiveness of different node features. As shown in Table 2, if no features are provided for graph nodes (i.e. the features of all nodes are set to the same value), the SGCN still achieves a fairly good accuracy. This indicates that the graph representation retains the geometric properties of characters well, and that the spatial graph convolution can exploit their spatial structures via local neighbourhood aggregation and hierarchical structure learning. Moreover, the recognition accuracy increases if we explicitly assign either spatial features (i.e. the absolute coordinates) or temporal features (i.e. writing offsets & directions) to the graph nodes. Finally, the SGCN achieves the best result by combining both spatial and temporal features, demonstrating the necessity of feature extraction.
Spatial Transformation Analysis
We further analyze the effectiveness of spatial transformation (ST) for inputs and features. We notice that ST for inputs does not bring a noticeable accuracy improvement for character recognition, which reveals that (1) variations in scales and positions can be effectively addressed by pre-processing and (2) the SGCN is robust to the small rotations of characters. On the contrary, a significant accuracy increase is observed when the ST is adopted in the feature space. This may indicate that the ST transforms high dimensional features into the latent canonical forms, and thus it helps learn the geometric invariant features. This eventually benefits the final classification.
Extension to Digits and Letters
To demonstrate that the SGCN is language-independent, we evaluate the SGCN on the UNIPEN dataset [6], which consists of isolated digits (1a), upper case English letters (1b), and lower case English letters (1c). The detailed configuration of the SGCN for UNIPEN is “STN → FeatLayer → Rs-GCB(32) → Pooling → STN → Rs-GCB(64) → Pooling → Global-Average → FC(128) → Softmax”. As shown in Table 3, the SGCN achieves accuracies comparable with previous methods on the UNIPEN datasets. This indicates that the SGCN is not limited to Chinese characters and can also be easily extended to other languages.
4.5 Benchmarking Results
Lastly, we compare the SGCN with previous methods on benchmark datasets (including IAHCC-UCAS2016 and ICDAR-2013), as listed in Table 4 and Table 5. Particularly, we notice that the SGCN on graphs requires far fewer parameters than the 2D-CNN on feature images to achieve comparable accuracy. Moreover, the SGCN also achieves comparable accuracy and similar storage compared with the 1D-CNN or RNN on temporal features; however, the accuracy of the SGCN does not heavily rely on the temporal information (as shown in Table 2), while the 1D-CNN and RNN will not work for classification when the temporal orders of trajectories are lacking or disturbed. Overall, experiments demonstrate that the SGCN is comparable with the state-of-the-art methods, yet the SGCN not only has its unique advantages but also addresses OLHCCR from a completely different perspective.
In this paper, we have proposed a novel spatial graph convolutional network (SGCN) for OLHCCR, upon which characters are viewed as geometric graphs. This is largely different from the latest popular methods including the 2D-CNN, 1D-CNN, and LSTM, thus addressing character recognition in a novel perspective. Experiments on benchmarks demonstrate the effectiveness of SGCN for OLHCCR. In future work, we plan to (1) design new graph convolution kernels for better performance and (2) extend our SGCN to offline HCCR.
- [1] (2004) The writer independent online handwriting recognition system frog on hand and cluster generative statistical dynamic time warping. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (3), pp. 299–310.
- [2] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
- [3] (2018) SplineCNN: fast geometric deep learning with continuous B-spline kernels. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 869–877.
- [4] (2019) A new perspective: recognizing online handwritten Chinese characters via 1-dimensional CNN. Information Sciences 478, pp. 375–390.
- [5] (2020) Compressing the CNN architecture for in-air handwritten Chinese character recognition. Pattern Recognition Letters 129, pp. 190–197.
- [6] (1994) UNIPEN project of on-line data exchange and recognizer benchmarks. In International Conference on Pattern Recognition.
- [7] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision, pp. 1026–1034.
- [8] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- [9] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- [10] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- [11] (2019) Dynamic weight alignment for temporal convolutional neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3827–3831.
- [12] (2017) Multi-language online handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1180–1194.
- [13] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- [14] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
- [15] (2019) DeepGCNs: can GCNs go as deep as CNNs? In IEEE International Conference on Computer Vision, pp. 9267–9276.
- [16] (2013) Online and offline handwritten Chinese character recognition: benchmarking on new databases. Pattern Recognition 46 (1), pp. 155–162.
- [17] (2017) Geometric deep learning on graphs and manifolds using mixture model CNNs. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124.
- [18] (2017) PyTorch: tensors and dynamic neural networks in Python with strong GPU acceleration.
- [19] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660.
- [20] (2018) Data augmentation and directional feature maps extraction for in-air handwritten Chinese character recognition based on convolutional neural network. Pattern Recognition Letters 111, pp. 9–15.
- [21] (2018) In-air handwritten Chinese character recognition with locality-sensitive sparse representation toward optimized prototype classifier. Pattern Recognition 78, pp. 267–276.
- [22] (2019) Recognizing online handwritten Chinese characters using RNNs with new computing architectures. Pattern Recognition 93, pp. 179–192.
- [23] (2017) An end-to-end recognizer for in-air handwritten Chinese characters based on a new recurrent neural networks. In International Conference on Multimedia and Expo, pp. 841–846.
- [24] (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- [25] (2018) CosFace: large margin cosine loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274.
- [26] (2019) Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 146.
- [27] (2016) DropSample: a new training method to enhance deep convolutional neural networks for large-scale unconstrained handwritten Chinese character recognition. Pattern Recognition 58, pp. 190–203.
- [28] (2013) ICDAR 2013 Chinese handwriting recognition competition. In International Conference on Document Analysis and Recognition, pp. 1464–1470.
- [29] (2018) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI).
- [30] (2017) Online and offline handwritten Chinese character recognition: a comprehensive study and new benchmark. Pattern Recognition 61, pp. 348–360.
- [31] (2018) Drawing and recognizing Chinese characters with recurrent neural network. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 849–862.