This paper studies font recognition, i.e., identifying a particular typeface from an image of a text fragment. To apply machine learning to this problem, we require realistic text images with ground-truth font labels. However, such data is scarce and expensive to obtain, since labeling fonts demands a level of domain expertise that most annotators lack. It is therefore infeasible to collect a sufficiently large set of real-world training images. One way to overcome this challenge is to synthesize the training set by rendering text fragments for all the necessary fonts. However, we then face a domain mismatch between synthetic and real-world text images (Chen et al. (2014)): characters in real-world images are spaced, stretched and distorted in numerous ways. In Chen et al. (2014), the authors tried to bridge this gap by adding various degradations to the synthetic data, but introducing all possible real-world degradations into the training data is ultimately infeasible.
We address this domain mismatch problem in font recognition by leveraging a large corpus of synthetic data to train a Convolutional Neural Network (CNN), combined with an adaptation technique based on a Stacked Convolutional Auto-Encoder (SCAE) that uses unlabeled real-world images. The proposed method achieves strong performance on real-world test images.
Our basic CNN architecture is similar to the popular ImageNet CNN structure of Krizhevsky et al. (2012), as depicted in Fig. 1. The numbers along the network pipeline specify the output dimensions of the corresponding layers. When the CNN model is trained entirely on a synthetic dataset, it suffers a significant performance drop when tested on real-world data, compared to its performance on a held-out synthetic validation set. The same drop is observed for other models, such as that of Chen et al. (2014), which uses training and testing sets with properties similar to ours. This points to a discrepancy between the distributions of synthetic and real-world examples.
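As a reminder of how the output dimensions along such a pipeline are determined, the spatial size of each convolutional or pooling layer's output follows the standard size formula. The sizes below are hypothetical examples for illustration, not the actual numbers from Fig. 1:

```python
def conv_out_size(in_size, kernel, stride=1, pad=0):
    """Spatial output size of a conv/pooling layer (floor division)."""
    return (in_size + 2 * pad - kernel) // stride + 1

# Hypothetical example: a 96x96 grayscale patch through a 5x5 conv
# with stride 2 and no padding, then 2x2 max-pooling with stride 2.
s1 = conv_out_size(96, kernel=5, stride=2)   # -> 46
s2 = conv_out_size(s1, kernel=2, stride=2)   # -> 23
print(s1, s2)
```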
Traditional approaches to handling this gap include pre-processing steps applied to the training and/or testing data (Chen et al. (2014)). The domain adaptation method of Glorot et al. (2011) extracts low-level features that represent both the synthetic and real-world data, based on a stacked auto-encoder (SAE). We extend this method by decomposing the basic CNN into two sub-networks. The first part extracts low-level visual features shared by both synthetic and real-world data, and is learned in an unsupervised way using unlabeled data from both domains. The second part learns higher-level discriminative features for classification, and is trained in a supervised way on top of the first part, using labeled data from the synthetic domain only.
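The two-part decomposition can be illustrated with a deliberately simplified numpy sketch: a PCA projection fitted on unlabeled data from both domains stands in for the unsupervised convolutional feature extractor, and a least-squares classifier fitted on labeled synthetic features only stands in for the discriminative layers. All data, sizes and the linear models here are illustrative stand-ins, not the paper's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled patches from both domains (rows = flattened image patches).
x_synth = rng.normal(size=(200, 64))
x_real = rng.normal(size=(200, 64))
x_all = np.vstack([x_synth, x_real])

# Part 1 (unsupervised, both domains): learn a shared low-level projection.
mean = x_all.mean(axis=0)
_, _, vt = np.linalg.svd(x_all - mean, full_matrices=False)
w_shared = vt[:16].T                       # top-16 principal directions

# Part 2 (supervised, synthetic only): fit a linear classifier on top.
labels = rng.integers(0, 4, size=200)      # 4 hypothetical font classes
y_onehot = np.eye(4)[labels]
feats = (x_synth - mean) @ w_shared
clf, *_ = np.linalg.lstsq(feats, y_onehot, rcond=None)

# Inference on real-world patches reuses the shared low-level features.
pred = np.argmax(((x_real - mean) @ w_shared) @ clf, axis=1)
```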
To train the first part, we exploit a Stacked Convolutional Auto-Encoder (SCAE) (Masci et al. (2011)). Its first two convolutional layers have a topology identical to the first two layers in Fig. 1, and its first and second halves are mirror-symmetric. The cost function is the mean squared error (MSE) between the input and reconstructed patches. After the SCAE is trained, its Conv. Layers 1 and 2 are imported into the CNN in Fig. 1. We adopt the SCAE implementation of Paine et al. (2014).
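A minimal sketch of the mirror-symmetric reconstruction idea, using a single tied-weight convolutional layer in plain numpy. The actual SCAE stacks two convolutional layers, includes nonlinearities, and is trained by backpropagation; those details are omitted here, and the patch and kernel sizes are hypothetical:

```python
import numpy as np

def correlate_valid(x, w):
    """'Valid' cross-correlation of a 2-D image with a square 2-D kernel."""
    k = w.shape[0]
    h, wd = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((h, wd))
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def convolve_full(h, w):
    """'Full' convolution: the transpose of the valid correlation above."""
    k = w.shape[0]
    return correlate_valid(np.pad(h, k - 1), w[::-1, ::-1])

rng = np.random.default_rng(0)
patch = rng.normal(size=(12, 12))            # hypothetical input patch
kernel = rng.normal(size=(3, 3)) * 0.1

code = correlate_valid(patch, kernel)        # encoder half
recon = convolve_full(code, kernel)          # mirrored decoder half (tied weights)
mse = np.mean((patch - recon) ** 2)          # the SCAE training objective
```

Minimizing this MSE over the kernel weights (here left untrained) is what drives the encoder to keep the low-level structure that both domains share.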
We also find that applying label-preserving data augmentations to the synthetic training data helps reduce the domain mismatch. Chen et al. (2014) added moderate distortions and corruptions, including noise, blur, rotations and shading effects. In addition, we vary the character spacings and aspect ratios when rendering training data. These steps are not useful for the method of Chen et al. (2014), which exploits very localized features, but they are very helpful in our case.
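These extra augmentations can be sketched as simple array operations on a grayscale patch; the function names and parameter values below are illustrative, not the values used in our training:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(img, sigma=0.05):
    """Additive Gaussian noise, clipped back to [0, 1]."""
    return np.clip(img + rng.normal(0, sigma, img.shape), 0.0, 1.0)

def stretch_width(img, factor=1.5):
    """Change the aspect ratio by nearest-neighbour resampling of columns."""
    new_w = int(round(img.shape[1] * factor))
    cols = np.linspace(0, img.shape[1] - 1, new_w).round().astype(int)
    return img[:, cols]

def widen_spacing(img, at, gap=4, fill=1.0):
    """Insert a blank background gap of `gap` columns at column `at`."""
    blank = np.full((img.shape[0], gap), fill)
    return np.hstack([img[:, :at], blank, img[:, at:]])

patch = rng.random((32, 64))              # hypothetical text patch in [0, 1]
aug = widen_spacing(stretch_width(add_noise(patch)), at=40)
```

All three operations preserve the font label, so the augmented patches can be fed to training unchanged.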
We implemented and evaluated the local feature embedding-based algorithm (LFE) of Chen et al. (2014) as a baseline, and compared it with our model. Our SCAE is first trained on a large collection of synthetic data and unlabeled real-world data, and its first convolutional layers are then exported to the CNN. The remaining layers are trained on labeled synthetic data covering 2,383 classes, which makes our problem quite fine-grained. Testing is conducted on the VFRWild325 dataset used by Chen et al. (2014), in terms of top-1 and top-5 classification errors. Our model achieves a 38.15% top-1 error and a 20.62% top-5 error, outperforming LFE by 6% and 10%, respectively.
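Top-1 and top-5 errors are the standard classification metrics; a small helper makes the definition concrete (the scores below are a toy example, not our actual predictions):

```python
import numpy as np

def topk_error(scores, labels, k):
    """Fraction of samples whose true label is NOT among the k highest scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]      # indices of the k best classes
    hit = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# Toy example with 3 samples and 4 classes.
scores = np.array([[0.1, 0.6, 0.2, 0.1],
                   [0.5, 0.1, 0.3, 0.1],
                   [0.2, 0.2, 0.3, 0.3]])
labels = np.array([1, 2, 0])

print(topk_error(scores, labels, k=1))   # 2/3: only sample 0 is right at top-1
print(topk_error(scores, labels, k=2))   # 1/3: sample 1 recovers at top-2
```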
- Chen et al. (2014) Chen, G., Yang, J., Jin, H., Brandt, J., Shechtman, E., Agarwala, A., and Han, T. X. Large-scale visual font recognition. In Proceedings of CVPR, pp. 3598–3605. IEEE, 2014.
- Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of ICML, pp. 513–520, 2011.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Proceedings of NIPS, pp. 1097–1105, 2012.
- Masci et al. (2011) Masci, J., Meier, U., Cireşan, D., and Schmidhuber, J. Stacked convolutional auto-encoders for hierarchical feature extraction. In Proceedings of ICANN, pp. 52–59. Springer, 2011.
- Paine et al. (2014) Paine, T., Khorrami, P., Han, W., and Huang, T. S. An analysis of unsupervised pre-training in light of recent advances. arXiv preprint arXiv:1412.6597, 2014.