The promise of a paperless society, much like cold fusion and flying cars, has always been a decade away. Although we come closer to this dream by the day, many industries still rely heavily on paper-based processes. Human communication is vague and context-dependent; even given a precisely structured form to fill out, people find a way to subvert the structure to convey what they really want to say. This tendency to color outside the lines makes automating paper-based processes difficult; however, recent advances in computer vision and natural language processing have pushed us further toward that utopian paperless future.
Document analysis systems follow a general abstract structure, with three overall components: text/structure extraction, page classification/sorting, and content understanding. In a commercial setting there is almost always extraneous and possibly confounding information contained within a submitted document, requiring a robust and flexible way to classify the pages of a given document to ensure the isolation of the correct source of information. This is especially relevant when building a larger pipeline where downstream processes rely on the results of page classification. In these situations, a small incremental boost in classification performance can net much larger performance boosts for the overall pipeline.
The topic of document page image classification has received much attention over the last few years. In fact, the RVL-CDIP dataset harley2015icdar was curated specifically to test image classification strategies on document images. Earlier studies focused heavily on the original AlexNet architecture Krizhevsky:2012:ICD:2999134.2999257 harley2015icdar ; tensmeyer2017analysis . More recently, modern architectures such as VGG16 simonyan2014deep , GoogLeNet DBLP:journals/corr/SzegedyLJSRAEVR14 , and ResNet50 DBLP:journals/corr/HeZRS15 have been proposed and tested on RVL-CDIP Afzal_2017 . The current state-of-the-art utilizes a set of 5 distinct VGG16 models: one for the whole image (known as the holistic model, initialized with pretrained ImageNet weights) and 4 for specific subsections of the image (header, footer, left body, and right body, initialized from the holistic model's trained weights). These 5 models are then combined to form a final prediction das2018document . While accurate, the number of parameters is immense (on the order of hundreds of millions across the five VGG16 models) and the training process is sequential, requiring the holistic model to be trained before any of the subsection models can be trained.
In addition to the aforementioned image classification strategies, we can take advantage of optical character recognition (OCR) technology to extract text from document page images and train text classification algorithms. Modern OCR systems are not infallible, especially in the context of low-quality scanned documents. Typically the output of an OCR system will contain transcription errors (e.g. mistaking i for l and vice versa) due to noise in the source image. Many approaches have been developed to deal with text classification problems DBLP:journals/corr/abs-1904-08067 , although most were developed under the assumption of clean encoded text. There is evidence to support that the bag-of-words approach is quite robust to these unavoidable transcription errors vinciarelli2005noisy ; agarwal2007much . To the best of our knowledge, there are no studies showing a similar analysis for word embedding methods; however, it can be hypothesized that transcription errors are amplified in the embedding space, opening an avenue for future research.
Given both the image and text classification approaches, it is natural to design a system that combines both to form a joint modelling approach, typically referred to as a multimodal classification model. This is not a new idea; in fact, we find literature dating back before the explosion in popularity of convolutional neural networks (CNNs) for image classification augereau2014improving . More recently, a study was conducted utilizing a procedure similar to the proposed work, with a focus on minimal model footprint in commercial applications audebert2019multimodal . Another commercial study was also conducted, utilizing the proposed abstract structure with a private dataset enginmultimodal . Both of these studies suggest that adding text information improves model performance substantially.
In this work we explore combining both approaches into a single classification task, i.e., we construct a model that uses both the visual information and the textual content of a page to make a decision. To test the proposed architecture we take advantage of an open and freely available dataset, RVL-CDIP (https://www.cs.cmu.edu/~aharley/rvl-cdip/). We show that the proposed method exceeds the current state-of-the-art performance on this dataset with a test accuracy of 93.03%.
II Proposed Method
II.1 Text Extraction
We utilize the open-source Tesseract OCR engine smith2007overview (https://github.com/tesseract-ocr/tesseract) to extract text from all images in the RVL-CDIP dataset. It is important to note that the only preprocessing step involved is resizing such that the longest dimension is 3300 pixels. This choice was made to ensure conformity to the suggested minimum DPI of 300, under the assumption that every page is standard letter size (this appears to generally be true for this dataset). We use the combined legacy/LSTM engine (oem 3) and the standard page segmentation mode (psm 3) parameters for this extraction.
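The extraction step above can be sketched as follows. The helper names and the direct CLI invocation are illustrative (the paper does not publish its exact script), but the resize rule and the oem/psm flags match those stated.

```python
import subprocess

def target_size(width: int, height: int, long_side: int = 3300) -> tuple:
    """Scale so the longest dimension becomes 3300 px, preserving aspect
    ratio (~300 DPI under the letter-size assumption)."""
    scale = long_side / max(width, height)
    return round(width * scale), round(height * scale)

def run_tesseract(image_path: str, out_base: str) -> None:
    """Invoke the Tesseract CLI with the combined legacy/LSTM engine
    (oem 3) and automatic page segmentation (psm 3)."""
    subprocess.run(
        ["tesseract", image_path, out_base, "--oem", "3", "--psm", "3"],
        check=True,
    )
```

For example, a 1700 × 2200 px scan (letter size at 200 DPI) would be resized to 2550 × 3300 px before OCR.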
II.2 Abstract Model Architecture
We define the abstract structure of the model as having three components: an image classifier, a text classifier, and a meta-classifier that joins the two prior components' predictions into one (Fig. 1). We opt for the "late fusion" scheme for joining predictions: assuming each classifier has an output of dimension N, where N is the number of classes, our meta-classifier is a mapping from ℝ^N × ℝ^N to ℝ^N. That is to say, the meta-classifier takes two outputs and maps them to one.
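Concretely, the late-fusion input can be sketched as the concatenation of the two classifiers' softmax outputs (here with N = 16 for RVL-CDIP); the function name is ours, not from the paper.

```python
import numpy as np

def late_fusion_features(image_probs: np.ndarray, text_probs: np.ndarray) -> np.ndarray:
    """Concatenate two N-dimensional class-probability vectors into the
    2N-dimensional vector the meta-classifier maps back down to N classes."""
    return np.concatenate([image_probs, text_probs], axis=-1)
```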
The modular nature of this structure allows for the swapping of different classifiers with relative ease and seems to point to a possible generalized procedure for developing page classification modules within a document analysis pipeline. Extending this idea it is easy to imagine that if a new representation was developed (graph representation for example) then one could add a new model trivially without the need of retraining the other two components and only needing to update the meta-classifier.
II.3 Image Model Architectures
We utilize two standard CNN architectures for the image models: the first is AlexNet with added batch normalization (Fig. 2a), the second is VGG16 (Fig. 2b). Both models share the same input dimensions and a 16-neuron softmax output layer (corresponding to the 16 classes of RVL-CDIP, similar to those in Afzal et al. Afzal_2017 ). Since the source images are grayscale, we convert these to RGB and rescale the pixel values to lie within a fixed range.
II.4 Text Model Architectures
The raw text is first preprocessed into one-hot vectors, that is to say that each document is represented by a binary vector whose components indicate the presence of the word corresponding to that index. These document vectors are fed into a relatively shallow network (Fig. 3). We denote these models as Bag-of-Words (BoW) followed by the number of vocabulary items retained; for example, BoW-100K refers to the bag-of-words model with 100 000 vocabulary words used as features, meaning the input vectors are 100 000-dimensional.
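A minimal sketch of this featurization using scikit-learn (which the paper uses for text preprocessing); the toy documents are illustrative, and max_features=100_000 corresponds to the BoW-100K variant.

```python
from sklearn.feature_extraction.text import CountVectorizer

# binary=True yields one-hot (presence/absence) vectors rather than counts
vectorizer = CountVectorizer(binary=True, max_features=100_000)
docs = ["total amount due invoice", "dear sir please find the memo attached"]
X = vectorizer.fit_transform(docs)  # sparse binary matrix, one row per page
```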
The meta-classifier in all experiments is an XGBoost model Chen:2016:XST:2939672.2939785 . We do not use any regularization parameters, instead opting to limit the depth of the trees (to a maximum depth of 3) to control for overfitting. The minimal tuning required for this classifier makes it an ideal candidate for meta-classification.
III Experiments and Results
All network models are generated using Keras chollet2015keras with the Tensorflow backend tensorflow2015-whitepaper . We also utilize a number of modules from scikit-learn scikit-learn to preprocess the text, and we take advantage of the XGBoost library for the meta-classifier. We consistently surpass the current state of the art; however, exact replication with Tensorflow on GPU is a continuing challenge, with many possible sources of non-deterministic behavior (see https://github.com/NVIDIA/tensorflow-determinism).
In their study, Tensmeyer et al. tensmeyer2017analysis suggest that slight shear augmentations during training provide the best generalization performance. We combine these shear augmentations with slight rotations in training both the VGG16 and AlexNet models. We also note that although 90-degree rotations do not improve performance on this task, in many real-world applications they are absolutely necessary, as the orientation of the page is not as tightly controlled. Additionally, we experimented with the addition of salt-and-pepper noise (random minimizing and maximizing of pixels) to simulate scanner effects; this too did not prove to be fruitful in terms of performance.
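For reference, the salt-and-pepper corruption we tried can be sketched as below. The helper is our own and the noise fraction is illustrative, not a tuned value.

```python
import numpy as np

def salt_and_pepper(img: np.ndarray, amount: float = 0.02, seed: int = 0) -> np.ndarray:
    """Randomly minimize (0) or maximize (255) a fraction of pixels to
    simulate scanner noise on an 8-bit grayscale image."""
    rng = np.random.default_rng(seed)
    noisy = img.copy()
    mask = rng.random(img.shape) < amount      # pixels to corrupt
    noisy[mask] = rng.choice([0, 255], size=int(mask.sum()))
    return noisy
```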
We utilize SGD with warm restarts DBLP:journals/corr/LoshchilovH16a ; however, we adjust the learning rate over batches as opposed to epochs, essentially reducing to a discontinuous one-cycle cosine-annealing learning rate schedule DBLP:journals/corr/abs-1803-09820 for optimization. The exact decay function is given by

η(b) = η_min + (1/2)(η_max − η_min)(1 + cos(πb/B)),

where η_max is the initial learning rate, η_min is the desired minimum learning rate, b is the batch number within the epoch, and B is the number of batches per epoch.
This policy works well across applications and remains consistent for all models (both image and text) with some adjustment to the bounds (Table 1). The policy tends to find a strong local minimum; however, it can accelerate past the best general solution. It may be worthwhile attenuating the schedule's peak-to-peak range over epochs or scaling the periodicity as training progresses, though this can decrease the optimizer's ability to "pop out" of local minima.
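The per-batch schedule can be implemented as a small function; the bounds below are placeholders, not the tuned values of Table 1.

```python
import math

def cosine_annealed_lr(batch: int, batches_per_epoch: int,
                       lr_max: float = 1e-2, lr_min: float = 1e-5) -> float:
    """Cosine-anneal from lr_max to lr_min within each epoch, restarting
    (discontinuously) at every epoch boundary."""
    b = batch % batches_per_epoch  # warm restart at each epoch boundary
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * b / batches_per_epoch))
```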
As each classifier is trained independently from one another we can see the results of each experiment and the remarkable boost that comes from the combination of different classifiers.
| Model | Validation Accuracy | Test Accuracy |
| --- | --- | --- |
| AlexNet random init. | 86.29% | 86.24% |
| VGG16 ImageNet init. | 90.45% | 90.24% |
| Image Model | Text Model | Validation Accuracy | Test Accuracy |
| --- | --- | --- | --- |
| Source | Reported Test Accuracy | Comments |
| --- | --- | --- |
| Afzal et al. Afzal_2017 | 90.97% | Single well-tuned VGG16 initialized on pretrained ImageNet weights. |
| Das et al. das2018document | 91.11% | Single well-tuned VGG16 initialized on pretrained ImageNet weights. |
| Das et al. das2018document | 92.21% | Ensemble of holistic and region-based VGG16s. |
| Proposed Work | 93.03% | Ensemble of VGG16 and MLP-based BoW models. |
| Proposed Work | 93.07% | Ensemble of all component models. |
We see that even the addition of a low-"resolution" bag-of-words model can generate significant lift on top of the image model's superior performance. It is also interesting to note that the combination of a randomly initialized AlexNet and BoW-10K beats the best reported test accuracy for a single image classifier das2018document , exceeding the performance of the well-tuned VGG16. The best performing model consists of a VGG16 image component and a 200 000-word text model, with a test accuracy of 93.03%.
The modular nature of this architecture also allows for the simultaneous ensembling of all the component models, resulting in a validation and test accuracy of 93.12% and 93.07% respectively. Although an interesting result, this type of ensembling is likely not practical in an industrial scenario due to the requirement of evaluating all 10 component models plus the ensemble model.
It is clear from the results that the inclusion of extracted text in the development of document classification models improves the quality and accuracy of predictions. The proposed method exceeds the current state-of-the-art for test accuracy on the RVL-CDIP dataset and sets a new standard for document classification methods to be compared to.
The work here only takes advantage of a bag-of-words approach to the text classification component, a further avenue for research could include extending the more recent embedding approaches to account for transcription errors.
V.1 RVL-CDIP Data Quality
The open RVL-CDIP dataset suffers from some data quality issues, namely duplicated images across sets (training, testing, and validation) and classes, i.e., the same image can occur across classes and sets. The most obvious example of this type of image is illustrated in Fig. 4. Although further study into the data quality of RVL-CDIP is required, the problem does not seem to be far-reaching, with an estimated upper bound of 2259 duplicate images. We arrived at this number by examining the unique texts extracted with Tesseract. A more thorough examination is required (potentially with an image hashing technique) to establish the true number of duplicated images.
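The estimate can be reproduced with a simple count of repeated OCR texts; the toy data below is illustrative, while the actual count was run over all extracted RVL-CDIP texts.

```python
from collections import Counter

texts = ["memo a", "invoice b", "memo a", "memo a", "letter c"]  # toy OCR outputs
counts = Counter(texts)
# Every copy beyond the first of an identical text counts as a duplicate
duplicate_upper_bound = sum(c - 1 for c in counts.values() if c > 1)
```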
| Class | Count of Duplicate Images |
| --- | --- |
- (1) A. W. Harley, A. Ufkes, and K. G. Derpanis, “Evaluation of deep convolutional nets for document image classification and retrieval,” in International Conference on Document Analysis and Recognition (ICDAR), 2015.
- (2) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, (USA), pp. 1097–1105, Curran Associates Inc., 2012.
- (3) C. Tensmeyer and T. Martinez, “Analysis of convolutional neural networks for document image classification,” 2017.
- (4) K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014.
- (5) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014.
- (6) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
- (7) M. Z. Afzal, A. Kolsch, S. Ahmed, and M. Liwicki, “Cutting the error by half: Investigation of very deep cnn and advanced training strategies for document image classification,” 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017.
- (8) A. Das, S. Roy, U. Bhattacharya, and S. K. Parui, “Document image classification with intra-domain transfer learning and stacked generalization of deep convolutional neural networks,” 2018.
- (9) K. Kowsari, K. J. Meimandi, M. Heidarysafa, S. Mendu, L. E. Barnes, and D. E. Brown, “Text classification algorithms: A survey,” CoRR, vol. abs/1904.08067, 2019.
- (10) A. Vinciarelli, “Noisy text categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 12, pp. 1882–1895, 2005.
- (11) S. Agarwal, S. Godbole, D. Punjani, and S. Roy, “How much noise is too much: A study in automatic text classification,” in Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 3–12, IEEE, 2007.
- (12) O. Augereau, N. Journet, A. Vialard, and J.-P. Domenger, “Improving classification of an industrial document image database by combining visual and textual features,” in 2014 11th IAPR International Workshop on Document Analysis Systems, pp. 314–318, IEEE, 2014.
- (13) N. Audebert, C. Herold, K. Slimani, and C. Vidal, “Multimodal deep networks for text and image-based document classification,” 2019.
- (14) D. Engin, E. Emekligil, M. Y. Akpınar, B. Oral, and S. Arslan, “Multimodal deep neural networks for banking document classification,”
- (15) https://www.cs.cmu.edu/~aharley/rvl-cdip/.
- (16) https://github.com/tesseract-ocr/tesseract.
- (17) R. Smith, “An overview of the tesseract ocr engine,” in Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 629–633, IEEE, 2007.
- (18) T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, (New York, NY, USA), pp. 785–794, ACM, 2016.
- (19) F. Chollet et al., “Keras.” https://keras.io, 2015.
- (20) M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
- (21) F. Pedregosa et al., “Scikit-learn: Machine Learning in Python ,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. Software available from scikit-learn.org.
- (22) https://github.com/NVIDIA/tensorflow-determinism.
- (23) I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” CoRR, vol. abs/1608.03983, 2016.
- (24) L. N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay,” CoRR, vol. abs/1803.09820, 2018.