Today, business documents (cf. Fig. 1) are often processed by document analysis systems (DAS) to reduce the human effort in routing them to the right person or in extracting information from them. One important task of a DAS is the classification of documents, i.e. determining which kind of business process a document refers to. Typical document classes are invoice, change of address, or claim. Document classification approaches can be grouped into image-based [afzal2015deepdocclassifier, lekang-14-a, harley2015icdar, doclass_Kumar12, doclass_Chen12, doclass_Kochi99, doclass_umamaheswara08] and content-based (OCR-based) approaches [tang2016bayesian, diab2017using, li1998classification] (see Section II). A DAS often includes both variants. Which approach is more suitable depends on the documents that the user processes. Free-form documents such as ordinary letters normally need content-based classification, whereas forms that contain the same text in different layouts can be distinguished by image-based approaches.
However, it is not always known in advance which category a document belongs to, which makes it difficult to choose between image-based and content-based methods. In general, the image-based approach, which works directly on digitized images, is preferred. Due to the diversity of document image classes, there exist classes with a high intra-class variance and classes with a low inter-class variance, as shown in Fig. 2 and Fig. 3, respectively. Hence, it is difficult to come up with handcrafted features that are generic for document image classification.
With the increasing performance of convolutional neural networks (CNNs) in recent years, it has become more straightforward to classify images directly without extracting handcrafted features from segmented objects [afzal2015deepdocclassifier, lekang-14-a, harley2015icdar]. However, these approaches are time-consuming, at least during training. This means that it may take hours before users get feedback on whether the chosen classification approach works in their case. In addition, self-learning DAS that train incrementally based on the user’s feedback will not provide a good user experience, because it simply takes too long until the system improves while working with it. The question is whether there is an image-based approach for document classification that is efficient in both classification and training.
In this paper, we propose to use Extreme Learning Machines (ELMs), which provide real-time training. In order to overcome both the hassle of manual feature extraction and the long training time, we devise a two-stage process that combines the automatic feature learning of deep CNNs with efficient ELMs. The first phase is the training of a deep neural network that will be used as a feature extractor. In the second phase, ELMs are employed for the final classification. ELMs differ in their nature from other neural networks (see Section III). The work presented in this paper shows that it takes a millisecond on average to train on one image, hence achieving real-time performance. This makes these networks well-suited for use in an incremental learning framework.
The rest of the paper is organized as follows: Section II describes the related work in the field of document classification. A theoretical background on ELMs is given in Section III. In Section IV the proposed combination of a deep CNN and an ELM is described in detail. Section V explains how the experiments are performed and presents the results. Section VI concludes the paper and gives perspectives for future work.
II Related Work
In recent years, a variety of methods has been proposed for document image classification. These methods can be grouped into three categories. The first category utilizes the layout/structural similarity of the document images. It is time-consuming to first extract the basic document components and then use them for classification. The work in the second category focuses on developing local and/or global image descriptors. These descriptors are then used for document classification. Extracting local and global features is also a fairly time-consuming process. Lastly, the methods from the third category use CNNs to automatically learn and extract features from the document images, which are then classified. Nevertheless, in this approach too, the training process is very time-consuming, even when using GPUs. In the following, we give a brief overview of closely related approaches belonging to the three categories mentioned above.
Dengel and Dubiel [doclass_Dengel95] used decision trees to map the layout structure of printed letters into a complementary logical structure. Bagdanov and Worring [doclass_Bagdanov2001] present a classification method for machine-printed documents that uses Attributed Relational Graphs (ARGs). Byun and Lee [doclass_Byun2000] and Shin and Doermann [doclass_shin] used layout and structural similarity methods for document matching, whereas Collins-Thompson and Nickolov [Collins-thompson02aclustering-based] combined both text-based and layout-based features.
In 2012, Kumar et al. [doclass_Kumar12] proposed a method for document classification that relies on codewords derived from patches of the document images. The codebook is learned in an unsupervised way on the documents. To do that, the approach recursively partitions the image into patches and models the spatial relationships between the patches using histograms of patch-codewords. Two years later, the same authors presented another method which builds a codebook of SURF descriptors of the document images [doclass_Kumar14]. As in their first paper, these features are then used for classification. Chen et al. [doclass_Chen12] proposed a method which uses low-level image features to classify documents. However, their approach is limited to structured documents. Kochi and Saitoh [doclass_Kochi99] presented a method that relies on pre-defined knowledge of the document classes. The approach uses models for each class of documents and classifies new documents based on their similarity to the models. Reddy and Govindaraju [doclass_umamaheswara08] used pixel information from binary images for the classification of form documents. Their method uses the k-means algorithm to classify the images based on their pixel density.
Most important for this work are the CNN-based approaches by Kang et al. [lekang-14-a], Harley et al. [harley2015icdar] and Afzal et al. [afzal2015deepdocclassifier]. Kang et al. were the first to use CNNs for document classification. Even though they used a shallow network due to limited training data, their approach outperformed structural-similarity-based methods on the Tobacco-3482 dataset [lekang-14-a]. Afzal et al. [afzal2015deepdocclassifier] and Harley et al. [harley2015icdar] showed a great improvement in accuracy by applying transfer learning from the domain of real-world images to the domain of document images, thus making it possible to use deep CNN architectures even with limited training data. With their approach, they significantly outperformed the state of the art at that time. Furthermore, Harley et al. [harley2015icdar] introduced the RVL-CDIP dataset, which provides a large-scale dataset for document classification and allows for training CNNs from scratch.
While deep CNN-based approaches have advanced significantly in recent years and are the current state of the art, the training of these networks is very time-consuming. The approach presented in this paper belongs to the third category, but overcomes the issue of long training time. To allow for real-time training while using the state-of-the-art performance of deep CNNs, our approach uses a combination of CNNs [cnn_alexnet_nips2014] and ELMs [huang2004extreme, huang2006extreme].
III Extreme Learning Machines
ELM is an algorithm that is used to train single-hidden-layer feedforward networks (SLFNs) [huang2006extreme, huang2004extreme]. The major idea behind ELM is to mimic biological behaviour. While general neural network training uses backpropagation to adjust parameters, i.e. weights, this step is not required for ELMs. An ELM learns by updating weights in two distinct but sequential stages. These stages are random feature mapping and least-squares fitting. In the first stage, the weights between the input and the hidden layers are randomly initialized. In the second stage, a linear least-squares optimization is performed, and therefore no backpropagation is required. The point that distinguishes ELM from other learning algorithms is the mapping of input features into a random space followed by learning in that stage.
In a supervised learning setting, each input sample has a corresponding class label. Let and be the input sample and corresponding label, respectively. Let and be the sets of examples, represented as follows, where and are the input and target vectors of and dimensions, respectively. Supervised classification searches for a function that maps the input vector to the target vector. While there are many sophisticated forms of such functions [kotsiantis2007supervised], one simple and effective function is the single hidden layer feed-forward network (SLFN). With respect to the setting described above, a single-layer network with hidden nodes can be depicted as follows
where is the weight matrix connecting the hidden node and the input nodes, is the weight vector that connects the th node to the output and is the bias. The function represents an activation function that could be, , etc.
The above was the description of SLFNs. For ELMs, the weights between the input and the hidden nodes are randomly initialized. In the second stage, the parameters connecting the hidden and the output layer are optimized using regularized linear least squares. Let be the response vector from the hidden layer to input and be the output parameter connecting the hidden and output layer. ELM minimizes the following sum of squared losses.
The second term in Eq. 2 is the regularizer to avoid overfitting and is the trade-off coefficient. By concatenating and we get the following well-known optimization problem called ridge regression.
The above mentioned problem is convex and constrained by the following linear system
This linear system could be solved using numerical methods for obtaining optimal
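For reference, the formulation described in this section can be sketched in common ELM notation. All symbols below are assumptions chosen for illustration and need not match the notation of the original equations: $N$ training samples $x_k \in \mathbb{R}^{d}$ with targets $t_k \in \mathbb{R}^{c}$, $L$ hidden nodes with random weights $w_j$ and biases $b_j$, an activation function $g$, output weights $\beta \in \mathbb{R}^{L \times c}$ with rows $\beta_j^{\top}$, the hidden-layer response matrix $H \in \mathbb{R}^{N \times L}$ with rows $h(x_k)$, the target matrix $T \in \mathbb{R}^{N \times c}$ with rows $t_k^{\top}$, and a trade-off coefficient $C$.

```latex
\begin{align}
f_L(x) &= \sum_{j=1}^{L} \beta_j \, g(w_j \cdot x + b_j), \\
h(x) &= \bigl[\, g(w_1 \cdot x + b_1), \dots, g(w_L \cdot x + b_L) \,\bigr] \in \mathbb{R}^{1 \times L}, \\
\min_{\beta} \;& \frac{C}{2} \sum_{k=1}^{N} \bigl\| t_k^{\top} - h(x_k)\,\beta \bigr\|^{2} + \frac{1}{2}\,\|\beta\|^{2}
  \;=\; \min_{\beta} \; \frac{C}{2}\,\| T - H\beta \|^{2} + \frac{1}{2}\,\|\beta\|^{2}, \\
0 &= \beta - C\,H^{\top}\,(T - H\beta)
  \quad\Longrightarrow\quad
  \beta^{*} = \Bigl( H^{\top} H + \tfrac{1}{C}\, I \Bigr)^{-1} H^{\top} T .
\end{align}
```

In this sketch the second term of the objective is the regularizer, the matrix form is the ridge regression problem, and the last line is the linear system obtained by setting the gradient with respect to $\beta$ to zero.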
IV Deep CNN and ELM for Document Image Classification
This section presents in detail the mixed CNN and ELM architecture and the learning methodology of the developed classification system.
The method presented in this paper does not utilize document features that require a high resolution, such as optical character recognition. Instead, it solely relies on the structure and layout of the input documents to classify them. Therefore, in a preprocessing step, the high-resolution images are downscaled to a lower resolution of which is the input size of the CNN.
The common approach to successfully train CNNs for object recognition is to augment the training data by resizing the images to a larger size and to then randomly crop areas from these images [cnn_alexnet_nips2014]. This data augmentation technique has proven to be effective for networks trained on the ImageNet dataset, where the most discriminating elements of the images are typically located close to the center of the image and are therefore contained in all crops. However, with this technique, the network is effectively presented with less than of the original image. We intentionally do not augment our training data in this way, because in document classification, the most discriminating parts of document images often reside in the outer regions of the document, e.g. the head of a letter.
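The effect of random cropping on the visible image area can be checked with a short calculation. The sketch below assumes the sizes used in the AlexNet paper (images resized to 256×256, crops of 224×224); these values are taken from that paper as an assumption and are not stated in this text, and the function name is hypothetical.

```python
def crop_area_fraction(resized_side: int, crop_side: int) -> float:
    """Fraction of a square image's area that is covered by one square crop."""
    if not 0 < crop_side <= resized_side:
        raise ValueError("crop_side must be in (0, resized_side]")
    return (crop_side / resized_side) ** 2


# Assumed sizes from the AlexNet paper, not from this text.
fraction = crop_area_fraction(resized_side=256, crop_side=224)
print(f"{fraction:.1%} of the resized image is visible in a single crop")
```

Under these assumed sizes each crop leaves out roughly a quarter of the image area, which is the part where letterheads and similar layout cues tend to sit.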
As a second preprocessing step, we subtract the mean values of the training images from both the training and the validation images.
Lastly, we convert the grayscale document images to RGB images, i.e. we copy the values of the single-channel images to generate three-channel images.
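The three preprocessing steps can be sketched with NumPy as follows. This is a minimal illustration under several assumptions: nearest-neighbour sampling stands in for a proper image resize, the mean is taken per pixel over the training images (a single scalar mean would be an equally plausible reading), and the names `downscale_nearest`, `preprocess` and `side` are hypothetical.

```python
import numpy as np


def downscale_nearest(img, out_h, out_w):
    """Nearest-neighbour downscaling of a 2-D grayscale image (stand-in for a real resize)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols[None, :]]


def preprocess(train_imgs, val_imgs, side):
    """Downscale, subtract the training mean, and replicate gray values to 3 channels."""
    train = np.stack([downscale_nearest(i, side, side) for i in train_imgs]).astype(np.float32)
    val = np.stack([downscale_nearest(i, side, side) for i in val_imgs]).astype(np.float32)
    mean = train.mean(axis=0)  # computed on the training images only

    def to_rgb(batch):
        centered = batch - mean                                 # (N, H, W)
        return np.repeat(centered[:, None, :, :], 3, axis=1)    # (N, 3, H, W)

    return to_rgb(train), to_rgb(val)


# Synthetic grayscale "documents" used purely for illustration.
rng = np.random.default_rng(0)
train_raw = [rng.integers(0, 256, size=(100, 80)) for _ in range(4)]
val_raw = [rng.integers(0, 256, size=(100, 80)) for _ in range(2)]
train, val = preprocess(train_raw, val_raw, side=32)
print(train.shape, val.shape)
```

The validation images are centered with the training mean rather than their own, so no statistics leak from the validation set into the preprocessing.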
IV-B Network Architecture
The deep CNN architecture proposed in this paper is based on the AlexNet architecture [cnn_alexnet_nips2014]. It consists of five convolutional layers which are followed by an Extreme Learning Machine (ELM) [huang2006extreme].
As in the original AlexNet architecture, we get feature maps of size after the last max-pooling layer (cf. Fig. 4).
While AlexNet uses multiple fully-connected layers to classify the generated feature maps, we propose to use a single-layer ELM.
The weights of the convolutional layers are pretrained on a large dataset as a full AlexNet, i.e. with three subsequent fully-connected layers and standard backpropagation. After the training has converged, the fully-connected layers are discarded and the convolutional layers are fixed to work as a feature extractor. The feature vectors extracted by the CNN stub then provide the input vectors for the ELM training and testing (cf. Fig. 4).
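The second stage, training an ELM on fixed feature vectors, can be sketched in a few lines of NumPy. This is an illustrative sketch and not the authors' implementation: the class name `ELM`, the sigmoid activation, the scaling of the random weights, the values of `n_hidden` and `C`, and the synthetic Gaussian features that stand in for CNN activations are all assumptions.

```python
import numpy as np


class ELM:
    """Single-hidden-layer ELM: random hidden layer plus ridge-regression output layer."""

    def __init__(self, n_hidden, C=1.0, seed=0):
        self.n_hidden, self.C = n_hidden, C
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation is an assumption; other nonlinearities work as well.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y, n_classes):
        d = X.shape[1]
        # Stage 1: random feature mapping (drawn once, never trained).
        self.W = self.rng.normal(size=(d, self.n_hidden)) / np.sqrt(d)
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        T = np.eye(n_classes)[y]  # one-hot targets
        # Stage 2: regularized least squares, (H^T H + I / C) beta = H^T T.
        A = H.T @ H + np.eye(self.n_hidden) / self.C
        self.beta = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)


# Synthetic stand-in for CNN feature vectors: one Gaussian cluster per class.
rng = np.random.default_rng(42)
n_classes, dim = 3, 20
centers = 6.0 * np.eye(n_classes, dim)
y_train = rng.integers(0, n_classes, size=300)
y_test = rng.integers(0, n_classes, size=200)
X_train = centers[y_train] + rng.normal(size=(300, dim))
X_test = centers[y_test] + rng.normal(size=(200, dim))

elm = ELM(n_hidden=100, C=10.0).fit(X_train, y_train, n_classes)
accuracy = float(np.mean(elm.predict(X_test) == y_test))
print(f"test accuracy on synthetic features: {accuracy:.3f}")
```

Training reduces to one matrix product and one linear solve, which is why no iterative backpropagation is needed for the classifier stage.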
IV-C Training Details
As already stated, we train a full AlexNet on a large dataset to provide a useful feature extractor for the ELM and then train the ELM on the target dataset. Specifically, we train AlexNet on a dataset, which contains images from classes. Therefore, the number of neurons in the last fully-connected layer of AlexNet is changed from to .
All but the last network layer are initialized with an AlexNet model (https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet) that was pretrained on ImageNet. The training is performed using stochastic gradient descent with a batch size of, an initial learning rate of , a momentum of and a weight decay of . To prevent overfitting, the sixth and seventh layers are configured to use a dropout ratio of . After epochs, the training process is finished. The Caffe framework [jia2014caffe] is used to train this model.
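To make the roles of learning rate, momentum and weight decay concrete, the following sketch shows one common formulation of the SGD update with momentum and L2 weight decay. It is not taken from the training configuration described above: the function name is hypothetical, and the hyperparameter values and the toy quadratic objective are placeholders chosen only so that the example runs.

```python
def sgd_momentum_step(w, grad, velocity, lr, momentum, weight_decay):
    """One SGD update with momentum and L2 weight decay for a list of float weights."""
    new_velocity = [
        momentum * v - lr * (g + weight_decay * wi)  # decay acts like an extra gradient term
        for wi, g, v in zip(w, grad, velocity)
    ]
    new_w = [wi + v for wi, v in zip(w, new_velocity)]
    return new_w, new_velocity


# Toy problem: minimize 0.5 * ||w - target||^2 (placeholder objective).
target = [1.0, -2.0]
w, v = [0.0, 0.0], [0.0, 0.0]
for _ in range(300):
    grad = [wi - ti for wi, ti in zip(w, target)]
    # Placeholder hyperparameters, not the values used in the paper.
    w, v = sgd_momentum_step(w, grad, v, lr=0.1, momentum=0.9, weight_decay=0.001)
print(w)
```

With weight decay the iterate settles slightly short of `target`, at `target / (1 + weight_decay)`, which illustrates how the decay term pulls weights toward zero.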
The ELMs are trained and evaluated on the Tobacco-3482 dataset [doclass_Kumar14] which contains images from classes. The images are passed through the CNN stub and the activations of the fifth pooling layer are presented to the ELM (cf. Fig. 4).
V Experiments and Results
In this paper, two datasets are used. First, we use the Ryerson Vision Lab Complex Document Information Processing (RVL-CDIP) dataset [harley2015icdar] to train a full AlexNet. This dataset contains images which are evenly distributed across classes. of the images are dedicated for training, images are each dedicated for validation and testing.
Secondly, we use the Tobacco-3482 dataset [doclass_Kumar14] to train the presented ELM and evaluate its performance. This dataset contains images from ten document classes.
As there exists some overlap between the two datasets, we exclude the images that are contained in both datasets from the large dataset. Therefore, AlexNet is not trained on but only on images.
V-B Evaluation Scheme
To allow for a fair comparison with other approaches on the Tobacco-3482 dataset, we use an evaluation protocol similar to that of Kang et al. [lekang-14-a] and Harley et al. [harley2015icdar]. Specifically, we conduct several experiments with different training datasets. We only use subsets of the Tobacco-3482 dataset for training ranging from images per class to images per class. The remaining images are used for testing. Since the dataset is so small, for each of these dataset splits, we randomly create ten different partitions to train and evaluate our classifiers and report the median performance. Note that the ELMs are not optimized on a validation set. Thus, no validation set is needed.
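The evaluation scheme can be sketched as follows: draw a fixed number of training images per class at random, test on the rest, repeat for ten partitions and report the median. The helper names, the `train_and_score` callback and the toy label list are hypothetical and serve only to illustrate the protocol.

```python
import random
import statistics
from collections import defaultdict


def split_per_class(labels, n_train_per_class, rng):
    """Pick n_train_per_class random indices per class for training; the rest is test."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train = []
    for indices in by_class.values():
        train.extend(rng.sample(indices, n_train_per_class))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return train, test


def median_accuracy(labels, n_train_per_class, train_and_score, n_partitions=10, seed=0):
    """Median score over several random partitions of the same size."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_partitions):
        train, test = split_per_class(labels, n_train_per_class, rng)
        scores.append(train_and_score(train, test))
    return statistics.median(scores)


# Toy data: 200 samples spread evenly over ten classes.
labels = [i % 10 for i in range(200)]
train_idx, test_idx = split_per_class(labels, 5, random.Random(1))
# Dummy scorer that only reports the test fraction; a real one would train a classifier.
dummy = median_accuracy(labels, 5, lambda tr, te: len(te) / len(labels))
print(len(train_idx), len(test_idx), dummy)
```

Because no validation set is carved out, every image that is not drawn for training ends up in the test set of that partition.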
|Structural methods [afzal2015deepdocclassifier]|
|Classifier||Training time||Testing time|
|AlexNet (GPU)||10 min, 34 sec||3480 ms|
|AlexNet-ELM (GPU)||1176 ms||3066 ms|
|AlexNet (CPU)||6 h, 44 min, 8 sec||4 min, 30 sec|
|AlexNet-ELM (CPU)||1 min, 26 sec||4 min, 19 sec|
As a first and baseline experiment, we train AlexNet which is pretrained on ImageNet, on the Tobacco-3482 dataset as was already done by Afzal et al. [afzal2015deepdocclassifier]. As described above, we train multiple versions of the network with different partitions per training data size. In total, 100 networks are trained, i.e. networks on each , , …, training images per class. The training datasets for these experiments are further subdivided into a dataset for actual training () and a dataset for validation () (cf. [harley2015icdar]).
Secondly, we train an ImageNet-initialized AlexNet on images of the RVL-CDIP corpus and discard the fully-connected part of the network. The network stub is used as a feature extractor to train and test our ELMs. The ELMs are trained on the Tobacco-3482 dataset as described in Section V-B. As these networks depend on random initialization, we train ELMs for each of the partitions and report the mean accuracy for each partition size.
The performance of our proposed classifiers in comparison to the current state of the art is depicted in Fig. 5. As can be seen, the ELM classifier with document pretraining already outperforms the current state of the art with as little as training samples per class. With training samples per class, the test accuracy can be increased from to (cf. Table I) which corresponds to an error reduction of more than .
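The notion of error reduction used here can be made explicit: it relates the drop in error rate to the original error rate. The sketch below uses purely hypothetical accuracies of 80% and 90% as an example; they are not results from this paper, and the function name is an assumption.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Relative reduction of the error rate when accuracy rises from acc_before to acc_after."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    if err_before <= 0.0:
        raise ValueError("acc_before must be below 1.0")
    return (err_before - err_after) / err_before


# Hypothetical accuracies for illustration only.
reduction = relative_error_reduction(acc_before=0.80, acc_after=0.90)
print(f"relative error reduction: {reduction:.0%}")
```

In this hypothetical case a gain of ten accuracy points halves the error, which is why error reduction can look much larger than the raw accuracy difference.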
Together with the exceptional performance boost, the runtime needed for both training and testing is reduced (cf. Table II). Especially in the case of GPU-accelerated training, the proposed approach is more than times faster than the current state of the art. For both training and testing, the combined CNN/ELM approach needs about ms per image, thus making it real-time. As more than of the total runtime is used for the feature extraction, a different CNN architecture could speed this up even further.
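The speed-up factors implied by the runtimes listed in Table II can be recomputed directly. The sketch assumes that the first time column of the table is the training time and the second is the testing time; the helper names are hypothetical.

```python
def to_ms(hours=0, minutes=0, seconds=0, ms=0):
    """Convert a duration to milliseconds."""
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + ms


# (training, testing) runtimes in ms, read from Table II under the column assumption above.
runtimes_ms = {
    "AlexNet (GPU)": (to_ms(minutes=10, seconds=34), to_ms(ms=3480)),
    "AlexNet-ELM (GPU)": (to_ms(ms=1176), to_ms(ms=3066)),
    "AlexNet (CPU)": (to_ms(hours=6, minutes=44, seconds=8), to_ms(minutes=4, seconds=30)),
    "AlexNet-ELM (CPU)": (to_ms(minutes=1, seconds=26), to_ms(minutes=4, seconds=19)),
}


def speedup(baseline, proposed, phase):
    """Ratio of baseline to proposed runtime; phase 0 is training, phase 1 is testing."""
    return runtimes_ms[baseline][phase] / runtimes_ms[proposed][phase]


gpu_train = speedup("AlexNet (GPU)", "AlexNet-ELM (GPU)", 0)
cpu_train = speedup("AlexNet (CPU)", "AlexNet-ELM (CPU)", 0)
gpu_test = speedup("AlexNet (GPU)", "AlexNet-ELM (GPU)", 1)
print(f"GPU training: {gpu_train:.0f}x, CPU training: {cpu_train:.0f}x, GPU testing: {gpu_test:.2f}x")
```

The training ratios are in the hundreds while the testing ratio stays close to one, which matches the observation that most of the remaining runtime is spent on feature extraction.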
The ELM classifier with ImageNet pretraining achieves an accuracy which is comparable to that of the current state-of-the-art at a fraction of the computational costs.
Note that AlexNet pretrained on the RVL-CDIP dataset and fine-tuned on the Tobacco-3482 dataset achieves even better performance in terms of accuracy. However, as this would be as slow as the current state of the art, it is not in the scope of this paper. The main idea of this work is to provide a fast and accurate classifier.
A confusion matrix of an exemplary ELM classifier which was trained on images per class is shown in Fig. 6. As can be seen, the class Scientific is by far the hardest to recognize. This result is consistent with Afzal et al. [afzal2015deepdocclassifier] and can be explained by low inter-class variance between the classes Scientific and Report.
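How a confusion matrix exposes the hardest class can be illustrated with a small sketch. The labels and predictions below are made-up toy values with three classes, not data from Fig. 6, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Toy labels: class 1 is frequently confused with class 2.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 2, 2, 2, 2, 2, 1]
matrix = confusion_matrix(y_true, y_pred, n_classes=3)
recall = per_class_recall(matrix)
hardest = min(range(len(recall)), key=recall.__getitem__)
print(matrix, recall, hardest)
```

A class with low recall whose errors concentrate in one off-diagonal cell points to a pair of classes that look alike, which is the pattern described for Scientific and Report.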
V-E Experiments with deeper architectures
For completeness, we also conduct the described experiments with GoogLeNet [szegedy2015going] and ResNet-50 [he2016deep] as underlying network architectures. As depicted in Fig. 5, the networks perform extremely well.
However, since both of these architectures have only one fully connected layer for classification which is replaced by the ELM, there is no runtime improvement at inference time, but only at training time. Furthermore, due to the depth of these models, we have to drastically reduce the batch size which decreases the degree of parallelism and makes these approaches not viable for real-time training with a single GPU.
VI Conclusion and Future Work
We have addressed the problem of real-time training for document image classification. In particular, we present a document classification approach that trains in real-time, i.e. at a millisecond per image, and outperforms the current state of the art by a large margin. We suggest a two-stage approach which uses feature extraction from deep neural networks and efficient training using ELMs. The latter stage leads to superior performance in terms of efficiency. Several quantitative evaluations show the power and potential of the proposed approach. This is a big leap forward for DAS that depend on quick system responses.
An interesting future direction is the fast extraction of image features, because in the presented approach, over % of the time is consumed by feature extraction from deep neural networks. Another future experiment is to benchmark the GoogLeNet and ResNet-50 based ELM classifiers on a high-performance cluster.