Real-Time Document Image Classification using Deep CNN and Extreme Learning Machines

11/03/2017 ∙ by Andreas Kölsch, et al. ∙ insiders-technologies, University of Fribourg

This paper presents an approach for real-time training and testing for document image classification. In production environments, it is crucial to perform accurate and (time-)efficient training. Existing deep learning approaches for classifying documents do not meet these requirements, as they require much time for training and fine-tuning the deep architectures. Motivated by work in computer vision, we propose a two-stage approach. The first stage trains a deep network that works as a feature extractor, and in the second stage, Extreme Learning Machines (ELMs) are used for classification. The proposed approach outperforms all previously reported structural and deep learning based methods with a final accuracy of 83.24%, leading to a relative error reduction of 25% compared to a previous Convolutional Neural Network (CNN) based approach (DeepDocClassifier). More importantly, the training time of the ELM is only 1.176 seconds and the overall prediction time for 2,482 images is 3.066 seconds. As such, this novel approach makes deep learning-based document classification suitable for large-scale real-time applications.




I Introduction

Today, business documents (cf. Fig. 1) are often processed by document analysis systems (DAS) to reduce the human effort in routing them to the right person or in extracting information from them. One important task of a DAS is the classification of documents, i.e. to determine which kind of business process the document refers to. Typical document classes are invoice, change of address, or claim. Document classification approaches can be grouped into image-based [afzal2015deepdocclassifier, lekang-14-a, harley2015icdar, doclass_Kumar12, doclass_Chen12, doclass_Kochi99, doclass_umamaheswara08] and content (OCR) based approaches [tang2016bayesian, diab2017using, li1998classification] (see Section II). DAS often include both variants. Which approach is more suitable often depends on the documents that are processed by the user. Free-form documents like usual letters normally need content-based classification, whereas forms that contain the same text in different layouts can be distinguished by image-based approaches.

However, it is not always known in advance what category the document belongs to. That is why it is difficult to choose between image-based and content-based methods. In general, the image-based approach, which works directly on digitized images, is preferred. Due to the diversity of the document image classes, there exist classes with a high intra-class and low inter-class variance, which is shown in Fig. 2 and Fig. 3 respectively. Hence it is difficult to come up with handcrafted features that are generic for document image classification.

With the increasing performance of convolutional neural networks (CNNs) during the last years, it is more straightforward to classify images directly without extracting handcrafted features from segmented objects [afzal2015deepdocclassifier, lekang-14-a, harley2015icdar]. However, these approaches are time-consuming, at least during the training process. This means that it may take hours before users get feedback on whether the chosen classification approach works in their case. In addition, self-learning DAS that train incrementally based on the user's feedback will not provide a good user experience, because it takes too long until the system improves while working with it. The question is whether there is an image-based approach for document classification that is efficient in both classification and training.

Figure 1: Sample images from different classes of the Tobacco-3482 dataset.

Figure 2: Documents from the Advertisement class of the Tobacco-3482 dataset showing a high intra-class variance.

In this paper, we propose to use Extreme Learning Machines (ELMs), which provide real-time training. In order to overcome both the hassle of manual feature extraction and the long training time, we devise a two-stage process that combines the automatic feature learning of deep CNNs with efficient ELMs. The first phase is the training of a deep neural network that will be used as a feature extractor. In the second phase, ELMs are employed for the final classification. ELMs are different in their nature from other neural networks (see Section III). The work presented in this paper shows that it takes a millisecond on average to train on one image, hence showing real-time performance. This fact makes these networks also well-suited for usage in an incremental learning framework.

The rest of the paper is organized as follows: Section II describes the related work in the field of document classification. A theoretical background on ELMs is given in Section III. In Section IV the proposed combination of a deep CNN and an ELM is described in detail. Section V explains how the experiments are performed and presents the results. Section VI concludes the paper and gives perspectives for future work.

II Related Work

In recent years, a variety of methods has been proposed for document image classification. These methods can be grouped into three categories. The first category utilizes the layout/structural similarity of the document images. It is time-consuming to first extract the basic document components and then use them for classification. The work in the second category focuses on developing local and/or global image descriptors. These descriptors are then used for document classification. Extracting local and global features is also a fairly time-consuming process. Lastly, the methods from the third category use CNNs to automatically learn and extract features from the document images, which are then classified. Nevertheless, in this approach too, the training process is very time-consuming, even using GPUs. In the following, we give a brief overview of closely related approaches belonging to the three categories mentioned above.

Figure 3: Documents from different classes (Email, Letter, Memo, Report, Resume and Scientific) of the Tobacco-3482 dataset showing a low inter-class variance.

Dengel and Dubiel [doclass_Dengel95] used decision trees to map the layout structure of printed letters into a complementary logical structure. Bagdanov and Worring [doclass_Bagdanov2001] presented a classification method for machine-printed documents that uses Attributed Relational Graphs (ARGs). Byun and Lee [doclass_Byun2000] and Shin and Doermann [doclass_shin] used layout and structural similarity methods for document matching, whereas Collins-Thompson and Nickolov [Collins-thompson02aclustering-based] combined both text and layout based features.

In 2012, Kumar et al. [doclass_Kumar12] proposed a method for document classification that relies on codewords derived from patches of the document images. The codebook is learned in an unsupervised way on the documents. To do that, the approach recursively partitions the image into patches and models the spatial relationships between the patches using histograms of patch-codewords. Two years later, the same authors presented another method which builds a codebook of SURF descriptors of the document images [doclass_Kumar14]. In a similar way as in their first paper, these features are then used for classification. Chen et al. [doclass_Chen12] proposed a method which uses low-level image features to classify documents. However, their approach is limited to structured documents. Kochi and Saitoh [doclass_Kochi99] presented a method that relies on pre-defined knowledge of the document classes. The approach uses models for each class of documents and classifies new documents based on their similarity to the models. Reddy and Govindaraju [doclass_umamaheswara08] used pixel information from binary images for the classification of form documents. Their method uses the k-means algorithm to classify the images based on their pixel density.

Most important for this work are the CNN based approaches by Kang et al. [lekang-14-a], Harley et al. [harley2015icdar] and Afzal et al. [afzal2015deepdocclassifier]. Kang et al. were the first to use CNNs for document classification. Even though they used a shallow network due to limited training data, their approach outperformed structural similarity based methods on the Tobacco-3482 dataset [lekang-14-a]. Afzal et al. [afzal2015deepdocclassifier] and Harley et al. [harley2015icdar] showed a great improvement in accuracy by applying transfer learning from the domain of real-world images to the domain of document images, thus making it possible to use deep CNN architectures even with limited training data. With their approach, they significantly outperformed the state-of-the-art at that time. Furthermore, Harley et al. [harley2015icdar] introduced the RVL-CDIP dataset, which provides a large-scale dataset for document classification and allows for training CNNs from scratch.

While deep CNN based approaches have advanced significantly in the last years and are the current state-of-the-art, the training of these networks is very time-consuming. The approach presented in this paper belongs to the third category, but overcomes the issue of long training time. To allow for real-time training while retaining the state-of-the-art performance of deep CNNs, our approach uses a combination of a CNN [cnn_alexnet_nips2014] and an ELM [huang2004extreme, huang2006extreme].

III Extreme Learning Machines

ELM is an algorithm that is used to train Single Layer Feedforward Networks (SLFNs) [huang2006extreme, huang2004extreme]. The major idea behind ELM is mimicking biological behaviour. While general neural network training uses backpropagation to adjust parameters, i.e. weights, this step is not required for ELMs. An ELM learns in two distinct but sequential stages: random feature mapping and least-squares fitting. In the first stage, the weights between the input and the hidden layers are randomly initialized. In the second stage, a linear least-squares optimization is performed, and therefore no backpropagation is required. The point that distinguishes ELM from other learning algorithms is the mapping of input features into a random space followed by learning in that space.

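In a supervised learning setting each input sample has a corresponding class label. Let $x$ and $y$ be the input sample and corresponding label respectively. Let $X$ and $Y$ be the sets of $N$ examples, represented as $\{X, Y\} = \{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ and $y_i \in \mathbb{R}^{c}$ are the $i$th input and target vectors of $d$ and $c$ dimensions respectively. The supervised classification searches for a function that maps the input vector to the target vector. While there are many sophisticated forms of such functions [kotsiantis2007supervised], one simple and effective function is the single hidden layer feed-forward network (SLFN). With respect to the setting described above, a single layer network with $L$ hidden nodes can be depicted as follows

$$f(x) = \sum_{i=1}^{L} \beta_i \, \sigma(w_i^{T} x + b_i) \tag{1}$$

where $w_i$ is the weight vector connecting the $i$th hidden node and the input nodes, $\beta_i$ is the weight vector that connects the $i$th node to the output and $b_i$ is the bias. The function $\sigma$ represents an activation function that could be a sigmoid, a hyperbolic tangent, etc.

The above was the description of SLFNs. For ELMs, the weights between the input and the hidden nodes are randomly initialized. In the second stage, the parameters connecting the hidden and the output layer are optimized using regularized linear least squares. Let $h(x_i) \in \mathbb{R}^{1 \times L}$ be the response vector from the hidden layer to input $x_i$ and $\beta \in \mathbb{R}^{L \times c}$ be the output parameter connecting the hidden and output layer. ELM minimizes the following sum of the squared losses.

$$\min_{\beta} \; \frac{C}{2} \sum_{i=1}^{N} \lVert e_i \rVert^{2} + \frac{1}{2} \lVert \beta \rVert^{2} \quad \text{s.t.} \quad h(x_i)\beta = y_i^{T} - e_i^{T}, \quad i = 1, \dots, N \tag{2}$$

Here $e_i \in \mathbb{R}^{c}$ is the error vector of the $i$th sample. The second term in Eq. 2 is the regularizer to avoid overfitting and $C$ is the trade-off coefficient. By concatenating $H = [h(x_1)^{T}, \dots, h(x_N)^{T}]^{T}$ and $Y = [y_1, \dots, y_N]^{T}$ we get the following well known optimization problem called ridge regression.

$$\min_{\beta} \; \frac{C}{2} \lVert Y - H\beta \rVert^{2} + \frac{1}{2} \lVert \beta \rVert^{2} \tag{3}$$

The above mentioned problem is convex and constrained by the following linear system

$$\beta + C H^{T} (H\beta - Y) = 0 \tag{4}$$

This linear system could be solved using numerical methods for obtaining the optimal

$$\beta^{*} = \left( H^{T} H + \frac{I}{C} \right)^{-1} H^{T} Y \tag{5}$$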
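The two ELM stages described in this section can be sketched in a few lines of NumPy. The block below is a minimal illustration, not the authors' implementation: the class name `ELM`, the hyperparameter values, the Gaussian initialization and the synthetic toy data are all assumptions. It draws the hidden-layer weights at random, keeps them fixed, and then obtains the output weights from the standard regularized least-squares (ridge regression) solution $\beta = (H^{T}H + I/C)^{-1}H^{T}Y$.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


class ELM:
    """Single-hidden-layer network trained with the closed-form ELM rule."""

    def __init__(self, n_hidden, C=1.0, seed=0):
        self.n_hidden = n_hidden  # number of hidden nodes
        self.C = C                # trade-off coefficient of the ridge problem
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Response of the random hidden layer, one row per sample.
        return sigmoid(X @ self.W + self.b)

    def fit(self, X, Y):
        n_features = X.shape[1]
        # Stage 1: random feature mapping; these weights are never updated.
        self.W = self.rng.normal(size=(n_features, self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Stage 2: regularized least squares for the output weights,
        # solved as a linear system instead of forming an explicit inverse.
        A = H.T @ H + np.eye(self.n_hidden) / self.C
        self.beta = np.linalg.solve(A, H.T @ Y)
        return self

    def predict(self, X):
        # Index of the output neuron with the largest score.
        return np.argmax(self._hidden(X) @ self.beta, axis=1)


# Toy usage on synthetic, well-separated two-class data (not document features).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 4)),
               rng.normal(+2.0, 0.5, size=(50, 4))])
labels = np.repeat([0, 1], 50)
Y = np.eye(2)[labels]  # one-hot target vectors
model = ELM(n_hidden=20, C=10.0).fit(X, Y)
train_accuracy = float(np.mean(model.predict(X) == labels))
```

Because the only trained quantity is `beta`, fitting reduces to one matrix product and one linear solve, which is what makes training fast compared with iterative backpropagation.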


IV Deep CNN and ELM for Document Image Classification

Figure 4: AlexNet is pretrained on the large scale dataset RVL-CDIP. Then, the fully-connected layers are replaced by an ELM and the other trained layers are copied to the new architecture.

This section presents in detail the mixed CNN and ELM architecture and the learning methodology of the developed classification system.

IV-A Preprocessing

The method presented in this paper does not utilize document features that require a high resolution, such as optical character recognition. Instead, it relies solely on the structure and layout of the input documents to classify them. Therefore, in a preprocessing step, the high-resolution images are downscaled to a lower resolution of $227 \times 227$, which is the input size of the CNN.

The common approach to successfully train CNNs for object recognition is to augment the training data by resizing the images to a larger size and then randomly cropping areas from these images [cnn_alexnet_nips2014]. This data augmentation technique has proven to be effective for networks trained on the ImageNet dataset, where the most discriminating elements of the images are typically located close to the center of the image and are therefore contained in all crops. However, with this technique, the network is effectively presented with only a part of the original image. We intentionally do not augment our training data in this way, because in document classification, the most discriminating parts of document images often reside in the outer regions of the document, e.g. the head of a letter.

As a second preprocessing step, we subtract the mean values of the training images from both the training and the validation images.

Lastly, we convert the grayscale document images to RGB images, i.e. we copy the values of the single-channel images to generate three-channel images.
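The three preprocessing steps (downscaling, mean subtraction, channel replication) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, `INPUT_SIZE = 227` is assumed as the side length of the CNN input, nearest-neighbour sampling stands in for whatever interpolation was actually used, and the random arrays merely play the role of scanned pages.

```python
import numpy as np

INPUT_SIZE = 227  # assumption: side length of the square CNN input


def downscale_nearest(image, size):
    """Resize a 2-D grayscale image to (size, size) by nearest-neighbour sampling."""
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows[:, None], cols[None, :]]


def training_mean(train_images, size=INPUT_SIZE):
    """Per-pixel mean of the downscaled training images."""
    small = [downscale_nearest(img, size).astype(np.float32) for img in train_images]
    return np.mean(small, axis=0)


def preprocess(images, mean_image, size=INPUT_SIZE):
    """Downscale, subtract the training mean, and replicate to three channels."""
    out = []
    for img in images:
        small = downscale_nearest(img, size).astype(np.float32)
        centered = small - mean_image
        out.append(np.stack([centered] * 3, axis=0))  # shape (3, size, size)
    return np.stack(out)


# Synthetic stand-ins for high-resolution grayscale scans.
rng = np.random.default_rng(0)
train = [rng.integers(0, 256, size=(1000, 800), dtype=np.uint8) for _ in range(4)]
val = [rng.integers(0, 256, size=(900, 700), dtype=np.uint8) for _ in range(2)]

mean_image = training_mean(train)          # computed on training images only
train_batch = preprocess(train, mean_image)
val_batch = preprocess(val, mean_image)    # same mean reused for validation
```

Note that the mean is estimated on the training images alone and then reused unchanged for the validation images, matching the order of operations described above.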

IV-B Network Architecture

The deep CNN architecture proposed in this paper is based on the AlexNet architecture [cnn_alexnet_nips2014]. It consists of five convolutional layers which are followed by an Extreme Learning Machine (ELM) [huang2006extreme].

As in the original AlexNet architecture, we get 256 feature maps of size $6 \times 6$ after the last max-pooling layer (cf. Fig. 4). While AlexNet uses multiple fully-connected layers to classify the generated feature maps, we propose to use a single-layer ELM.

The weights of the convolutional layers are pretrained on a large dataset as a full AlexNet, i.e. with three subsequent fully-connected layers and standard backpropagation. After the training has converged, the fully-connected layers are discarded and the convolutional layers are fixed to work as a feature extractor. The feature vectors extracted by the CNN stub then provide the input vectors for the ELM training and testing (cf. Fig. 4).

The ELM used in this architecture is a single-layer feed-forward neural network. Its output layer has 10 neurons, as the target dataset has 10 classes. The neurons use the sigmoid activation function.

IV-C Training Details

As already stated, we train a full AlexNet on a large dataset to provide a useful feature extractor for the ELM and then train the ELM on the target dataset. Specifically, we train AlexNet on the RVL-CDIP dataset (cf. Section V-A), whose images are distributed across 16 classes. Therefore, the number of neurons in the last fully-connected layer of AlexNet is changed from 1000 to 16.

All but the last network layer are initialized with an AlexNet model that was pretrained on ImageNet. The training is performed using stochastic gradient descent with momentum and weight decay. To prevent overfitting, the sixth and seventh layers are configured to use dropout. The training process is finished after a fixed number of epochs. The Caffe framework [jia2014caffe] is used to train this model.

The ELMs are trained and evaluated on the Tobacco-3482 dataset [doclass_Kumar14], which contains 3,482 images from 10 classes. The images are passed through the CNN stub and the activations of the fifth pooling layer are presented to the ELM (cf. Fig. 4).

V Experiments and Results

Figure 5: Mean accuracy achieved by the different ELM classifiers in comparison to the original networks.

V-A Datasets

In this paper, two datasets are used. First, we use the Ryerson Vision Lab Complex Document Information Processing (RVL-CDIP) dataset [harley2015icdar] to train a full AlexNet. This dataset contains 400,000 images which are evenly distributed across 16 classes. 320,000 of the images are dedicated to training, and 40,000 images each are dedicated to validation and testing.

Secondly, we use the Tobacco-3482 dataset [doclass_Kumar14] to train the presented ELM and evaluate its performance. This dataset contains 3,482 images from ten document classes.

As there exists some overlap between the two datasets, we exclude the images that are contained in both datasets from the large dataset. Therefore, AlexNet is not trained on the full 320,000 training images but only on the remaining subset.

V-B Evaluation Scheme

To allow for a fair comparison with other approaches on the Tobacco-3482 dataset, we use a similar evaluation protocol as Kang et al. [lekang-14-a] and Harley et al. [harley2015icdar]. Specifically, we conduct several experiments with training datasets of different sizes. We only use subsets of the Tobacco-3482 dataset for training, ranging from 10 images per class to 100 images per class. The remaining images are used for testing. Since the dataset is so small, for each of these dataset splits, we randomly create ten different partitions to train and evaluate our classifiers and report the median performance. Note that the ELMs are not optimized on a validation set. Thus, no validation set is needed.
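The partitioning scheme of this protocol can be sketched with the Python standard library. The helper name `make_partition`, the seeds and the toy label list are hypothetical; the sketch only illustrates drawing a fixed number of training samples per class at random and using all remaining samples for testing, repeated for ten partitions.

```python
import random
from collections import defaultdict


def make_partition(labels, per_class, seed):
    """Return (train_idx, test_idx) with `per_class` random training samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx = []
    for label in sorted(by_class):
        # Sample without replacement inside each class.
        train_idx.extend(rng.sample(by_class[label], per_class))
    train_set = set(train_idx)
    # Everything that was not drawn for training is used for testing.
    test_idx = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train_idx), test_idx


# Hypothetical toy label list: 10 classes with 30 samples each.
labels = [c for c in range(10) for _ in range(30)]

# Ten random partitions for one training-set size (here 10 samples per class).
partitions = [make_partition(labels, per_class=10, seed=s) for s in range(10)]
```

A classifier would then be trained and evaluated once per partition, and the per-partition accuracies summarized, for example with `statistics.median`.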

Structural methods [afzal2015deepdocclassifier]
AlexNet (ImageNet)
AlexNet-ELM (ImageNet)
AlexNet (RVL-CDIP)
Table I: Accuracy achieved on the Tobacco-3482 dataset by the different classifiers with different pretraining. Here, all networks use 100 images per class during training and the rest for testing. The reported accuracy is the mean accuracy achieved on 10 different dataset partitions.
Classifier           Training             Testing
AlexNet (GPU)        10 min, 34 sec       3480 ms
AlexNet-ELM (GPU)    1176 ms              3066 ms
AlexNet (CPU)        6 h, 44 min, 8 sec   4 min, 30 sec
AlexNet-ELM (CPU)    1 min, 26 sec        4 min, 19 sec
Table II: Time needed to train and test the classifiers using an NVIDIA Tesla K20x as GPU and an Intel i7-6700K @ 4.00GHz as CPU. The testing time is the time required to classify the entire test set of 2,482 images.

V-C Experiments

As a first and baseline experiment, we train AlexNet, pretrained on ImageNet, on the Tobacco-3482 dataset, as was already done by Afzal et al. [afzal2015deepdocclassifier]. As described above, we train multiple versions of the network with different partitions per training data size. In total, 100 networks are trained, i.e. 10 networks for each of 10, 20, …, 100 training images per class. The training datasets for these experiments are further subdivided into a dataset for actual training and a dataset for validation (cf. [harley2015icdar]).

Secondly, we train an ImageNet-initialized AlexNet on the RVL-CDIP training images and discard the fully-connected part of the network. The network stub is used as a feature extractor to train and test our ELMs. The ELMs are trained on the Tobacco-3482 dataset as described in Section V-B. As these networks depend on random initialization, we train multiple ELMs for each of the partitions and report the mean accuracy for each partition size.

V-D Results

The performance of our proposed classifiers in comparison to the current state-of-the-art is depicted in Fig. 5. As can be seen, the ELM classifier with document pretraining already outperforms the current state-of-the-art with fewer than the full 100 training samples per class. With 100 training samples per class, the test accuracy reaches 83.24% (cf. Table I), which corresponds to a relative error reduction of 25% compared to the previous CNN-based approach.
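The relative error reduction can be made explicit with a short calculation. The sketch below uses only the 83.24% accuracy and the 25% reduction stated in the abstract; the symbols $e_{\text{old}}$, $e_{\text{new}}$ and $r$ are introduced here for illustration, and the resulting baseline error is an implied value obtained by taking the 25% figure at face value, not a number reported in the text.

```latex
\begin{align*}
e_{\text{new}} &= 100\% - 83.24\% = 16.76\% \\
r &= \frac{e_{\text{old}} - e_{\text{new}}}{e_{\text{old}}}
\quad\Longrightarrow\quad
e_{\text{old}} = \frac{e_{\text{new}}}{1 - r} \\
r = 0.25 &\;\Longrightarrow\; e_{\text{old}} \approx \frac{16.76\%}{0.75} \approx 22.35\%
\end{align*}
```

In other words, a quarter of the errors made by the earlier approach would be removed, even though the absolute accuracy gain is only a few percentage points.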

Together with this performance boost, the runtime needed for both training and testing is reduced (cf. Table II). Especially in the case of GPU-accelerated training, the proposed approach is more than 500 times faster than the current state-of-the-art (1176 ms versus 10 min, 34 sec). For both training and testing, the combined CNN/ELM approach needs about 1 ms per image, thus making it real-time. As a large share of the total runtime is used for the feature extraction, a different CNN architecture could speed this up even further.
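The runtime claims can be checked against the GPU rows of Table II with a few lines of arithmetic. The timings below are copied from the table; the image counts are assumptions derived from the text (2,482 test images as stated in the abstract, and 100 training images for each of 10 classes), and the variable names are chosen for illustration.

```python
# GPU timings from Table II, converted to seconds.
alexnet_train_s = 10 * 60 + 34   # AlexNet: 10 min, 34 sec
elm_train_s = 1.176              # AlexNet-ELM: 1176 ms
elm_test_s = 3.066               # AlexNet-ELM: 3066 ms

# Assumed image counts (see lead-in).
n_train_images = 100 * 10        # 100 images for each of 10 classes
n_test_images = 2482

# Speed-up of ELM training over fine-tuning the full network.
train_speedup = alexnet_train_s / elm_train_s

# Average time per image in milliseconds.
train_ms_per_image = 1000 * elm_train_s / n_train_images
test_ms_per_image = 1000 * elm_test_s / n_test_images

print(f"training speed-up: {train_speedup:.0f}x")
print(f"training: {train_ms_per_image:.2f} ms/image")
print(f"testing:  {test_ms_per_image:.2f} ms/image")
```

Under these assumptions both per-image times land slightly above one millisecond, which is consistent with the "millisecond per image" figure quoted elsewhere in the paper.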

The ELM classifier with ImageNet pretraining achieves an accuracy which is comparable to that of the current state-of-the-art at a fraction of the computational costs.

Note that AlexNet pretrained on the RVL-CDIP dataset and fine-tuned on the Tobacco-3482 dataset achieves even better performance in terms of accuracy. However, as this would be as slow as the current state-of-the-art, it is not in the scope of this paper. The main idea of this work is to provide a fast and accurate classifier.

A confusion matrix of an exemplary ELM classifier is shown in Fig. 6. As can be seen, the class Scientific is by far the hardest to recognize. This result is consistent with Afzal et al. [afzal2015deepdocclassifier] and can be explained by the low inter-class variance between the classes Scientific and Report.

V-E Experiments with deeper architectures

For completeness, we also conduct the described experiments with GoogLeNet [szegedy2015going] and ResNet-50 [he2016deep] as underlying network architectures. As depicted in Fig. 5, the networks perform extremely well.

However, since both of these architectures have only one fully connected layer for classification which is replaced by the ELM, there is no runtime improvement at inference time, but only at training time. Furthermore, due to the depth of these models, we have to drastically reduce the batch size which decreases the degree of parallelism and makes these approaches not viable for real-time training with a single GPU.

Figure 6: Confusion matrix of an exemplary ELM.

VI Conclusion and Future Work

We have addressed the problem of real-time training for document image classification. In particular, we present a document classification approach that trains in real-time, i.e. in about a millisecond per image, and outperforms the current state-of-the-art by a large margin. We suggest a two-stage approach which uses feature extraction from deep neural networks and efficient training using ELMs. The latter stage leads to superior performance in terms of efficiency. Several quantitative evaluations show the power and potential of the proposed approach. This is a big leap forward for DAS that depend on quick system responses.

An interesting future direction is the fast extraction of image features because, in the presented approach, a large share of the time is consumed by feature extraction from deep neural networks. Another future experiment is to benchmark the GoogLeNet and ResNet-50 based ELM classifiers on a high-performance cluster.