In this work we focus on the problem of document image representation and understanding. Given images of documents, we are interested in learning how to represent the documents to perform tasks such as classification, retrieval, clustering, etc. Document understanding is a key aspect in, for instance, digital mail-room scenarios, where the content of the documents is used to route incoming documents to the right workflow, extract relevant data, annotate the documents with additional information such as priority or relevance, etc.
Traditionally, there have been three main cues that are taken into account when looking to represent and understand a document image: visual cues, structural cues, and textual cues. The visual cues describe the overall appearance of the document, and capture the information that would allow one to differentiate documents “at a glance”. The structural cues explicitly capture the relation between the different elements of the documents, for example by performing a layout analysis and encoding the different regions in a graph. Although visual descriptors can capture similar information implicitly, the representations based on structural cues focus on capturing them in an explicit manner. Finally, textual cues capture the textual information of the document, which can contain important semantic information.
In many cases, these cues contain complementary information, and their combined use would be desired. Unfortunately, obtaining structural and textual features is usually computationally expensive, and these costs become prohibitive in large-scale domains. For example, structural features usually require a layout analysis of the document, which is slow and error-prone. Similarly, textual cues usually require performing OCR on the entire document, which is once again slow and error-prone. Moreover, these two kinds of features are very domain-specific, and, in general, do not transfer well between different domains and tasks.
On the other hand, visual features are usually fast to obtain while being quite generic. This has motivated their use in many document understanding works [10, 21, 46, 2, 45, 5, 25, 16, 17, 43, 18, 19, 33]. Although not as expressive as pure structural features, visual features can typically encode some coarse structure of the image, while textual information can be added as an additional step depending on the specific domain. Recently, deep learning techniques have been used to build visual representations of documents, showing promising results and outperforming handcrafted, shallow visual features in classification and retrieval tasks [24, 20].
Motivated by their success and advantages, in this work we focus on visual features for document representation, and propose a comprehensive experimental study where we compare handcrafted, shallow features with more recent, learned features based on deep learning. In particular, although deep features have shown outstanding performance in many computer vision tasks, only a few works have focused on learning features for document images using convolutional networks (e.g. [24, 20]), and their comparison with other shallow methods has been limited. This recent shift towards deep learning in document image understanding raises two questions: First, given a task and a dataset, do these deep methods outperform shallow features in all cases? Second, how well do they transfer to different domains (i.e. datasets) and to different tasks, if one wants to reduce their training cost by reusing a pre-trained representation or model? These crucial questions have not been addressed in detail yet.
Additionally, some hybrid architectures have recently been proposed for natural image classification. Built on top of shallow features, they also include several layers that allow them to be trained end-to-end similarly to deep models. The underlying motivation is to combine the advantages of shallow features (faster training and good generalization) with the expressiveness of deep architectures. In this paper we propose to evaluate them in the context of document image understanding in comparison with shallow features and deep convolutional networks.
Our contribution is therefore threefold:
First, we benchmark several standard features against different flavors of recently proposed deep features on the document classification task.
Second, we explore hybrid architectures as an appealing compromise between reusable but weaker shallow features, and specialized but high-performing deep features.
Third, we evaluate the transferability of all these features across domains and across tasks.
Accordingly, this article is organized as follows. In Section 2, we review related work. Section 3 describes the different feature representations that we consider for this work. Section 4 details the training procedure with and without domain shift. Section 5 describes the datasets and implementation details used in our experiments. The experimental results in Section 6 are divided into two parts. The first set of experiments (Section 6.1) compares all the features with a standard protocol to tackle document image classification. The second set of experiments (Section 6.2) studies the transferability of the features to different datasets and different tasks. Finally, Section 7 concludes our benchmark study.
2 Related Work
Traditional visual features for document images usually rely on simple statistics computed directly from the image pixels. For example, Heroux et al. propose a multi-scale density decomposition of the page to produce fixed-length descriptors constructed efficiently from integral images. A similar idea is presented by Reddy and Govindaraju, where representations based on low-level pixel density information are classified using adaptive boosting. Cullen et al. propose to use a combination of features including densities at interest points, histograms of the size and the density of the connected components and vertical projection histograms. Bagdanov and Worring propose a representation based on density changes obtained with different morphological operations, while Sarkar describes document images as a list of salient Viola-Jones based features. Joutel et al. propose the use of curvelets to capture information about handwritten strokes in the image. However, this descriptor is tailored to the specific task of retrieving images with similar handwriting styles, and its use beyond that particular task is limited.
Some more elaborate representations, such as the RunLength histograms [5, 25, 18], have been shown to be more generic and hence better suited for document image representation. Many of these representations can be combined with spatial pyramids to explicitly add a coarse structure, leading to higher accuracies at the cost of higher-dimensional representations. However, in general, all these traditional features contain a relatively limited amount of information, and while they might perform well on the specific dataset and task for which they were designed, they are not generic enough to handle various document class types, datasets and tasks.
In a different direction, some more recent works [19, 15, 27, 8] have drawn inspiration from representations typically used for natural images, and have shown that popular natural image representations such as the bag-of-visual-words (BoV) or the Fisher-Vector, built on top of densely-extracted local descriptors such as SIFT or SURF, lead to notable improvements. All the latter representations are in general task-agnostic. They are combined with the right algorithm, such as a classifier or a clustering method, in order to produce the right prediction depending on the target application. These shallow features were shown to generalize very well across tasks.
Recently, deep features, and convolutional neural networks (CNN) in particular, were applied to document images and have shown better classification and retrieval performance than some shallow features (BoV) [20, 24]. The characteristic of deep features is that they are learnt end-to-end. This means that the two previously distinct steps of i) feature construction and ii) prediction (classification in most cases) are merged into one step. In other words, the feature and the classifier are learnt jointly and cannot be distinguished any more. They have recently been shown to outperform some shallow features (BoV) by a large margin, but they are highly specialized for a specific task, and their use as a generic feature extractor for document images has not been studied in detail. Also, they are a lot more costly to train, as learning can easily take several days on a GPU.
3 Feature Representations for Document Images
In this paper, we consider a broad range of feature representations for document images. First, we select two shallow features that were successfully used in various document image tasks [19, 15, 8]: the RunLength feature (Section 3.1) and the Fisher-Vector representation (Section 3.2). We also experiment with deep features, more precisely two different convolutional neural network architectures, AlexNet and GoogLeNet, described in Section 3.3. Finally, in Section 3.4 we briefly recall the hybrid architecture, which is used here for the first time for document image representation.
3.1 RunLength features
The main intuition behind the RunLength (RL) features is to encode sequences of pixels that share the same value and that are aligned (e.g. vertically, horizontally or diagonally). The “run-length” is the length of those sequences (see e.g. the green rectangles in Figure 1). We first binarize the document images and consider only runs of black and white pixels. In case of color images, we binarize the luminance channel using a simple thresholding at 0.5 (where image pixel intensities are represented between 0 and 1). More complex binarization techniques exist (see e.g. participations in the DIBCO and HDIBCO contests); however, testing them is out of the scope of this paper.
Note that optionally, we can resize the images after binarization to have the same resolution within the dataset. In our experiments, we select a maximum number of pixels (250K, see Section 5.2) and we downscale all images that are larger, keeping the aspect ratio, but we do not upscale images that are below this target size.
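The resizing rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `target_size` is hypothetical, the default budget of 250,000 pixels is taken from Section 5.2, and rounding down is our own assumption, chosen so that the pixel budget is never exceeded.

```python
import math


def target_size(width, height, max_pixels=250_000):
    """Return the (width, height) to resize to, keeping the aspect ratio.

    Images at or below max_pixels are returned unchanged (no upscaling).
    """
    n_pixels = width * height
    if n_pixels <= max_pixels:
        return width, height
    # Scale both sides by the same factor so that the area shrinks to the budget.
    scale = math.sqrt(max_pixels / n_pixels)
    # Round down so the budget is never exceeded; keep at least one pixel per side.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))
```

For example, a 2000 x 1000 image is mapped to a size whose area is at most 250,000 pixels and whose aspect ratio stays close to 2, while a 100 x 100 image is left as is.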
On the binarized images, the numbers of (black or white) pixel runs are collected into histograms. As suggested in [15, 18], we use a logarithmic quantization of the lengths to build these histograms in order to be less sensitive to noise and small variations:
$$[1],\ [2],\ [3\text{-}4],\ [5\text{-}8],\ \ldots,\ [(2^{Q-3}+1)\text{-}2^{Q-2}],\ [> 2^{Q-2}].$$
This yields two histograms of length $Q$ per direction, one for the white pixels and one for the black pixels. We compute these runs in four directions, horizontal, vertical, diagonal and anti-diagonal, and concatenate all the obtained histograms. An image (or image region) is then represented by this $4 \times 2 \times Q$-dimensional RL histogram.
In order to better capture information about the page layout we use a spatial pyramid with several layers such that at each level the image is divided into a grid of regions, and the RL histograms computed on these regions are concatenated to obtain the full image signature (see illustration in Figure 1). To obtain the final RL image feature, we L1-normalize and apply component-wise square-rooting. As the best performance was previously obtained with 5 layers and $Q = 11$, we use this configuration; hence, in our experiments, the final RL features are 10648-dimensional.
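To make the RunLength descriptor concrete, here is a minimal pure-Python sketch for a single image region. It assumes a binary image given as a list of rows of 0/1 values and the 11 logarithmic levels listed in Section 5.2 (1, 2, 4, ..., 512, larger than 512). The function names are ours, and the spatial pyramid is left out for brevity.

```python
from itertools import groupby

Q = 11  # number of quantization levels, as in Section 5.2


def quantize(length, q=Q):
    """Logarithmic bin index: 1, 2, 3-4, 5-8, ..., 257-512, >512 for q=11."""
    return min((length - 1).bit_length(), q - 1)


def lines(img, direction):
    """Yield the pixel sequences of a binary image along one direction."""
    h, w = len(img), len(img[0])
    if direction == "horizontal":
        yield from img
    elif direction == "vertical":
        yield from ([img[y][x] for y in range(h)] for x in range(w))
    elif direction == "diagonal":  # top-left to bottom-right
        for d in range(-(h - 1), w):
            yield [img[y][y + d] for y in range(h) if 0 <= y + d < w]
    elif direction == "anti-diagonal":  # top-right to bottom-left
        for s in range(h + w - 1):
            yield [img[y][s - y] for y in range(h) if 0 <= s - y < w]


def rl_histogram(img, q=Q):
    """Concatenate the run-length histograms of both pixel values, four directions."""
    feat = []
    for direction in ("horizontal", "vertical", "diagonal", "anti-diagonal"):
        hist = {0: [0] * q, 1: [0] * q}  # one histogram per pixel value
        for line in lines(img, direction):
            for value, run in groupby(line):
                hist[value][quantize(len(list(run)), q)] += 1
        feat += hist[0] + hist[1]
    return feat


def normalize(feat):
    """L1-normalize, then take the component-wise square root."""
    total = sum(feat) or 1
    return [(v / total) ** 0.5 for v in feat]
```

With 11 levels, two pixel values and four directions, each region yields 88 values; a pyramid version would concatenate the normalized histograms of all regions.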
3.2 Fisher-Vector representations
The Fisher-Vector (FV) is a representation that goes beyond simple counting (0-order statistics) and that encodes higher order statistics about the distribution of local descriptors assigned to visual words. Similarly to the BoV, the FV depends on an intermediate representation: the visual vocabulary. The visual vocabulary can be seen as a probability density function (pdf) which models the emission of the low-level descriptors in the image. We represent this density by a Gaussian mixture model (GMM).
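The FV characterizes the set of low-level features $X = \{x_t\}_{t=1}^{T}$ (in our case SIFT features), extracted from an image $I$, by encoding the modifications of the GMM that are necessary to best fit this particular feature set. Assuming independence, this can be written as the gradient of the log-likelihood of the data on the model:
$$G_\lambda(I) = \sum_{t=1}^{T} \nabla_\lambda \log p(x_t \mid \lambda), \qquad p(x_t \mid \lambda) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x_t \mid \mu_k, \Sigma_k),$$
where $\lambda = \{w_k, \mu_k, \Sigma_k\}_{k=1}^{K}$; $w_k$, $\mu_k$ and $\Sigma_k$ denote respectively the weight, mean vector and covariance matrix of Gaussian $k$, and $K$ is the number of Gaussians in the mixture.
To compare two images $I$ and $J$, a natural kernel on these gradients is the Fisher Kernel $\mathcal{K}(I, J) = G_\lambda(I)^\top F_\lambda^{-1} G_\lambda(J)$, where $F_\lambda$ is the Fisher Information Matrix. As $F_\lambda$ is symmetric and positive definite, its inverse has a Cholesky decomposition $F_\lambda^{-1} = L_\lambda^\top L_\lambda$, and $\mathcal{K}(I, J)$ can be rewritten as a dot-product between normalized vectors $\Gamma_\lambda$ where:
$$\Gamma_\lambda(I) = L_\lambda\, G_\lambda(I),$$
to which we refer as the Fisher-Vector (FV) of the image $I$.
Assuming diagonal covariance matrices and considering only the gradients with respect to the means and standard deviations, this gives
$$\Gamma_{\mu_k^d}(I) = \sum_{t=1}^{T} \frac{\gamma_k(x_t)}{\sqrt{w_k}}\, \frac{x_t^d - \mu_k^d}{\sigma_k^d}, \qquad \Gamma_{\sigma_k^d}(I) = \sum_{t=1}^{T} \frac{\gamma_k(x_t)}{\sqrt{2 w_k}} \left[ \frac{(x_t^d - \mu_k^d)^2}{(\sigma_k^d)^2} - 1 \right],$$
where $\gamma_k(x_t)$ is the soft assignment of $x_t$ to Gaussian $k$, and $\mu_k^d$ and $\sigma_k^d$ are the elements of $\mu_k$ and of the diagonal of $\Sigma_k$. The final gradient vector concatenates all $\Gamma_{\mu_k^d}$ and $\Gamma_{\sigma_k^d}$, and is $2KD$-dimensional, where $D$ is the dimension of the low level features $x_t$. We then apply component-wise square-rooting followed by L2-normalization to produce the final Fisher-Vectors. The full process is illustrated in Figure 2.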
In our experiments we consider either this image-level FV with a large number of Gaussians in the vocabulary (256) or a spatial pyramid version with smaller vocabulary sizes (a 4 x 4 grid is combined with 16 Gaussians and an 8 x 8 grid with 4 Gaussians). We use a single layer for the spatial pyramid; initial experiments with multiple-layer spatial pyramids, as in the case of RL, did not improve results. This consistently yields a 40960-dimensional vector representation: we reduce SIFT features from 128 to 77 dimensions, and add the center and the scale of the patch in order to capture some location information, i.e. the local descriptors are 80-dimensional.
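As an illustration of the encoding, the following pure-Python sketch computes a Fisher-Vector under the usual assumptions of the FV literature, which we adopt here as assumptions: a GMM with diagonal covariances, gradients taken with respect to means and standard deviations only, and a signed square root before L2-normalization. All names are ours, and a real implementation would use vectorized code and a GMM trained on many descriptors.

```python
import math


def gaussian_diag(x, mu, sigma):
    """Density of a diagonal-covariance Gaussian evaluated at x."""
    log_p = 0.0
    for xd, md, sd in zip(x, mu, sigma):
        log_p += -0.5 * math.log(2 * math.pi * sd * sd) - 0.5 * ((xd - md) / sd) ** 2
    return math.exp(log_p)


def fisher_vector(descriptors, weights, means, sigmas):
    """Gradients w.r.t. GMM means and standard deviations, then normalization."""
    K, D = len(weights), len(means[0])
    g_mu = [[0.0] * D for _ in range(K)]
    g_sigma = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        lik = [w * gaussian_diag(x, m, s) for w, m, s in zip(weights, means, sigmas)]
        total = sum(lik)
        if total == 0.0:  # descriptor too far from every Gaussian (underflow)
            continue
        for k in range(K):
            gamma = lik[k] / total  # soft assignment of x to Gaussian k
            for d in range(D):
                z = (x[d] - means[k][d]) / sigmas[k][d]
                g_mu[k][d] += gamma * z / math.sqrt(weights[k])
                g_sigma[k][d] += gamma * (z * z - 1) / math.sqrt(2 * weights[k])
    # Concatenate all mean gradients, then all standard-deviation gradients.
    fv = [v for block in (g_mu, g_sigma) for row in block for v in row]
    # Signed component-wise square root, followed by L2 normalization.
    fv = [math.copysign(math.sqrt(abs(v)), v) for v in fv]
    norm = math.sqrt(sum(v * v for v in fv)) or 1.0
    return [v / norm for v in fv]
```

The output has 2 x K x D components, which for 256 Gaussians and 80-dimensional descriptors gives the 40960 dimensions quoted above.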
3.3 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are composed of several layers that combine linear as well as non-linear operators jointly learned, in an end-to-end manner, to solve a particular task. Typically, they have a standard structure: stacked convolutional layers (optionally combined with contrast normalization and max pooling), followed by one or more fully-connected layers, and a softmax classifier as the final layer. Therefore, a feed-forward neural network can be thought of as the composition of a number of functions
$$f(x) = f_L(\ldots f_2(f_1(x; w_1); w_2) \ldots; w_L),$$
where each function $f_l$ takes as input $x_l$ and a set of parameters $w_l$, and produces $x_{l+1}$ as output.
Convolutional layers are the core building blocks of CNNs and consist of a set of small and learnable filters that extend through the full depth of the input volume and slide across its width and height. Max pooling layers are inserted in-between successive convolutional layers in order to progressively reduce the spatial size of the representation and the number of parameters of the network; hence they also help control over-fitting. Local contrast operation layers are used to normalize the responses across feature maps. The fully-connected layers are linear projections, i.e. matrix multiplications followed by a bias offset, where the neurons are connected to all activations of the previous layer. CNNs also use ReLU non-linearities ($\mathrm{ReLU}(x) = \max(0, x)$), which rectify the feature maps to ensure they remain non-negative.
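The building blocks described above can be sketched in a few lines of pure Python. This is a toy illustration under simplifying assumptions: signals are one-dimensional lists, convolutions are omitted, and all function names are ours. Its only purpose is to show a network as a chain of layer functions applied one after another.

```python
import math


def fully_connected(x, weights, bias):
    """Linear projection: one dot product per output neuron, plus a bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Rectified linear unit: max(0, v) applied component-wise."""
    return [max(0.0, v) for v in x]


def max_pool_1d(x, size=2, stride=2):
    """Max pooling over a 1-D signal (the 2-D case pools over image windows)."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]


def softmax(x):
    """Turn scores into a probability distribution over classes."""
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """Feed-forward pass: each layer is applied to the output of the previous one."""
    for layer in layers:
        x = layer(x)
    return x
```

A two-layer example would be `forward(x, [lambda v: relu(fully_connected(v, W, b)), softmax])`, where `W` and `b` are hypothetical weights that would be learned as described in Section 4.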
Although the architecture of these networks, which is defined by the hyperparameters and the arrangement of these blocks, is commonly handcrafted, the parameter set of the network is learned in a supervised manner from a set of labeled images (see Section 4).
Since their introduction in the early 1990’s (LeNet), and mostly since their recent success in various challenges including the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), many different CNN architectures have been proposed [26, 49, 52, 47]. In this paper we focus on two popular ones: AlexNet and GoogLeNet.
The AlexNet architecture, proposed by Krizhevsky et al., was the first successful CNN architecture for the image classification task, outperforming shallow methods by a large margin in the ILSVRC 2012 competition. This network is composed of eight layers with weights. The first five are convolutional layers with (96, 256, 384, 384, 256) kernels of sizes (11, 5, 3, 3, 3) and a stride of 1 pixel, except for the first layer that has a stride of 4. A response normalization is applied after layers 1 and 2, and max pooling with size 2 and a stride of 2 pixels is applied after layers 1, 2, and 5. These are followed by three fully connected layers of sizes 4096, 4096, and $C$ respectively, where $C$ is the number of classes. The output of the last fully-connected layer is fed to a $C$-way softmax which produces a distribution over the class labels. A ReLU non-linearity is applied after every convolutional or fully connected layer. The network is fed with fixed-size images (note that in contrast to shallow features, where the aspect ratio is kept when resizing the images, here the aspect ratio can be modified). The architecture is summarized in Figure 3.
Szegedy et al. set a new state-of-the-art in image classification and object recognition in the ILSVRC 2014 competition with a significantly different architecture, GoogLeNet. It uses a deeper and wider architecture than traditional CNNs, with 10 times fewer parameters compared to standard CNNs.
The main idea behind it is the inception architecture, based on finding out how an optimal local sparse structure in a convolutional neural network can be approximated and covered by readily available dense components. For that, GoogLeNet relies on several inception layers (Figure 4), where each such layer uses a series of trainable filters with sizes $1 \times 1$, $3 \times 3$ and $5 \times 5$. In this way, there are multiple filter sizes per layer, so each layer has the ability to target the different feature resolutions that may occur in its input. In order to avoid computational blow-up, it also performs dimensionality reduction by $1 \times 1$ convolutions inserted before the expensive $3 \times 3$ and $5 \times 5$ convolutions. Finally, the inception module also includes a parallel pooling path, which is concatenated along with the output of the convolutional layers into a single output vector forming the final output.
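The effect of the dimensionality reduction can be illustrated by counting weights. The sketch below compares an inception module with 1 x 1 reductions against a naive variant that applies the 3 x 3 and 5 x 5 filters directly to the input. The channel counts are illustrative values chosen for this example, not figures taken from the text above, and biases are ignored.

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return c_in * c_out * k * k


def inception_params(c_in, n1, r3, n3, r5, n5, pool_proj):
    """Weights of an inception module with 1x1 reductions before 3x3 and 5x5."""
    return (conv_params(c_in, n1, 1)
            + conv_params(c_in, r3, 1) + conv_params(r3, n3, 3)
            + conv_params(c_in, r5, 1) + conv_params(r5, n5, 5)
            + conv_params(c_in, pool_proj, 1))


def naive_params(c_in, n1, n3, n5):
    """Same output filter counts, with 3x3 and 5x5 applied directly to the input."""
    return (conv_params(c_in, n1, 1)
            + conv_params(c_in, n3, 3)
            + conv_params(c_in, n5, 5))


# Illustrative channel counts (assumed for this example).
reduced = inception_params(192, 64, 96, 128, 16, 32, 32)
naive = naive_params(192, 64, 128, 32)
print(reduced, naive)
```

With these values the reduced module needs fewer than half the weights of the naive one, even though it also includes a projection after the pooling path.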
One benefit of the inception architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up of the computational complexity. An inception-based network is therefore a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. Another important characteristic of this architecture is that it uses average pooling instead of fully connected layers at the top of the last inception layer, eliminating in this way a large number of parameters.
The GoogLeNet architecture that won the ILSVRC 2014 challenge is shown in Figure 5. It is a network of 22 layers where nine inception modules are stacked after two convolutional layers with filter sizes of 7 and 3 and strides of 2 and 1. Max pooling layers with size 3 and stride 2 are inserted after convolutional layers 1 and 2, and after the inception layers 3b, 4e, and 5b. An average pooling layer with size 7 and stride 1 follows the last inception layer 5b, whose output is fed to a single fully-connected layer and the $C$-way softmax classifier, where $C$ is the number of classes. All the convolutions, including those inside the inception modules, use rectified linear activation (ReLU).
3.4 Hybrid descriptors
Our last representation is a hybrid descriptor that is built using a hybrid architecture drawing inspiration from both FVs and CNNs. This architecture combines an unsupervised part, obtained by an image-level patch-based Fisher-Vector encoding, and a supervised part composed of fully connected layers. The intuition of this model is to replace the convolutional layers of the CNN architecture with a FV representation and to learn subsequent fully-connected layers in a supervised way, akin to a Multi-Layer Perceptron (MLP), trained with back-propagation. We provide details below for the resulting hybrid architecture, illustrated in Figure 6.
The unsupervised part of the hybrid architecture is identical to the FV representation described in Section 3.2. We consider two versions. In the first one, the FV representation is followed by a PCA projection and L2-normalization, as in the original proposal. Alternatively, we consider a hybrid architecture that directly builds on the full-dimensionality FVs, in which case the dimensionality reduction is performed implicitly by the first fully connected layer of the supervised part of the architecture (this would be illustrated by a modified Figure 6 in which the FV is used directly as input to the supervised part).
The supervised part uses a set of fully connected layers of size 4096 and a last layer of size $C$, where $C$ is the number of classes, and a ReLU non-linearity is applied after every fully connected layer. As in AlexNet and GoogLeNet, the output of the last fully-connected layer is fed to a $C$-way softmax which produces a distribution over the class labels.
As an alternative, we also replace the unsupervised part of this architecture by RunLength histograms described in section 3.1.
As shallow, deep and hybrid features require different learning paradigms, this section details the training procedure for them in the context of the two scenarios that we consider in the experimental part. In the first one, training and testing are done on the same dataset, for the same task (Section 4.1). In the second one, both the dataset and the task can vary, and a transfer mechanism is needed (Section 4.2).
4.1 Training for the task at hand
The RL does not require any training: all the parameters are predefined, so this descriptor is truly dataset-agnostic. The FV requires a visual codebook that is learned in an unsupervised manner (by clustering local features extracted from the training set). Beyond this unsupervised training step, this descriptor does not depend on the data, and more importantly on the labels, hence it is independent from the task. To solve a classification problem, document image features and document labels are used to train a classifier. In all our experiments we use a linear Support Vector Machine (SVM) classifier on top of RL or FV features.
Unlike the previous two representations (RL and FV), CNNs are deep learning approaches that group feature extraction and prediction into a single architecture. Consequently, features are learned to optimize the prediction task, and the classifier is already integrated in the architecture. Therefore, at test time, the full architecture is used to predict the document label.
The set of parameters of the CNN, which includes the filters in the convolutional layers and the weight matrices and biases in the fully-connected layers, is learned in a supervised manner from a set of labeled images using a suitable loss function for the task to be solved. In our case, we train both AlexNet and GoogLeNet for document image classification. To train the parameters, we use the standard objective that involves minimizing the cross-entropy between the network output and the ground-truth:
$$\mathcal{L} = -\sum_{n=1}^{N} \sum_{c=1}^{C} y_{nc} \log \hat{y}_{nc},$$
where $y_{nc}$ is the ground-truth label, $\hat{y}_{nc}$ is the prediction of label $c$ for image $n$ as output by the last layer, $C$ is the number of classes and $N$ is the number of available training examples. We update the parameters via stochastic gradient descent (SGD) by back-propagating the derivative of the loss with respect to the parameters throughout the network. To avoid over-fitting, we use drop-out at the input of the fully-connected layers.
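For one-hot ground-truth labels, the cross-entropy objective reduces to the negative log-probability of the correct class. The sketch below illustrates this for raw class scores, together with the gradient that back-propagation starts from. It sums over examples (dividing by the number of examples would give the mean instead), and the function names are ours.

```python
import math


def log_softmax(scores):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def cross_entropy(batch_scores, labels):
    """Sum over examples of minus the log-probability of the true class."""
    return -sum(log_softmax(scores)[label]
                for scores, label in zip(batch_scores, labels))


def grad_wrt_scores(scores, label):
    """Gradient of the per-example loss w.r.t. the scores: softmax minus one-hot."""
    probs = [math.exp(lp) for lp in log_softmax(scores)]
    return [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
```

An SGD step then moves each parameter against this gradient, propagated backwards through the layers.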
The hybrid descriptors share similarities both with traditional features (FV or RL depending on what is used in the unsupervised part) and with deep features. In the case of FV, the unsupervised part requires learning the visual codebook on patches extracted from the dataset. The supervised part is trained end-to-end with the classifier integrated in the architecture. To learn the parameters of the supervised part, as for the CNNs, we minimize the cross-entropy between the label predicted by the last layer and the ground-truth labels. Weights are updated using back-propagation. Again, we use drop-out.
4.2 Training for a different task
In the case of shallow features (RL or FV), descriptors can be used for a different task, and they only need to be combined with the right predictor (ranking, new classifier, clustering algorithm, etc.).
In the case of CNNs, besides their common use to solve a given task in an end-to-end manner, it has also become standard practice to use them as feature extractors. Convolutional filters in the first layers can be seen as detectors of basic structures, like corners or straight lines, while deeper layers are able to capture more complex structures and semantic information. Therefore, a given image can be fed forward through the CNN and the activations of intermediate layers used as mid-level features to represent it. These off-the-shelf features can subsequently be combined with the right prediction algorithm. This finding was quantitatively validated for a number of tasks including image classification [12, 34, 52, 6, 40], image retrieval [40, 1], object detection, and action recognition. We show that these findings also generalize to document images.
For the CNNs, we extract features from different layers at different depths and compare their performance in the experimental section. In the case of AlexNet, we use as features the output activations of the last convolutional layer (pool5), and the output of the first two fully-connected layers (fc6 and fc7). In the case of GoogLeNet, we consider the output of different inception layers and the output of the average pooling layer preceding the fully-connected layer (p5s1). Concretely, we use inception layers 3a, 3b, 4a, 4e and 5b (see Figure 5). For the hybrid architecture, we experimented with the output activations of different fully-connected layers. We L2-normalize the activation features from both CNNs and hybrid architectures before feeding them into the desired predictor.
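As a small illustration of how such L2-normalized activation features can be fed to a predictor, the sketch below ranks database items by dot product with a query, which amounts to cosine similarity once the vectors are normalized. The feature vectors are hypothetical stand-ins for layer activations, and the function names are ours.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned as is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def rank_by_similarity(query, database):
    """Indices of database features sorted by decreasing dot product with the query."""
    q = l2_normalize(query)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(f))) for f in database]
    return sorted(range(len(database)), key=lambda i: -scores[i])
```

The same normalized vectors could instead be passed to a linear classifier or a clustering algorithm, depending on the target task.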
| dataset | # images | image size (pixels) | # categ. | category description |
| CLEF-IP | 38081 | 1.5K - 4.5M | 9 | patent image types |
| IH2 | 884 | 1.4M | 72 | document types and layout |
| IH3 | 7716 | 0.5M - 5M | 63 | fine-grained document types |
5 Evaluation framework
We conducted a broad experimental study comparing the feature representations described in the previous section on seven different datasets. We used four publicly available datasets, namely RVL-CDIP, NIST, MARG, and CLEF-IP. We also confirm our conclusions on three in-house customer datasets, which we refer to as IH1, IH2, and IH3. Statistics on the different datasets can be found in Table 1 and some illustrations in Figures 7, 8 and 9. We detail their characteristics below.
The Ryerson Vision Lab Complex Document Information Processing (RVL-CDIP) dataset (http://scs.ryerson.ca/~aharley/rvl-cdip/) is a subset of the IIT-CDIP Test collection. It is composed of 400000 images labeled with one of the following 16 categories: letter, memo, email, filefolder, form, handwritten, invoice, advertisement, budget, news article, presentation, scientific publication, questionnaire, resume, scientific report, and specification. Figure 7 shows 3 examples of each class.
The NIST Structured Forms Reference Set (http://www.nist.gov/srd/nistsd2.cfm) is a dataset of black-and-white images that consists of 5590 pages of synthesized documents. These documents correspond to 12 different tax forms from the IRS 1040 Package X for the year 1988 (see examples in Figure 8). Class names are Forms 1040, 2106, 2441, 4562, 6251 and Schedules A, B, C, D, E, F, SE.
The Medical Article Records Ground-truth (MARG) dataset (https://ceb.nlm.nih.gov/proj/marg/marg.php) consists of 1553 documents, each document corresponding to the first page of a medical journal. The dataset is divided into 9 different layout types. These layouts vary in the relative position of the title, the authors, the affiliation, the abstract and the text (see examples from four classes in Figure 8). Within each layout type, the document can be composed of one, two or three columns. This strongly affects visual similarity and makes classification, and even more so clustering, very challenging on this dataset.
The CLEF-IP dataset is the training set (http://www.ifs.tuwien.ac.at/~clef-ip/download/2011/index.shtml) released for the Patent Image Classification task of the CLEF-IP 2011 Challenge. In the challenge, the aim was to categorize patent images (i.e. figures) into 9 categories: abstract drawing, graph, flowchart, gene sequence, program listing, symbol, chemical structure, table and mathematics. We show example images grouped by class in Figure 8. The dataset contains between 300 and 6000 labeled images for each class, 38081 images in total, with a large variation of the image size (from as little as 1500 pixels to more than 4M pixels) and aspect ratio (from 1 to more than 10).
The first in-house dataset (IH1) groups internal document images from a single customer. It contains 11252 scanned documents from 14 different document categories such as invoices, contracts, IDs, coupons, handwritten letters, etc.
The second in-house dataset (IH2) is a small dataset of 884 multi-page documents (we only consider the first page) from a single customer, divided into 72 fine-grained categories representing both the document type, such as invoice, mail, table or map, and the document layout (e.g. “a mail with an excel table on the bottom”, “a table with black lines separating the rows”).
The third in-house dataset (IH3) contains 7716 documents collected from several customers. We divided the dataset into 63 fine-grained categories such as diverse types of forms, invoices, contracts, etc., where the class labels were defined on the one hand by the generic document type and on the other hand by their origin. Hence invoices, mails, and handwritten or typed letters that belong to different customers were considered as independent classes. The aim with this dataset was to go beyond generic document types or layouts and simulate real document image based applications where documents from several customers have to be processed together (e.g. in a print or scan flow).
5.2 Implementation details
Here we summarize the experimental details of our study.
For the RunLength histograms (RL) descriptor, we use a 5-layer pyramid and 11 quantization levels (1, 2, 4, …, 512, larger than 512), yielding 10648-dimensional features. Images were binarized (when necessary) and rescaled to 250K pixels. We use these features, which are truly dataset-independent, in both experimental parts.
Our Fisher-Vector descriptors are built on top of SIFT features extracted at 5 different scales (with varying patch sizes, in images rescaled to 250K pixels). The original SIFT features are projected using PCA to a 77-dimensional vector to which we concatenate the position (x,y) and the scale (s) of the patch, obtaining 80-dimensional local features. We consider different visual vocabulary sizes (4, 16 and 256 Gaussians in the mixture). For a fair comparison, the grid of the spatial pyramid varies in order to build FVs of the same dimension (40960). For a vocabulary with 4 Gaussians, we concatenate FVs on an 8 x 8 grid (denoted by FV4), for a vocabulary with 16 Gaussians, we concatenate FVs on a 4 x 4 grid (denoted by FV16), and for a vocabulary of size 256 we use the FV built on the whole image (denoted by FV256). Given a new dataset, we can either compute new SIFT-PCA and GMM models to build the FVs, or reuse the models (PCA and GMM) trained in an unsupervised manner on the RVL-CDIP dataset. We opted for the second strategy for two reasons: first, preliminary experiments have shown very similar results, and second, our study aims at testing the transferability of the models to new datasets.
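The dimensionalities quoted for the RL and FV descriptors can be checked with simple arithmetic. The sketch below assumes that an RL histogram has one bin per quantization level, pixel value and direction, and that an FV has two gradient components per Gaussian and per descriptor dimension. It then derives how many pyramid regions each configuration implies; the helper names are ours.

```python
def rl_dim(n_levels, n_regions, n_directions=4, n_colors=2):
    """Dimension of concatenated run-length histograms."""
    return n_levels * n_colors * n_directions * n_regions


def fv_dim(n_gaussians, descriptor_dim, n_regions=1):
    """Dimension of concatenated Fisher-Vectors (mean and std gradients)."""
    return 2 * n_gaussians * descriptor_dim * n_regions


D = 77 + 2 + 1   # PCA-reduced SIFT + (x, y) position + scale
TARGET = 40960   # common FV dimension used in the experiments

# Number of pyramid regions needed to reach the target for each vocabulary size.
regions_needed = {k: TARGET // fv_dim(k, D) for k in (4, 16, 256)}

# Number of regions implied by the 10648-dimensional RL feature with 11 levels.
rl_regions = 10648 // rl_dim(11, 1)
print(regions_needed, rl_regions)
```

The three FV configurations require 64, 16 and 1 regions respectively, and the RL feature corresponds to 121 regions in total across the five pyramid layers.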
Both the RL and FV were considered in the unsupervised part of the hybrid architecture (see Section 3.4). We refer to them as FV+MLP and RL+MLP respectively. Note that for each feature type (e.g. FV256 or FV16) we need to build a different hybrid model. In addition to the model proposed in  where the size of the original FV is first reduced with PCA (to 4096 dimensions), we also build hybrid models directly on the FV without PCA reduction. In this case we fix the first fully connected layer to a size of 4096 letting the hybrid model learn the dimensionality reduction. By default, results reported for our hybrid models do not include PCA. When we do, this is mentioned explicitly (FV+PCA+MLP). In the experiments exploring feature transferability (Section 6.2), we use the activation features corresponding to various fully connected layers of these models trained on the RVL-CDIP dataset (see details in Sections 3.4 and 4.2).
We consider two popular CNN architectures that were successfully used to classify natural images: AlexNet and GoogLeNet (see Section 3.3) denoted by CNN-A and CNN-G respectively. For both models, we initialize the CNN with the models (available online) trained on the ImageNet classification challenge dataset  (ILSVRC 2012), and fine-tune them on the RVL-CDIP dataset. We also conducted experiments where the models were directly trained on RVL-CDIP, but the results were 1-2% below the fine-tuned version. As above, for the feature transferability experiments (Part 2) we considered activation features corresponding to the models fine-tuned on the RVL-CDIP dataset (see details in Sections 3.3 and 4.2).
The experiments are divided into two parts. In the first part, Section 6.1, our experiments address large-scale document image classification using the RVL-CDIP dataset. The second part, Section 6.2, is devoted to our feature transfer experiments, where we explore how transferable different image representations, learned on the RVL-CDIP dataset, are to new datasets and tasks without any extra learning or fine-tuning of the parameters.
6.1 Part 1: Classification of documents from the same dataset
The first part of our experimental analysis focuses on the document image classification task. We benchmark the different feature representations introduced in Section 3 on the RVL-CDIP dataset. We followed the experimental protocol (train/val/test split and evaluation measure) suggested in . We used the validation set to choose both the classifier’s parameters (learning rate, number of iterations) as well as the model’s parameters (e.g. number of layers, drop-out level, etc.). First, we compare the best flavor of each descriptor type, to give a clear summary of the results, then we show deeper analyses for the different models.
6.1.1 Overall comparison
Table 2 summarizes top-1 accuracy on the RVL-CDIP dataset for the best version of each flavor of features that we consider in our benchmark. We can make the following observations.
First, we notice the strong performance of the CNN models. Both CNN-A and CNN-G outperform the other descriptors. The CNN-A results are consistent with state-of-the-art results on the RVL-CDIP dataset from  that reports 89.9% top-1 for its holistic AlexNet-based CNN. By using a better CNN architecture (GoogLeNet), we manage to improve over state-of-the-art results and obtain 90.7% top-1 accuracy.
Second, the hybrid architecture based on Fisher-Vectors (FV+MLP) yields a performance that is very close to CNN-A. This is an interesting observation as these models are much faster to train than the CNNs, and no GPU is required. More generally, we can observe the strong performance gain (+9.2% for RL and +4.2% for FV) that is brought by the hybrid architecture compared to these features used in their shallow version and combined with an SVM classifier.
Last and not surprisingly, these experiments confirm previous observations from  that FV features outperform RL features on the document image classification task (both using SVM and MLP).
6.1.2 Deeper Analysis
In this section, we study the parameters of the representations, showing that some of them play a crucial role in improving classification accuracy.
The vocabulary size for the FV
In Table 3 we compare the different FV representations whose visual vocabulary varies between 4 and 256 Gaussians in the GMM. For these experiments the grid structure of the spatial pyramid is adjusted to compare representations of equal length. These representations are either combined with an SVM classifier or used within the hybrid architecture (i.e. MLPs). We compare FV4 with a vocabulary of 4 Gaussians and an grid, FV16 with a vocabulary of 16 Gaussians and a grid, and finally FV256 with 256 Gaussians and no spatial pyramid. We can see that while FV256 and FV16 are on par when we use an SVM, the hybrid model built on FV16 performs slightly better. Also, we observe that FV4 performs worse than the other descriptors in both cases, showing that the vocabulary needs to be expressive enough. Therefore, we do not report further results with FV4.
Number of hidden layers in the hybrid architecture
We first look at the modified hybrid architecture that we proposed, i.e. the one that does not apply PCA to the FV representations in the unsupervised part. Table 4 compares several hybrid architectures. On top of either FV256 or FV16 we use a varying number of hidden layers, building increasingly deep architectures. We observe that even with a small number of layers, we obtain good performance. Moreover, even a single layer already achieves better results than the FV+SVM strategy. Note that all hidden layers have their size fixed to 4096 but we varied the dropout level. The best results were generally obtained with a dropout level of 30% or 40%.
Influence of PCA in the hybrid architecture
We modified the original hybrid model of  to remove the PCA projection and to integrate the dimensionality reduction in the first fully connected layer of the supervised part of the architecture. In that case, the input of the fully connected layer is the FV without PCA reduction. In the last row of Table 4, we compare the previous results with the original model (built on top of PCA-reduced FVs), still varying the number of hidden layers . We observe that except for the single layer case (), the proposed hybrid architecture, which does not perform PCA but discriminatively learns a dimension reduction, performs better.
6.2 Part 2: Transfer of features to different datasets and tasks
In this section we explore how transferable the models and the related features are to new datasets and tasks. We target three different tasks: i) retrieval, ii) clustering and iii) (non parametric) classification. For all the experiments, we assume that the models generating the features (except for RL that needs no extra model) have been trained on the RVL-CDIP dataset, and we apply them to one of the six remaining datasets.
For all three tasks, we randomly split the datasets in halves, the first set for training, and the second set for testing. This is done five times, and we report averaged results over the five splits. To assess the performance for a given split we proceed as follows.
Each test example is considered in turn as a query example and the documents in the training set are ranked according to their similarity to the query. As our features are L2 normalized, we used the dot product as the similarity measure for all features. To assess the retrieval performance, we use mean average precision (mAP). We also computed precision at 1 (P@1) and at 5 (P@5) by averaging the corresponding precision over all query examples, but as they exhibited similar behavior, we only report the mAP results.
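The retrieval protocol can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the experiments: the function names are hypothetical, features are assumed to be already L2 normalized, and a query with no relevant document in the training set is (by our choice) given an average precision of 0.

```python
def dot(u, v):
    """Dot product, used as similarity between L2-normalized features."""
    return sum(a * b for a, b in zip(u, v))


def average_precision(query_label, ranked_labels):
    """Mean of the precision values measured at each relevant rank."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: queries without any relevant document score 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(queries, query_labels, gallery, gallery_labels):
    """Rank the gallery for every query and average the per-query APs."""
    aps = []
    for q, q_label in zip(queries, query_labels):
        order = sorted(range(len(gallery)),
                       key=lambda i: dot(q, gallery[i]), reverse=True)
        aps.append(average_precision(q_label, [gallery_labels[i] for i in order]))
    return sum(aps) / len(aps)
```

Here the test half of a split plays the role of `queries` and the training half the role of `gallery`; the reported figure would then be the average of this value over the five splits.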
For each split, we cluster samples from the training set using hierarchical clustering with centroid linkage into as many clusters as there are classes in the dataset. To evaluate the quality of the clustering we consider three different measures: i) the adjusted mutual information between true class labels and cluster labels, ii) the adjusted Rand index, and iii) the V-measure (which is the weighted harmonic mean of homogeneity and completeness). As we observed similar trends for these three measures, we only report results with the adjusted mutual information (AMI). Note that other clustering algorithms and different numbers of clusters could have led to better performance; however, here we are not interested in the clustering algorithm itself, but in comparing the different features in a similar setting.
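Of the three measures, the V-measure is the simplest to write down, so we sketch it here for illustration; the reported AMI additionally corrects the mutual information for chance and is not shown. The code below is a plain-Python sketch with hypothetical function names, and the default `beta=1.0` (equal weight for homogeneity and completeness) is our assumption, not a setting stated in the text.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Empirical entropy H(labels)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, given):
    """Empirical conditional entropy H(labels | given)."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(classes, clusters, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    denominator = beta * homogeneity + completeness
    if denominator == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / denominator
```

A clustering that reproduces the classes up to a renaming of the cluster labels scores 1, while a clustering that is statistically independent of the classes scores 0.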
We consider the Nearest Class Mean (NCM) classifier in our classification experiments as it is a non-parametric classifier. In the case of the NCM classifier, each class is represented by the centroid (class mean) of its training examples and a test element is assigned to the class of the closest centroid. We report overall classification accuracy (number of correctly classified test documents divided by the number of test documents). We could have considered k-NN classifiers instead; however, as the retrieval accuracy P@1 is equivalent to the k-NN classification accuracy with k = 1, the retrieval experiments already give an idea of its behavior (see above).
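The NCM rule described above can be sketched in a few lines. This is an illustrative plain-Python version with hypothetical function names; the text only says "closest centroid", so the use of the squared Euclidean distance here is our assumption.

```python
def class_means(features, labels):
    """Centroid (mean feature vector) of the training examples of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def squared_distance(u, v):
    # Assumption: Euclidean geometry; the metric is not specified in the text.
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ncm_predict(x, means):
    """Label of the class whose centroid is closest to x."""
    return min(means, key=lambda y: squared_distance(x, means[y]))


def ncm_accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test documents assigned to their true class."""
    means = class_means(train_x, train_y)
    correct = sum(ncm_predict(x, means) == y for x, y in zip(test_x, test_y))
    return correct / len(test_x)
```

The only quantities estimated from the training half are the class means, which is why this classifier requires no extra learning on top of the transferred features.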
In what follows, we first explore the best performing models and parameter configurations of each feature type (shallow, deep, and hybrid), then we present overall comparisons, and we finally discuss the results for each dataset individually.
6.2.1 CNN features
It is very common to use the activations of a CNN model trained on one dataset as “off-the-shelf” features for another dataset . In this section, we compare activation features extracted from several layers, for both CNN architectures (see details in Sections 3.3 and 4.2). In the case of AlexNet (CNN-A), we consider the 3 most popular layers for that task: pool5, fc6 and fc7, which are the activation features of the last pooling layer and of the two fully connected layers respectively. Both fc6 and fc7 have 4096 dimensions, and the pool5 feature is 9216-dimensional. In the case of GoogLeNet, we consider activations from the inception layers i3a, i3b, i4a, i4e and i5b, which have 200704, 376320, 100352, 163072 and 50176 dimensions respectively. We also consider the activations from the average pooling layer that follows the last inception layer, denoted by p5s1, which has 1024 dimensions.
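The dimensions quoted above follow from flattening each activation map. The short check below is only a sketch: the (channels, height, width) shapes are our assumption, taken from the publicly described AlexNet and GoogLeNet architectures at their standard input resolution, and are not stated in this section.

```python
# Assumed (channels, height, width) of each activation map.
SHAPES = {
    "CNN-A pool5": (256, 6, 6),
    "CNN-G i3a": (256, 28, 28),
    "CNN-G i3b": (480, 28, 28),
    "CNN-G i4a": (512, 14, 14),
    "CNN-G i4e": (832, 14, 14),
    "CNN-G i5b": (1024, 7, 7),
    "CNN-G p5s1": (1024, 1, 1),
}


def flat_dim(shape):
    """Length of the feature vector obtained by flattening an activation map."""
    channels, height, width = shape
    return channels * height * width


for name, shape in SHAPES.items():
    print(name, flat_dim(shape))
```

The large sizes of the inception-layer features come from keeping the full spatial map, whereas p5s1 averages each channel over all positions.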
We report retrieval results in Table 5. In addition we also report clustering and classification results in Table 6 and Table 7 for the activation features best performing on the retrieval task. Based on these three tables we make the following observations.
|RL + MLP||100||34.9||36.4||66.5||67.4||60.7|
|FV256 + MLP||96.5||35.9||43.6||74.1||64.0||64.8|
|FV16 + MLP||99.7||32.1||44.0||76.4||66.3||65.7|
First, for all three tasks, the best results are in general obtained with inception layers i4a and i4e of the GoogLeNet network, except for the MARG and IH2 datasets, where the pool5 layer of AlexNet (CNN-A-p5) generally outperforms the different GoogLeNet activation features. This can be explained by the fact that the GoogLeNet features capture higher-level semantic information that is well aligned with the categories of the other datasets. On the other hand, the categories from MARG and IH2 are more correlated with the layout than with the document semantics, and the pool5 convolutional layer of AlexNet (CNN-A-p5) better captures the local geometry. Surprisingly, CNN-A-p5 significantly outperforms CNN-A-fc6 and CNN-A-fc7 on all datasets, not only on MARG, meaning that the latter features do not transfer well in the context of document images. One explanation might be the low number of classes (16) in the RVL-CDIP dataset used to train the models.
|RL + MLP||99.2||6.2||26.7||58.4||52.5||45.5|
|FV256 + MLP||94.2||6.5||35.6||67.7||45.0||63.3|
|FV16 + MLP||97.7||9.1||39.1||74.6||54.9||63.5|
6.2.2 Shallow and hybrid features
For these experiments, we consider the RunLength descriptor with spatial pyramid (RL) and the two Fisher-Vector-based descriptors with 16 and 256 Gaussians (FV16 and FV256 respectively), both without and with the corresponding hybrid architecture (MLP). In addition, we consider the PCA-projected FV256, both as a shallow feature and through the activation features of its hybrid architecture.
In all cases, we select the MLP model that performs best on the RVL-CDIP validation set (see Section in 6.2) and use the activation values from the fully connected layers as feature representations, similarly to what is usually done when using CNN models as “off-the-shelf” features. By design, all these descriptors are 4096-dimensional. When using them in our three target applications, we observe that in most cases the activation features corresponding to the first fully connected layer outperform the activation features of the following layers. Therefore we decided to only show results obtained with the first fully connected layer.
We show results both with the shallow features and the corresponding hybrid features in Table 8 for retrieval, Table 9 for clustering, and Table 10 for categorization. Best results per dataset are shown in bold.
We observe that, unlike for the classification task, the advantage of hybrid architectures is less obvious when they are used in transfer. To make this easier to observe from the tables, we underline the cases where the hybrid activation feature outperforms its corresponding shallow feature. The results are somewhat mixed depending on features, tasks and datasets. The activation feature of the hybrid model learned on RL is almost always better than the original RL feature. In the case of FV16 and FV256, the hybrid model sometimes brings a gain (especially on clustering results), but in other cases using the shallow features directly performs better. If we consider the PCA-reduced FV256, using the hybrid model most often degrades the performance.
Overall, best retrieval results and most often best NCM classification accuracies are obtained with FV256+PCA. Concerning clustering, FV16+MLP significantly outperforms FV256+PCA for three datasets out of six.
|RL + MLP||100||53.4||63.1||91.4||79.1||84.9|
6.2.3 Comparing shallow and deep features
Finally, in Table 11, we summarize all the best results obtained with shallow, deep, and hybrid features and analyze them dataset per dataset.
|FV16 + MLP||99.7||32.1||44.0||76.4||66.3||65.7|
|FV16 + MLP||97.7||9.1||39.1||74.6||54.9||63.5|
|FV16 + MLP||100||53.7||64.5||93.1||84.1||88.4|
This dataset is much easier than the other ones, as the appearance is very consistent within a given category, and categories are well aligned with specific templates. Consequently, most methods perform really well on all tasks.
This dataset is much more challenging as category labels were defined based on specific aspects of the document layout (such as the presence and the location of the title, affiliation or the abstract), while other aspects of the layout are totally ignored (e.g. the number of columns in the document). Consequently, there is large intra-class variation and visually similar documents can belong to different categories. This is illustrated in Figure 10. The left part of the figure displays visually dissimilar documents from the same category, while the right part displays a cluster of visually similar documents that belong to different categories (each document represents one of the classes). This could explain the very low clustering results obtained, independently of the visual feature used. Regarding the retrieval and NCM classification tasks, the best results are obtained with pool5 activation features from AlexNet (CNN-A-p5); however, the performance obtained with FV256+PCA is close to these results and better than the results obtained with GoogLeNet activation features.
This dataset departs from the others in two aspects. First, the size and aspect ratio of the images vary a lot, which might have a strong impact on CNN representations that use a fixed size and aspect ratio as input. Second, there is a very large intra-class variability and the document layout has little or even no importance in the category definition (see examples in Figure 8). This might explain why FV256+PCA outperforms CNN activation features by a large margin; the former are explicitly designed to work with geometry-less bags of local features, and consequently they better capture local information regardless of its position (see e.g. flowchart components or mathematical symbols in formulas from Figure 8). Qualitative results can be seen in Figures 11 and 12. These figures display randomly chosen queries and the corresponding top retrieval results obtained with FV256+PCA and CNN-G-i4e respectively.
This dataset is probably the one most similar to the RVL-CDIP dataset, on which the feature representations have been trained. Indeed, both datasets share classes, such as invoices, contracts, etc. However, the IH1 dataset also consists of sub-classes (e.g. invoice type 1 and invoice type 2). On this dataset, for the classification task, the different methods obtain similar accuracies, CNN-G-i4e features being the best. This feature also yields the best clustering performance but is outperformed on the retrieval task by FV256+PCA. The relatively good and similar performance obtained with the CNN and hybrid activation features is probably due to the closeness between the classes and images of the RVL-CDIP dataset, used to train the models, and those of the IH1 dataset.
This is a small fine-grained dataset (72 categories) where the document layout (e.g. “page with two tables, one on the top and one on the bottom”) plays a crucial role in the category definition. This property seems to have been better captured by CNN activation features, which keep geometric information, than by FV256+PCA or FV16+MLP, which are less dependent on the layout. The importance of capturing the geometry for this dataset can also be seen through a deeper analysis of Tables 8, 9 and 10, where we can see that FV16 (with its spatial grid) outperforms FV256. It can also be seen on retrieval and clustering, where even RL (with a 5-layer spatial pyramid) outperforms FV256. In summary, on this dataset, there is no obvious best performing feature: CNN-A-p5 performs the best for clustering and NCM classification, but it is outperformed by both CNN-G-i4a and CNN-G-i4e on the retrieval task.
The last in-house dataset contains 63 fine-grained categories such as diverse forms, invoices, contracts where each form/contract/invoice coming from a different customer corresponds to a different class. It can be seen as a mix between NIST (as some classes are variations of templates) and IH1 (other classes are more generic with intra-class variations, and we have also sub-classes for several of them). On this dataset best or close to best results were obtained with the CNN-G-i4e activation features of the GoogLeNet. We show some retrieval examples in Figure 13 where we compare the top results for this feature with the top results obtained with FV256+PCA. Note that the class label differences often come from the fact that the document belongs to different customers, which explains that while most retrieved documents are of the same generic type as the query (e.g. drawing, handwritten letter, printed code) not all of them are considered as relevant to the query (provided by different customers they belong to different classes).
This paper proposes a detailed benchmark that compares three types of document image representation: so-called shallow features, such as the RunLength and the Fisher-Vector descriptors, deep features based on Convolutional Neural Networks, and features extracted from hybrid architectures that take inspiration from the two previous ones. Our benchmark first compares these features on a classification task where the training and testing sets belong to the same domain. It also compares these features when used to represent documents from other domains, for three different tasks, in order to quantify how much these different document image representations generalize across datasets and tasks.
We observed that without domain shift, Convolutional Neural Network features perform better than shallow and hybrid features, closely followed by hybrid architectures that perform almost as well for a fraction of the training cost. This had already been observed for natural images, and we confirmed this observation for document images.
In the presence of a domain shift, the story changes quite significantly. Independently of the targeted task (we considered retrieval, clustering, and classification), the hybrid architectures do not transfer well in general across datasets. Instead, deep or shallow features are the best, depending on the dataset specificities. On the one hand, Convolutional Neural Networks seem to perform the best for target datasets that are not too different from the source dataset, and for datasets for which the global layout is important. On the other hand, PCA-reduced FVs appear to deal better with strong aspect-ratio changes and very large intra-class variability in the document layout.
-  Artem Babenko, Anton Slesarev, Alexander Chigorin, and Victor S. Lempitsky. Neural codes for image retrieval. In ECCV, pages 584–599, 2014.
-  A.D. Bagdanov and M. Worring. Multiscale document description using rectangular granulometries. International Journal on Document Analysis and Recognition, 6:181–191, 2004.
-  Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded Up Robust Features. In ECCV, 2006.
-  L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pages 177–186, 2010.
-  Y.-K. Chan and C.-C. Chang. Image matching using run-length feature. Pattern Recognition Letters, 22:447–455, 2001.
-  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: delving deep into convolutional nets. In BMVC, 2014.
-  N. Chen and D. Blostein. A survey of document image classification: problem statement, classifier architecture and performance evaluation. International Journal on Document Analysis and Recognition, 10:1–16, 2007.
-  G. Csurka. Document image classification, with a specific view on applications of patent images. CoRR, arXiv:1601.03295, 2016.
-  G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning for Computer Vision, volume 1, pages 1–2, 2004.
-  J.F. Cullen, J.J. Hull, and P.E. Hart. Document image database retrieval and browsing using texture analysis. In ICDAR, volume 2, pages 718–721, 1997.
-  D. L. Dimmick, M. D. Garris, and C. L. Wilson. Structured forms database. Technical report, SFRS, National Institute of Standards and Technology, 1991.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014.
-  G. Ford and G.R. Thoma. Ground truth data for document image analysis. In Symposium on Document Image Understanding and Technology, 2003.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
-  A. Gordo. Document Image Representation, Classification and Retrieval in Large-Scale Domains. PhD thesis, Computer Vision Center, Universitat Autònoma de Barcelona, 2013.
-  A. Gordo and F. Perronnin. A bag-of-pages approach to unordered multi-page document classification. In ICPR, 2010.
-  A Gordo, F Perronnin, and E Valveny. Document classification using multiple views. In DAS, pages 33–37, 2012.
-  A. Gordo, F. Perronnin, and E. Valveny. Large-scale document image retrieval and classification with runlength histograms and binary embeddings. Pattern Recognition, 46(7):1898–1905, 2013.
-  A. Gordo, M. Rusinol, D. Karatzas, and A.D. Bagdanov. Document classification and page stream segmentation for digital mailroom applications. In ICDAR, pages 621–625, 2013.
-  A. Harley, A. Ufkes, and K. Derpanis. Evaluation of deep convolutional nets for document image classification and retrieval. In ICDAR, pages 991–995, 2015.
-  P. Heroux, S. Diana, A. Ribert, and E. Trupin. Classification method study for automatic form class identification. In ICPR, volume 1, pages 926–928, Aug 1998.
-  L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
-  G. Joutel, V. Eglin, S. Bres, and H. Emptoz. Curvelets Based Queries for CBIR Application in Handwriting Collections. In ICDAR, volume 2, pages 649–653, 2007.
-  L. Kang, J. Kumar, P. Ye, Y. Li, and D. Doermann. Convolutional neural networks for document image classification. In ICPR, pages 3168–3172, 2014.
-  D. Keysers, F. Shafait, and T.M. Breuel. Document image zone classification - a simple high-performance approach. In VISAPP, pages 44–51, 2007.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
-  Jayant Kumar, Peng Ye, and David Doermann. Structural similarity for document image classification and retrieval. Pattern Recognition Letters, 43:119–126, 2014.
-  S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pages 2169–2178, 2006.
-  Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
-  D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard. Building a test collection for complex document information processing. In SIGIR, pages 665–666, 2006.
-  D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
-  T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Distance-Based Image Classification: Generalizing to new classes at near-zero cost. Transactions on Pattern Analysis and Machine Intelligence, 35(11):2624–2637, 2013.
-  M. Rusiñol, V. Frinken, D. Karatzas, A.D. Bagdanov, and J. Llados. Multimodal page classification in administrative document image streams. International Journal on Document Analysis and Recognition, 17:331–341, 2014.
-  M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, pages 1717–1724, 2014.
-  F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, pages 1–8, 2007.
-  F. Perronnin and D. Larlus. Fisher vectors meet neural networks: A hybrid classification architecture. In CVPR, pages 3743–3752, 2015.
-  F. Perronnin, J. Sánchez, and Thomas Mensink. Improving the fisher kernel for large-scale image classification. In ECCV, pages 143–156, 2010.
-  F. Piroi, M. Lupu, A. Hanbury, and V. Zenz. CLEF-IP 2011: Retrieval in the Intellectual Property Domain. In Intellectual Property Evaluation Campaign (CLEF-IP), 2011.
-  I. Pratikakis, B. Gatos, and K. Ntirogiannis. ICFHR 2012 competition on handwritten document image binarization. In ICFHR, 2012.
-  Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In CVPR Deep Vision Workshop, 2014.
-  K. V. Umamaheswara Reddy and Venu Govindaraju. Form classification. In SPIE Document Recognition and Retrieval, volume 6815, pages 1–6, 2008.
-  A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL, pages 410–420, 2007.
-  M. Rusinol, D. Karatzas, A.D. Bagdanov, and J. Llados. Multipage document retrieval by textual and visual representations. In ICPR, 2012.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  P. Sarkar. Image classification: Classifying distributions of visual features. In ICPR, volume 2, pages 472–475, 2006.
-  C. Shin, D. Doermann, and A. Rosenfeld. Classification of document pages using structure-based features. International Journal on Document Analysis and Recognition, 3:232–247, 2001.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
-  J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In ICCV, 2003.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
-  Vladimir Vapnik. Statistical learning theory. Wiley-Interscience, 1998.
-  Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In ICML, pages 1073–1080, 2009.
-  M.D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.