Image classification and annotation are two important tasks in computer vision. In image classification, one tries to describe the image globally with a single descriptive label (such as coast, outdoor, inside city, etc.), while annotation focuses on tagging the local content within the image (such as whether it contains “sky”, a “car”, a “tree”, etc.). Since these two problems are related, it is natural to attempt to solve them jointly. For example, an image labeled as street is more likely to be annotated with “car”, “pedestrian” or “building” than with “beach” or “sea water”. Although there has been a lot of work on image classification and annotation separately, less work has looked at solving these two problems simultaneously.
Work on image classification and annotation is often based on a topic model, the most popular being latent Dirichlet allocation, or LDA. LDA is a generative model for documents that originates from the natural language processing community but that has had great success in computer vision for scene modeling [1, 2]. LDA models a document as a multinomial distribution over topics, where a topic is itself a multinomial distribution over words. While the distribution over topics is specific to each document, the topic-dependent distributions over words are shared across all documents. Topic models can thus extract a meaningful, semantic representation from a document by inferring its latent distribution over topics from the words it contains. In the context of computer vision, LDA can be used by first extracting so-called “visual words” from images, converting the images into visual word documents and training an LDA topic model on the bags of visual words. Image representations learned with LDA have been used successfully for many computer vision tasks such as visual classification [3, 4], annotation [5, 6] and image retrieval [7, 8].
Although the original LDA topic model was proposed as an unsupervised learning method, supervised variants of LDA have been proposed[9, 2]. By modeling both the documents’ visual words and their class labels, the discriminative power of the learned image representations could thus be improved.
At the heart of most topic models is a generative story in which the image’s latent representation is generated first and the visual words are subsequently produced from this representation. The appeal of this approach is that the task of extracting the representation from observations is easily framed as a probabilistic inference problem, for which many general purpose solutions exist. The disadvantage, however, is that as a model becomes more sophisticated, inference becomes less trivial and more computationally expensive. In LDA for instance, inference of the distribution over topics does not have a closed-form solution and must be approximated, either using variational approximate inference or MCMC sampling. Yet, the model is actually relatively simple, making certain simplifying independence assumptions such as the conditional independence of the visual words given the image’s latent distribution over topics.
Recently, an alternative generative modeling approach for documents was proposed by Larochelle and Lauly. Their model, the Document Neural Autoregressive Distribution Estimator (DocNADE), directly models the joint distribution of the words in a document, by decomposing it through the probability chain rule as a product of conditional distributions and modeling each conditional using a neural network. Hence, DocNADE doesn’t incorporate any latent random variables over which potentially expensive inference must be performed. Instead, a document representation can be computed efficiently in a simple feed-forward fashion, using the value of the neural network’s hidden layer. Larochelle and Lauly also show that DocNADE is a better generative model of text documents and can extract a useful representation for text information retrieval.
In this paper, we consider the application of DocNADE in the context of computer vision. More specifically, we propose a supervised variant of DocNADE (SupDocNADE), which models the joint distribution over an image’s visual words, annotation words and class label. The model is illustrated in Figure 1. We investigate how to successfully incorporate spatial information about the visual words and highlight the importance of calibrating the generative and discriminative components of the training objective. Our results confirm that this approach can outperform the supervised variant of LDA and is a competitive alternative for scene modeling.
2 Related Work
Simultaneous image classification and annotation is often addressed using models extending the basic LDA topic model. Wang et al. proposed a supervised LDA formulation to tackle this problem. Wang and Mori opted instead for a maximum margin formulation of LDA (MMLDA). Our work also belongs to this line of work, extending topic models to a supervised computer vision problem: our contribution is to extend a different topic model, DocNADE, to this context.
What distinguishes DocNADE from other topic models is its reliance on a neural network architecture. Neural networks are increasingly used for the probabilistic modeling of images (see Bengio et al. for a review). In the realm of document modeling, Salakhutdinov and Hinton proposed a Replicated Softmax model for bags of words. DocNADE is in fact inspired by that model and was shown to improve over its performance while being much more computationally efficient. Wan et al. also proposed a hybrid model that combines LDA and a neural network. They applied their model to scene classification only, outperforming approaches based on LDA or on a neural network only. In our experiments, we show that our approach outperforms theirs. Generally speaking, we are not aware of any other work which has considered the problem of jointly classifying and annotating images using a hybrid topic model/neural network approach.
3 Document NADE
In this section, we describe the original DocNADE model. In Larochelle and Lauly, DocNADE was used to model documents of real words, belonging to some predefined vocabulary. To model image data, we assume that images have first been converted into a bag of visual words. A standard approach is to learn a vocabulary of visual words by performing $k$-means clustering on SIFT descriptors densely extracted from all training images. See Section 5.2 for more details about this procedure. From that point on, any image can thus be represented as a bag of visual words $\mathbf{v} = [v_1, v_2, \ldots, v_D]$, where each $v_i$ is the index of the closest $k$-means cluster to the $i$-th SIFT descriptor extracted from the image and $D$ is the number of extracted descriptors.
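The conversion of descriptors into visual word indices can be sketched as follows. This is a minimal illustration rather than the paper's pipeline: the function names, the toy 2-dimensional descriptors and the hand-picked centroids are all assumptions (real SIFT descriptors are 128-dimensional and the centroids would come from clustering the training descriptors).

```python
def nearest_cluster(descriptor, centroids):
    """Index of the centroid closest (squared Euclidean distance) to a descriptor."""
    best_index, best_dist = 0, float("inf")
    for k, centroid in enumerate(centroids):
        dist = sum((x - m) ** 2 for x, m in zip(descriptor, centroid))
        if dist < best_dist:
            best_index, best_dist = k, dist
    return best_index


def to_visual_words(descriptors, centroids):
    """Map each descriptor to its visual word index, giving the bag of visual words."""
    return [nearest_cluster(d, centroids) for d in descriptors]


# Toy data (hypothetical): three cluster centres and four descriptors.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
descriptors = [[0.1, 0.2], [0.9, 1.2], [0.1, 1.9], [0.8, 0.7]]
v = to_visual_words(descriptors, centroids)
```

Here `v` plays the role of the bag of visual words, with one index per extracted descriptor.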
DocNADE models the joint probability of the visual words $p(\mathbf{v})$ by rewriting it as
$$p(\mathbf{v}) = \prod_{i=1}^{D} p(v_i \mid \mathbf{v}_{<i}) \tag{1}$$
and modeling instead each conditional $p(v_i \mid \mathbf{v}_{<i})$, where $\mathbf{v}_{<i}$ is the subvector containing all $v_j$ such that $j < i$. Notice that Equation 1 is true for any distribution, based on the probability chain rule. Hence, the main assumption made by DocNADE is in the form of the conditionals. Specifically, DocNADE assumes that each conditional can be modeled and learned by a feedforward neural network.
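One possibility would be to model $p(v_i \mid \mathbf{v}_{<i})$ with the following architecture:
$$\mathbf{h}_i(\mathbf{v}_{<i}) = g\Big(\mathbf{c} + \sum_{k<i} \mathbf{W}_{:,v_k}\Big), \qquad p(v_i = w \mid \mathbf{v}_{<i}) = \frac{\exp\big(b_w + \mathbf{V}_{w,:}\,\mathbf{h}_i(\mathbf{v}_{<i})\big)}{\sum_{w'} \exp\big(b_{w'} + \mathbf{V}_{w',:}\,\mathbf{h}_i(\mathbf{v}_{<i})\big)} \tag{2}$$
where $g(\cdot)$ is an element-wise non-linear activation function, $\mathbf{W} \in \mathbb{R}^{H \times K}$ and $\mathbf{V} \in \mathbb{R}^{K \times H}$ are the connection parameter matrices, $\mathbf{c} \in \mathbb{R}^{H}$ and $\mathbf{b} \in \mathbb{R}^{K}$ are bias parameter vectors and $H$, $K$ are the number of hidden units (topics) and vocabulary size, respectively.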
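As a rough illustration of this softmax-based conditional, the sketch below computes the hidden layer from the previous words and then a full distribution over the vocabulary. The function names, the toy parameter values and the choice of a rectifier for the activation are assumptions made for the example; matrices are stored as plain nested lists, with `W[j][w]` holding the weight between hidden unit `j` and word `w`.

```python
import math


def hidden_layer(prev_words, W, c):
    """Hidden layer for one position: g(c + sum of the W columns of the previous words).

    g is taken to be a rectifier here (an assumption for this sketch).
    """
    pre = list(c)
    for w in prev_words:
        for j in range(len(c)):
            pre[j] += W[j][w]
    return [max(0.0, a) for a in pre]


def softmax_conditional(prev_words, W, c, V, b):
    """Distribution over all K words given the previous words; cost is O(K * H)."""
    h = hidden_layer(prev_words, W, c)
    scores = [b[w] + sum(V[w][j] * h[j] for j in range(len(h))) for w in range(len(b))]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters (hypothetical): H = 2 hidden units, K = 3 words.
W = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
c = [0.0, 0.1]
V = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]
b = [0.0, 0.0, 0.0]
probs = softmax_conditional([0, 2], W, c, V, b)
```

The loop over all `K` words in `softmax_conditional` is what makes this variant linear in the vocabulary size.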
Computing the distribution of Equation 2 requires time linear in $K$. In practice, this is too expensive, since it must be computed for each of the $D$ visual words $v_i$. To address this issue, Larochelle and Lauly propose to use a balanced binary tree to decompose the computation of the conditionals and obtain a complexity logarithmic in $K$. This is achieved by randomly assigning all visual words to a different leaf in a binary tree. Given this tree, the probability of a word is modeled as the probability of reaching its associated leaf from the root. We model each left/right transition probability in the binary tree using a set of binary logistic regressors taking the hidden layer $\mathbf{h}_i(\mathbf{v}_{<i})$ as input. The probability of a given word can then be obtained by multiplying the probabilities of each left/right choice along the associated tree path.
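Specifically, let $\mathbf{l}(v_i)$ be the sequence of tree nodes on the path from the root to the leaf of $v_i$ and let $\boldsymbol{\pi}(v_i)$ be the sequence of binary left/right choices at the internal nodes along that path. For example, $l(v_i)_1$ will always be the root node of the binary tree, and $\pi(v_i)_1$ will be 0 if the word leaf $v_i$ is in the left subtree or 1 otherwise. Let $\mathbf{V} \in \mathbb{R}^{T \times H}$ now be the matrix containing the logistic regression weights and $\mathbf{b} \in \mathbb{R}^{T}$ be a vector containing the biases, where $T$ is the number of inner nodes in the binary tree and $H$ is the number of hidden units. The probability $p(v_i = w \mid \mathbf{v}_{<i})$ is now modeled as
$$p(v_i = w \mid \mathbf{v}_{<i}) = \prod_{m=1}^{|\boldsymbol{\pi}(v_i)|} p\big(\pi(v_i)_m \mid \mathbf{v}_{<i}\big) \tag{3}$$
where
$$p\big(\pi(v_i)_m = 1 \mid \mathbf{v}_{<i}\big) = \mathrm{sigm}\big(b_{l(v_i)_m} + \mathbf{V}_{l(v_i)_m,:}\,\mathbf{h}_i(\mathbf{v}_{<i})\big) \tag{4}$$
are the internal node logistic regression outputs and $\mathrm{sigm}(x) = 1/(1+\exp(-x))$ is the sigmoid function. By using a balanced tree, we are guaranteed that computing Equation 3 involves only $O(\log K)$ logistic regression outputs. One could attempt to optimize the organization of the words within the tree, but a random assignment of the words to leaves works well in practice.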
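The tree decomposition can be sketched as below. The heap-style node numbering, the restriction to a vocabulary size that is a power of two, and all names and toy values are assumptions made to keep the example short; only the idea (one logistic regression per internal node on the root-to-leaf path, multiplied together) comes from the text above.

```python
import math
import random


def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))


def tree_path(leaf_pos, K):
    """Internal nodes and left/right choices (0 = left, 1 = right) from root to a leaf.

    Assumes a complete binary tree with K leaves (K a power of two) in heap layout:
    internal nodes are 0..K-2 and the children of node n are 2n+1 and 2n+2.
    """
    node = (K - 1) + leaf_pos
    nodes, choices = [], []
    while node > 0:
        parent = (node - 1) // 2
        nodes.append(parent)
        choices.append(0 if node == 2 * parent + 1 else 1)
        node = parent
    return nodes[::-1], choices[::-1]


def word_prob(word, h, V, b, leaf_of):
    """Probability of a word as the product of the branch probabilities on its path."""
    nodes, choices = tree_path(leaf_of[word], len(leaf_of))
    p = 1.0
    for n, choice in zip(nodes, choices):
        p_right = sigm(b[n] + sum(V[n][j] * h[j] for j in range(len(h))))
        p *= p_right if choice == 1 else 1.0 - p_right
    return p


# Toy setup (hypothetical): K = 8 words, H = 3 hidden units, T = K - 1 internal nodes.
rng = random.Random(0)
K, H = 8, 3
leaf_of = list(range(K))
rng.shuffle(leaf_of)  # random assignment of words to leaves
V = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(K - 1)]
b = [rng.uniform(-1, 1) for _ in range(K - 1)]
h = [0.5, 0.0, 1.2]
total = sum(word_prob(w, h, V, b, leaf_of) for w in range(K))
```

Each call to `word_prob` touches only the three internal nodes on the path for `K = 8`, and the probabilities of all leaves still sum to one.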
To train the parameters $\theta = \{\mathbf{W}, \mathbf{c}, \mathbf{V}, \mathbf{b}\}$ of DocNADE, we simply minimize the average negative log-likelihood of the training set documents using stochastic gradient descent. Once the model is trained, a latent representation can be extracted from a new document $\mathbf{v}^*$ as follows:
$$\mathbf{h}_y(\mathbf{v}^*) = g\Big(\mathbf{c} + \sum_{i=1}^{D} \mathbf{W}_{:,v^*_i}\Big) \tag{5}$$
This representation could be fed to a standard classifier to perform any supervised computer vision task. The index $y$ is used to highlight that it is the representation used to predict the class label $y$ of the image.
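Equations 3 and 4 indicate that the conditional probability of each word $v_i$ requires computing the position-dependent hidden layer $\mathbf{h}_i(\mathbf{v}_{<i})$, which extracts a representation out of the bag of previous visual words $\mathbf{v}_{<i}$. Since computing $\mathbf{h}_i(\mathbf{v}_{<i})$ is in $O(HD)$ on average, and there are $D$ hidden layers to compute, a naive procedure for computing all hidden layers would be in $O(D^2 H)$.
However, noticing that
$$\mathbf{h}_{i+1}(\mathbf{v}_{<i+1}) = g\Big(\mathbf{c} + \sum_{k<i+1} \mathbf{W}_{:,v_k}\Big) = g\Big(\mathbf{W}_{:,v_i} + \mathbf{c} + \sum_{k<i} \mathbf{W}_{:,v_k}\Big) \tag{6}$$
and exploiting the fact that the weight matrix $\mathbf{W}$ is the same across all conditionals, the linear transformation $\mathbf{c} + \sum_{k<i} \mathbf{W}_{:,v_k}$ can be reused from the computation of the previous hidden layer $\mathbf{h}_i(\mathbf{v}_{<i})$ to compute $\mathbf{h}_{i+1}(\mathbf{v}_{<i+1})$. With this procedure, computing all hidden layers sequentially from $i = 1$ to $i = D$ becomes in $O(DH)$.
Finally, since the computational complexity of each of the $O(\log K)$ logistic regressions in Equation 3 is $O(H)$, the total complexity of computing $p(\mathbf{v})$ is $O(\log(K)\,DH)$. In practice, the document length $D$ and the number of hidden units $H$ tend to be small, while $\log(K)$ will be small even for large vocabularies. Thus DocNADE can be used and trained efficiently.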
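The saving obtained by reusing the running pre-activation can be illustrated with the following sketch, which computes all position-dependent hidden layers both naively and incrementally. Function names, the rectifier activation and the toy values are assumptions for the example.

```python
def hidden_layers_naive(words, W, c):
    """Recompute the sum over previous words at every position: O(D^2 * H)."""
    layers = []
    for i in range(len(words)):
        pre = list(c)
        for w in words[:i]:
            for j in range(len(c)):
                pre[j] += W[j][w]
        layers.append([max(0.0, a) for a in pre])
    return layers


def hidden_layers_fast(words, W, c):
    """Keep a running pre-activation and add one column of W per step: O(D * H)."""
    pre = list(c)
    layers = []
    for w in words:
        layers.append([max(0.0, a) for a in pre])  # uses only the words before this one
        for j in range(len(c)):
            pre[j] += W[j][w]  # fold the current word in for the next position
    return layers


# Toy parameters (hypothetical): H = 2 hidden units, K = 3 words, D = 4 positions.
W = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
c = [0.0, 0.1]
words = [2, 0, 1, 2]
naive = hidden_layers_naive(words, W, c)
fast = hidden_layers_fast(words, W, c)
```

Both functions return the same list of hidden layers; only the amount of repeated work differs.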
4 SupDocNADE for Image Classification and Annotation
In this section, we describe the approach of this paper, inspired by DocNADE, to simultaneously classify and annotate image data. First, we describe a supervised extension of DocNADE (SupDocNADE), which incorporates class label information into training to learn more discriminative hidden features for classification. Then we describe how we exploit the spatial position information of the visual words. Finally, we describe how to also perform annotation, along with classification, using SupDocNADE.
4.1 Supervised DocNADE
It has been observed that learning image feature representations using unsupervised topic models such as LDA can perform worse than training a classifier directly on the visual words themselves, using an appropriate kernel such as a pyramid kernel. One reason is that the unsupervised topic features are trained to explain as much of the entire statistical structure of images as possible and might not model well the particular discriminative structure we are after in our computer vision task. This issue has been addressed in the literature by devising supervised variants of LDA, such as Supervised LDA or sLDA. DocNADE also being an unsupervised topic model, we propose here a supervised variant of DocNADE, SupDocNADE, in an attempt to make the learned image representation more discriminative for the purpose of image classification.
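Specifically, given an image $\mathbf{v} = [v_1, v_2, \ldots, v_D]$ and its class label $y \in \{1, \ldots, C\}$, SupDocNADE models the full joint distribution as
$$p(\mathbf{v}, y) = p(y \mid \mathbf{v}) \prod_{i=1}^{D} p(v_i \mid \mathbf{v}_{<i}) \tag{7}$$
As in DocNADE, each conditional is modeled by a neural network. We use the same architecture for $p(v_i \mid \mathbf{v}_{<i})$ as in regular DocNADE. We now only need to define the model for $p(y \mid \mathbf{v})$.
Since $\mathbf{h}_y(\mathbf{v})$ is the image representation that we’ll use to perform classification, we propose to model $p(y \mid \mathbf{v})$ as a multiclass logistic regression output computed from $\mathbf{h}_y(\mathbf{v})$:
$$p(y \mid \mathbf{v}) = \mathrm{softmax}\big(\mathbf{d} + \mathbf{U}\,\mathbf{h}_y(\mathbf{v})\big)_y \tag{8}$$
where $\mathrm{softmax}(\mathbf{a})_i = \exp(a_i) / \sum_{j=1}^{C} \exp(a_j)$, $\mathbf{d} \in \mathbb{R}^{C}$ is the bias parameter vector in the supervised layer and $\mathbf{U} \in \mathbb{R}^{C \times H}$ is the connection matrix between the hidden layer $\mathbf{h}_y$ and the class label.
Put differently, $p(y \mid \mathbf{v})$ is modeled as a regular multiclass neural network, taking as input the bag of visual words $\mathbf{v}$. The crucial difference with a regular neural network, however, is that some of its parameters (namely the hidden unit parameters $\mathbf{W}$ and $\mathbf{c}$) are also used to model the visual word conditionals $p(v_i \mid \mathbf{v}_{<i})$.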
Maximum likelihood training of this model is performed by minimizing the negative log-likelihood
$$-\log p(\mathbf{v}, y) = -\log p(y \mid \mathbf{v}) + \sum_{i=1}^{D} -\log p(v_i \mid \mathbf{v}_{<i}) \tag{9}$$
averaged over all training images. This is known as generative learning. The first term is a purely discriminative term, while the second is unsupervised and can be understood as a regularizer that encourages a solution which also explains the unsupervised statistical structure within the visual words. In practice, this regularizer can bias the solution too strongly away from a more discriminative solution that generalizes well. Hence, similarly to previous work on hybrid generative/discriminative learning, we propose instead to weight the importance of the generative term
$$L(\mathbf{v}, y; \theta) = -\log p(y \mid \mathbf{v}) + \lambda \sum_{i=1}^{D} -\log p(v_i \mid \mathbf{v}_{<i}) \tag{10}$$
where $\lambda$ is treated as a regularization hyper-parameter.
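The weighted objective can be sketched as a small function that combines a class log-probability with the per-word log-probabilities. The names `class_log_prob`, `hybrid_loss` and `lam`, as well as the toy inputs, are assumptions for this illustration; the word log-probabilities would in practice come from the tree-based conditionals.

```python
import math


def class_log_prob(h, U, d, y):
    """Log of the softmax output for class y, computed from the representation h."""
    scores = [d[k] + sum(U[k][j] * h[j] for j in range(len(h))) for k in range(len(d))]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[y] - log_norm


def hybrid_loss(log_p_y, word_log_probs, lam):
    """Discriminative term plus lam times the generative (unsupervised) term."""
    return -log_p_y + lam * sum(-lp for lp in word_log_probs)


# Toy values (hypothetical): two classes, and two words with known conditionals.
log_p_y = class_log_prob([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0], 0)
word_log_probs = [math.log(0.25), math.log(0.5)]
generative = hybrid_loss(log_p_y, word_log_probs, lam=1.0)      # full negative log-likelihood
discriminative = hybrid_loss(log_p_y, word_log_probs, lam=0.0)  # class term only
```

Setting `lam` to 1 recovers the plain negative log-likelihood, while setting it to 0 keeps only the class term.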
Training on the training set average of Equation 10 is performed by stochastic gradient descent, using backpropagation to compute the parameter derivatives. As in regular DocNADE, computation of the training objective and its gradient requires that we define an ordering of the visual words. Though we could have defined an arbitrary path across the image to order the words (e.g. from left to right, top to bottom in the image), we follow Larochelle and Lauly and randomly permute the words before every stochastic gradient update. The implication is that the model is effectively trained to be a good inference model of any conditional $p(v_i \mid \mathbf{v}_{<i})$, for any ordering of the words in $\mathbf{v}$. This again helps fight overfitting and better regularizes our model.
In our experiments, we used the rectified linear function as the activation function
$$g(\mathbf{a}) = \max(0, \mathbf{a}) = [\max(0, a_1), \ldots, \max(0, a_H)] \tag{11}$$
which often outperforms other activation functions and has been shown to work well for image data. Since this is a piece-wise linear function, the (sub-)gradient with respect to its input, needed by backpropagation to compute the parameter gradients, is simply
$$1_{g(\mathbf{a}) > 0} = [1_{g(a_1) > 0}, \ldots, 1_{g(a_H) > 0}] \tag{12}$$
where $1_P$ is 1 if $P$ is true and 0 otherwise. Algorithms 1 and 2 give pseudocode for efficiently computing the joint distribution and the parameter gradients of Equation 10 required for stochastic gradient descent training.
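A minimal sketch of the rectifier and of how its (sub-)gradient gates the backpropagated signal is given below; the function names are assumptions for this illustration and are not taken from Algorithms 1 and 2.

```python
def relu(a):
    """Element-wise rectified linear activation."""
    return [max(0.0, x) for x in a]


def relu_subgradient(a):
    """1 where the unit is active (positive input), 0 elsewhere."""
    return [1.0 if x > 0 else 0.0 for x in a]


def backprop_relu(a, grad_output):
    """Gradient with respect to the pre-activation a, given the gradient at the output."""
    return [g * s for g, s in zip(grad_output, relu_subgradient(a))]


activations = relu([-1.0, 0.0, 2.0])
grad_pre = backprop_relu([-1.0, 3.0], [5.0, 7.0])
```

Inactive units simply block the gradient, so no extra computation is needed beyond a mask.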
4.2 Dealing with Multiple Regions
Spatial information plays an important role in understanding an image. For example, the sky will often appear in the top part of the image, while a car will most often appear at the bottom. A lot of previous work has exploited this intuition successfully. For example, in the seminal work on spatial pyramids, it is shown that extracting different visual word histograms over distinct regions instead of a single image-wide histogram can yield substantial gains in performance.
We follow a similar approach, whereby we model both the presence of the visual words and the identity of the region they appear in. Specifically, let’s assume the image is divided into several distinct regions $\mathcal{R} = \{R_1, R_2, \ldots, R_M\}$, where $M$ is the number of regions. The image can now be represented as
$$\mathbf{v}^{\mathcal{R}} = [v^{\mathcal{R}}_1, v^{\mathcal{R}}_2, \ldots, v^{\mathcal{R}}_D] = [(v_1, r_1), (v_2, r_2), \ldots, (v_D, r_D)] \tag{13}$$
where $r_i \in \mathcal{R}$ is the region from which the visual word $v_i$ was extracted. To model the joint distribution over these visual words, we decompose it as $p(\mathbf{v}^{\mathcal{R}}) = \prod_{i} p\big((v_i, r_i) \mid \mathbf{v}^{\mathcal{R}}_{<i}\big)$ and treat each possible visual word/region pair as a distinct word. One implication of this is that the binary tree of visual words must be larger so as to have a leaf for each possible visual word/region pair. Fortunately, since computations grow logarithmically with the size of the tree, this is not a problem and we can still deal with a large number of regions.
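One simple way to treat each visual word/region pair as a distinct word is to map the pair to a single integer index, as sketched below. The regular grid, the row-major region numbering and the formula `region * K + word` are assumptions chosen for the example, not details given in the text.

```python
def region_of(x, y, width, height, rows, cols):
    """Index of the grid cell (row-major) containing pixel position (x, y)."""
    col = min(int(x * cols / width), cols - 1)
    row = min(int(y * rows / height), rows - 1)
    return row * cols + col


def pair_index(word, region, K):
    """Single index for a (visual word, region) pair, with K visual words per region."""
    return region * K + word


# Toy usage (hypothetical): a 100x100 image, a 2x2 grid and K = 240 visual words.
K = 240
top_left = region_of(0, 0, 100, 100, 2, 2)
top_right = region_of(60, 10, 100, 100, 2, 2)
bottom_right = region_of(99, 99, 100, 100, 2, 2)
joint = pair_index(5, bottom_right, K)
```

With `M` regions the joint vocabulary has `M * K` entries, which only adds about `log2(M)` levels to a balanced tree.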
4.3 Dealing with Annotations
The annotation of an image consists of a list of words describing the content of the image (annotations also contain multiword expressions such as person walking, but these are treated as a single token). For example, in the image of Figure 1, the annotation might contain the words “trees” or “people”. Because annotations and labels are clearly dependent, we try to model them jointly within our SupDocNADE model.
Specifically, let $\mathcal{A}$ be the predefined vocabulary of all annotation words. We denote the annotation of a given image as $\mathbf{a} = [a_1, a_2, \ldots, a_N]$ where $a_i \in \mathcal{A}$, with $N$ being the number of words in the annotation. Thus, the image with its annotation can be represented as a mixed bag of visual and annotation words:
$$\mathbf{v}^{\mathcal{A}} = [v^{\mathcal{A}}_1, \ldots, v^{\mathcal{A}}_D, v^{\mathcal{A}}_{D+1}, \ldots, v^{\mathcal{A}}_{D+N}] = [v^{\mathcal{R}}_1, \ldots, v^{\mathcal{R}}_D, a_1, \ldots, a_N] \tag{14}$$
To embed the annotation words into the SupDocNADE framework, we treat each annotation word the same way we deal with visual words. Specifically, we use a joint indexing of all visual and annotation words and use a larger binary word tree so as to augment it with leaves for the annotation words. By training SupDocNADE on this joint image/annotation representation $\mathbf{v}^{\mathcal{A}}$, it can learn the relationship between the labels, the spatially-embedded visual words and the annotation words.
At test time, the annotation words are not given and we wish to predict them. To achieve this, we compute the document representation $\mathbf{h}_y(\mathbf{v}^{\mathcal{R}})$ based only on the visual words and compute for each possible annotation word $a \in \mathcal{A}$ the probability that it would be the next observed word, $p(v^{\mathcal{A}}_{D+1} = a \mid \mathbf{v}^{\mathcal{A}}_{<D+1} = \mathbf{v}^{\mathcal{R}})$, based on the tree decomposition as in Equation 3. In other words, we only compute the probability of paths that reach a leaf corresponding to an annotation word (not a visual word). We then rank the annotation words in $\mathcal{A}$ in decreasing order of their probability and select the top 5 words as our predicted annotation.
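The final ranking step can be sketched as follows, assuming the next-word probabilities of the annotation words have already been computed from the visual-word representation. The function name and the probability values are hypothetical.

```python
def predict_annotation(annotation_probs, top_k=5):
    """Rank annotation words by decreasing probability and keep the top_k of them."""
    ranked = sorted(annotation_probs, key=annotation_probs.get, reverse=True)
    return ranked[:top_k]


# Toy next-word probabilities for annotation words (made-up values).
annotation_probs = {
    "sky": 0.30, "tree": 0.22, "car": 0.15, "road": 0.12,
    "building": 0.10, "sea": 0.06, "boat": 0.05,
}
predicted = predict_annotation(annotation_probs)
```

Only leaves that correspond to annotation words need to be scored, so the visual word/region leaves can be skipped entirely at this stage.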
5 Experiments and Results
In this section, we test our model on 3 real-world datasets: a subset of the LabelMe dataset, the UIUC-Sports dataset and the Scene15 dataset. Scene15 is used to evaluate image classification performance only, while LabelMe and UIUC-Sports come with annotations and are popular classification and annotation benchmarks. We provide a quantitative comparison between SupDocNADE, the original DocNADE model and supervised LDA (sLDA) [9, 2]. The code to download the datasets and for SupDocNADE is available at http://www.anonymous.com.
5.1 Datasets Description
The Scene15 dataset contains 4485 images, belonging to 15 different classes. Following previous work, we first resize the images so the maximum side (length or width) is 300 pixels, without changing the aspect ratio. For each experiment, we randomly select 100 images as the training set, using the remaining images for the test set.
Following Wang et al. , we constructed our LabelMe dataset using the online tool to obtain images of size pixels from the following 8 classes: highway, inside city, coast, forest, tall building, street, open country and mountain. For each class, 200 images were randomly selected and split evenly in the training and test sets, yielding a total of 1600 images.
The UIUC-Sports dataset contains 1792 images, classified into 8 classes: badminton (313 images), bocce (137 images), croquet (330 images), polo (183 images), rockclimbing (194 images), rowing (255 images), sailing (190 images), snowboarding (190 images). Following previous work, the maximum side of each image was resized to 400 pixels, while maintaining the aspect ratio. We randomly split the images of each class evenly into training and test sets. For both LabelMe and UIUC-Sports datasets, we removed the annotation words occurring fewer than 3 times, as in Wang et al.
5.2 Experimental Setup
Following Wang et al., 128-dimensional, densely extracted SIFT features were used to extract the visual words. The step and patch size of the dense SIFT extraction were set to 8 and 16, respectively. The dense SIFT features from the training set were quantized into 240 clusters, to construct our visual word vocabulary, using $k$-means. We divided each image into a grid to extract the spatial position information, as described in Section 4.2. This produced different visual word/region pairs.
We use classification accuracy to evaluate the performance of image classification and the average F-measure of the top 5 predicted annotations to evaluate the annotation performance, as in previous work. The F-measure of an image is defined as
$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{15}$$
where recall is the percentage of correctly predicted annotations out of all ground-truth annotations for an image, while the precision is the percentage of correctly predicted annotations out of all predicted annotations (when there are repeated words in the ground-truth annotations, the repeated terms were removed to calculate the F-measure). We used 5 random train/test splits to estimate the average accuracy and F-measure.
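The per-image F-measure described above can be computed as in this sketch. The function name and the example words are assumptions; repeated ground-truth words are removed by converting the lists to sets, and an image with no correct prediction is given a score of 0 by convention in this example.

```python
def f_measure(predicted, ground_truth):
    """F-measure of one image from its predicted and ground-truth annotation words."""
    pred, truth = set(predicted), set(ground_truth)  # sets drop repeated terms
    correct = len(pred & truth)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical words): 5 predictions, 3 distinct ground-truth words.
score = f_measure(
    ["sky", "car", "tree", "road", "sea"],
    ["sky", "car", "building", "sky"],
)
```

In this example 2 of the 5 predictions are correct and 2 of the 3 distinct ground-truth words are recovered, so precision is 0.4 and recall is 2/3.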
Image classification with SupDocNADE is performed by feeding the learned document representations to an RBF kernel SVM. In our experiments, all hyper-parameters (learning rate, unsupervised learning weight $\lambda$ in SupDocNADE, $C$ and $\gamma$ in the RBF kernel SVM) were chosen by cross validation. We emphasize that the annotation words are not available at test time and all methods predict an image’s class based solely on its bag of visual words.
5.3 Quantitative Comparison
In this section, we describe our quantitative comparison between SupDocNADE, DocNADE and sLDA. We used the implementation of sLDA available at http://www.cs.cmu.edu/~chongw/slda/ in our comparison. For models which did not have a publicly available implementation (the hybrid topic/neural network model and MMLDA), we compare instead with the results reported in the literature.
5.3.1 Image Classification
We first test the classification performance of our method on the Scene15 dataset. Figure 2 illustrates the performance of all methods, for a varying number of topics. We observe that SupDocNADE outperforms sLDA by a large margin, while also improving over the original DocNADE model.
In Figure 2, we also compare SupDocNADE with other design choices for the model, such as performing purely generative ($\lambda = 1$) or purely discriminative ($\lambda = 0$) training, or ignoring spatial position information (i.e. using a single region, covering the whole image). We see that both using position information and tuning the weight $\lambda$ are important, with pure discriminative learning performing worse.
Wan et al.  also performed experiments on the Scene15 dataset using their hybrid topic/neural network model, but used a slightly different setup: they used 45 topics, a visual word vocabulary of size 200, a dense SIFT patch size of and a step size of 16. They also didn’t incorporate spatial position information using a spatial grid. When running SupDocNADE using this configuration, we obtain a classification accuracy of , compared to for their model.
5.3.2 Simultaneous Classification and Annotation
We now look at the simultaneous image classification and annotation performance on LabelMe and UIUC-Sports datasets.
The classification results are illustrated in Figure 3. Similarly, we observe that SupDocNADE outperforms DocNADE and sLDA. Tuning the trade-off between generative and discriminative learning and exploiting position information is usually beneficial. There is just one exception, on LabelMe, with 200 hidden topic units, where using a single region slightly outperforms using a spatial grid.
As for image annotation, we computed the performance of our model with 200 topics. SupDocNADE obtains an F-measure of and on the LabelMe and UIUC-Sports datasets respectively. This is slightly superior to regular DocNADE, which obtains and . Since code for performing image annotation using sLDA is not publicly available, we compare directly with the results found in the corresponding paper. Wang et al. report F-measures of and for sLDA, which is below SupDocNADE by a large margin.
We also compare with MMLDA, a max-margin formulation of LDA, that has been applied to image classification and annotation separately. The reported classification accuracy for MMLDA is (for LabelMe) and (for UIUC-Sports), which is less than SupDocNADE. As for annotation, F-measures of (for LabelMe) and (for UIUC-Sports) are reported, which is better than SupDocNADE on LabelMe but worse on UIUC-Sports. We should mention that MMLDA did not address the problem of simultaneously classifying and annotating images, these tasks being treated separately.
Figure 5 illustrates examples of correct and incorrect predictions made by SupDocNADE on the LabelMe dataset.
5.4 Visualization of Learned Representations
Since topic models are often used to interpret and explore the semantic structure of image data, we looked at how we could observe the structure learned by SupDocNADE.
We tried to extract the visual/annotation words that were most strongly associated with certain class labels within SupDocNADE. For example, given a class label street, we selected the top 3 topics (hidden units) having the largest connection weights to that class in matrix $\mathbf{U}$. Then, we averaged the weight vectors of matrix $\mathbf{W}$ corresponding to these 3 hidden topics and selected the visual/annotation words with the largest averaged connection weight. The results of this procedure for classes street, sailing, forest and highway are illustrated in Figure 4. To visualize the visual words, we show 16 image patches belonging to each visual word’s cluster, as extracted by $k$-means. The learned associations are intuitive: for example, the class street is associated with the annotation words “building”, “buildings”, “window”, “person walking” and “sky”, while the visual words showcase parts of buildings and windows.
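The selection procedure described above can be sketched as follows. The function name, the storage layout (`class_weights[j]` is the weight between the class and hidden unit `j`, and `W[j][w]` the weight between hidden unit `j` and word `w`) and the toy values are assumptions made for this example.

```python
def top_words_for_class(class_weights, W, n_topics=3, n_words=5):
    """Words with the largest weight averaged over the class's strongest topics."""
    n_hidden = len(class_weights)
    # Hidden units (topics) most strongly connected to the class.
    top_topics = sorted(range(n_hidden), key=lambda j: class_weights[j], reverse=True)[:n_topics]
    n_vocab = len(W[0])
    # Average, for each word, its weights to the selected topics.
    averaged = [sum(W[j][w] for j in top_topics) / len(top_topics) for w in range(n_vocab)]
    return sorted(range(n_vocab), key=lambda w: averaged[w], reverse=True)[:n_words]


# Toy values (hypothetical): 4 hidden units and a 3-word vocabulary.
class_weights = [0.1, 0.9, 0.5, -0.3]
W = [
    [9.0, 9.0, 9.0],
    [0.0, 2.0, 5.0],
    [2.0, 2.0, 1.0],
    [7.0, 0.0, 0.0],
]
words = top_words_for_class(class_weights, W, n_topics=2, n_words=2)
```

In this toy case the two strongest topics are units 1 and 2, whose averaged word weights are 1, 2 and 3, so the last two words are returned in decreasing order of weight.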
6 Conclusion and Discussion
In this paper, we proposed SupDocNADE, a supervised extension of DocNADE. Like all topic models, our model is trained to model the distribution of the bag of words representation of images and can extract a meaningful representation from it. Unlike most topic models however, the image representation is not modeled as a latent random variable in a model, but instead as the hidden layer of a neural network. While the resulting model might be less interpretable (as typical with neural networks), it has the advantage of not requiring any iterative, approximate inference procedure to compute an image’s representation. Our experiments confirm that SupDocNADE is a competitive approach for the classification and annotation of images.
- [1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” JMLR, 2003.
- [2] C. Wang, D. Blei, and F.-F. Li, “Simultaneous image classification and annotation,” in CVPR, 2009.
- [3] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006.
- [4] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in CVPR, 2009.
- [5] C.-F. Tsai, “Bag-of-words representation in image annotation: A review,”
- [6] J. Weston, S. Bengio, and N. Usunier, “Large scale image annotation: learning to rank with joint word-image embeddings,” Machine Learning, 2010.
- [7] Z. Wu, Q. Ke, J. Sun, and H.-Y. Shum, “A multi-sample, multi-tree approach to bag-of-words image representation for image retrieval,” in ICCV, 2009.
- [8] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in CVPR, 2007.
- [9] D. M. Blei and J. D. McAuliffe, “Supervised topic models,” in NIPS, 2007.
- [10] H. Larochelle and S. Lauly, “A neural autoregressive topic model,” in NIPS 25, 2012.
- [11] Y. Wang and G. Mori, “Max-margin latent Dirichlet allocation for image classification and annotation,” in BMVC, 2011.
- [12] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” arXiv preprint arXiv:1206.5538, 2012.
- [13] R. Salakhutdinov and G. E. Hinton, “Replicated softmax: an undirected topic model,” in NIPS, 2009.
- [14] L. Wan, L. Zhu, and R. Fergus, “A hybrid neural network-latent topic model,” 2012.
- [15] G. Bouchard, B. Triggs, et al., “The tradeoff between generative and discriminative classifiers,” in COMPSTAT, 2004.
- [16] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier networks,” in AISTATS, 2011.
- [17] Nair and Hinton
- [18] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “LabelMe: a database and web-based tool for image annotation,” IJCV, 2008.
- [19] L.-J. Li and L. Fei-Fei, “What, where and who? Classifying events by scene and object recognition,” in ICCV, 2007.