A Deep and Autoregressive Approach for Topic Modeling of Multimodal Data, TPAMI, http://arxiv.org/abs/1409.3970
Topic modeling based on latent Dirichlet allocation (LDA) has been a framework of choice to deal with multimodal data, such as in image annotation tasks. Another popular approach to model the multimodal data is through deep neural networks, such as the deep Boltzmann machine (DBM). Recently, a new type of topic model called the Document Neural Autoregressive Distribution Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance for text document modeling. In this work, we show how to successfully apply and extend this model to multimodal data, such as simultaneous image classification and annotation. First, we propose SupDocNADE, a supervised extension of DocNADE, that increases the discriminative power of the learned hidden topic features and show how to employ it to learn a joint representation from image visual words, annotation words and class label information. We test our model on the LabelMe and UIUC-Sports data sets and show that it compares favorably to other topic models. Second, we propose a deep extension of our model and provide an efficient way of training the deep model. Experimental results show that our deep model outperforms its shallow version and reaches state-of-the-art performance on the Multimedia Information Retrieval (MIR) Flickr data set.
Multimodal data modeling, which combines information from different sources, is increasingly attracting attention in computer vision [1, 2, 3, 4, 5, 6, 7]. One of the leading approaches is based on topic modelling, the most popular model being latent Dirichlet allocation, or LDA. LDA is a generative model for documents that originates from the natural language processing community, but has had great success in computer vision [8, 9]. LDA models a document as a multinomial distribution over topics, where a topic is itself a multinomial distribution over words. While the distribution over topics is specific for each document, the topic-dependent distributions over words are shared across all documents. Topic models can thus extract a meaningful, semantic representation from a document by inferring its latent distribution over topics from the words it contains. In the context of computer vision, LDA can be used by first extracting so-called “visual words” from images, converting the images into visual word documents and training an LDA topic model on the bags of visual words.
To deal with multimodal data, several variants of LDA have been proposed recently [2, 5, 4, 9]. For instance, Correspondence LDA (Corr-LDA) was proposed to discover the relationship between the image and annotation modalities, by assuming that each image topic must have a corresponding text topic. Multimodal LDA generalizes Corr-LDA by learning a regression module relating the topics from the different modalities. The Multimodal Document Random Field Model (MDRF) was also proposed to deal with multimodal data; it learns cross-modality similarities from a document corpus containing multinomial data. Besides the annotation words, the class label modality can also be embedded into LDA, such as in supervised LDA (sLDA) [10, 9]. By jointly modeling the image visual words, annotation words and class labels, the discriminative power of the learned image representations can thus be improved.
At the heart of most topic models is a generative story in which the image’s latent representation is generated first and the visual words are subsequently produced from this representation. The appeal of this approach is that the task of extracting the representation from observations is easily framed as a probabilistic inference problem, for which many general purpose solutions exist. The disadvantage however is that as a model becomes more sophisticated, inference becomes less trivial and more computationally expensive. In LDA for instance, inference of the distribution over topics does not have a closed-form solution and must be approximated, either using variational approximate inference or MCMC sampling. Yet, the model is actually relatively simple, making certain simplifying independence assumptions such as the conditional independence of the visual words given the image’s latent distribution over topics.
The deep Boltzmann machine (DBM) approach of Srivastava and Salakhutdinov to the generative modeling of multimodal data achieved state-of-the-art performance on the MIR Flickr data set. On the other hand, it also shares with LDA and its different extensions the reliance on a stochastic latent representation of the data, requiring variational approximations and MCMC sampling at training and test time. Another state-of-the-art neural network approach to multimodal data modeling is the Multimodal Deep Recurrent Neural Network (MDRNN), which aims at predicting missing data modalities from the remaining modalities by minimizing the variation of information rather than maximizing likelihood.
Recently, an alternative generative modeling approach for documents was proposed by Larochelle and Lauly. In that work, a Document Neural Autoregressive Distribution Estimator (DocNADE) is proposed, which directly models the joint distribution of the words in a document by decomposing it as a product of conditional distributions (through the probability chain rule) and modeling each conditional using a neural network. Hence, DocNADE doesn’t incorporate any latent random variables over which potentially expensive inference must be performed. Instead, a document representation can be computed efficiently in a simple feed-forward fashion, using the value of the neural network’s hidden layer. Larochelle and Lauly also show that DocNADE is a better generative model of text documents than LDA and the Replicated Softmax (RS) model, and can extract a useful representation for text information retrieval.
In this paper, we consider the application of DocNADE to deal with multimodal data in computer vision. More specifically, we first propose a supervised variant of DocNADE (SupDocNADE), which can be used to model the joint distribution over an image’s visual words, annotation words and class label. The model is illustrated in Figure 1. We investigate how to successfully incorporate spatial information about the visual words and highlight the importance of calibrating the generative and discriminative components of the training objective. Our results confirm that this approach can outperform other topic models, such as the supervised variant of LDA. Moreover, we propose a deep extension of SupDocNADE, that learns a deep and discriminative representation of pairs of images and annotation words. The deep version of SupDocNADE, which is illustrated in Figure 2, outperforms its shallow one and achieves state-of-the-art performance on the challenging MIR Flickr data set.
As previously mentioned, multimodal data is often modeled using extensions of the basic LDA topic model, such as Corr-LDA , Multimodal LDA  and MDRF . In this paper, we focus on learning a joint representation from three different modalities: image visual words, annotations, and class labels. The class label describes the image globally with a single descriptive label (such as coast, outdoor, inside city, etc.), while the annotation focuses on tagging the local content within the image. Wang et al.  proposed a supervised LDA formulation to tackle this problem. Wang et al.  opted instead for a maximum margin formulation of LDA (MMLDA). Our work also belongs to this line of work, extending topic models to a supervised variant: our first contribution in this paper is thus to extend a different topic model, DocNADE, to this context for multimodal data modeling.
What distinguishes DocNADE from other topic models is its reliance on an autoregressive neural network architecture. Recently, deep neural networks have increasingly been used for the probabilistic modeling of images and text (see  for a review). The work of Srivastava and Salakhutdinov on DBMs and Sohn et al. on MDRNN are good recent examples. Ngiam et al. also proposed deep autoencoder networks for multimodal learning, though this approach was recently shown to be outperformed by DBMs and MDRNN. Although DocNADE shows favorable performance over other topic models, the lack of an efficient deep formulation limits its ability to model multimodal data, especially compared with models based on deep neural networks [19, 12, 13]. Thus, the second contribution of this paper is to propose an efficient deep version of DocNADE and its supervised variant. As we’ll see, the deep version of our DocNADE model outperforms the DBM approach of Srivastava and Salakhutdinov.
In this section, we describe the original DocNADE model. In Larochelle and Lauly, DocNADE was used to model documents of real words, belonging to some predefined vocabulary. To model image data, we assume that images have first been converted into a bag of visual words. A standard approach is to learn a vocabulary of visual words by performing $K$-means clustering on SIFT descriptors densely extracted from all training images. See Section 6.1.2 for more details about this procedure. From that point on, any image can thus be represented as a bag of visual words $\mathbf{v} = [v_1, v_2, \ldots, v_D]$, where each $v_i$ is the index of the closest $K$-means cluster to the $i$-th SIFT descriptor extracted from the image and $D$ is the number of extracted descriptors for image $\mathbf{v}$.
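The conversion of local descriptors into visual-word indices can be sketched as a nearest-centroid lookup. This is only an illustration: the function names, the toy 2-D descriptors (real SIFT descriptors are 128-dimensional) and the hand-picked centroids are assumptions, and the centroids would in practice come from running $K$-means on the training descriptors.

```python
def nearest_centroid(descriptor, centroids):
    """Index of the centroid closest (squared Euclidean distance) to descriptor."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def to_visual_words(descriptors, centroids):
    """Map each local descriptor of one image to a visual-word index."""
    return [nearest_centroid(d, centroids) for d in descriptors]


# Hypothetical toy data: three cluster centres and four descriptors.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.2], [0.1, 0.8], [0.2, 0.1]]
visual_words = to_visual_words(descriptors, centroids)
print(visual_words)  # [0, 1, 2, 0]
```

The resulting list plays the role of the bag of visual words of one image; its length is the number of extracted descriptors.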
DocNADE models the joint probability of the visual words $p(\mathbf{v})$ by rewriting it as
$$p(\mathbf{v}) = \prod_{i=1}^{D} p(v_i \mid \mathbf{v}_{<i}) \tag{1}$$
and modeling instead each conditional $p(v_i \mid \mathbf{v}_{<i})$, where $\mathbf{v}_{<i}$ is the subvector containing all $v_j$ such that $j < i$. Notice that Equation 1 is true for any distribution, based on the probability chain rule. Hence, the main assumption made by DocNADE is in the form of the conditionals. Specifically, DocNADE assumes that each conditional can be modeled and learned by a feedforward neural network.
Footnote 1: We use a random ordering of the visual words in Equation 1 for each image, and we find it works well in practice. See the discussion in Section 4.1 for more details.
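One possibility would be to model $p(v_i \mid \mathbf{v}_{<i})$ with the following architecture:
$$\mathbf{h}_i(\mathbf{v}_{<i}) = g\Big(\mathbf{c} + \sum_{k<i} \mathbf{W}_{:,v_k}\Big) \tag{2}$$
$$p(v_i = w \mid \mathbf{v}_{<i}) = \frac{\exp\big(b_w + \mathbf{V}_{w,:}\,\mathbf{h}_i(\mathbf{v}_{<i})\big)}{\sum_{w'} \exp\big(b_{w'} + \mathbf{V}_{w',:}\,\mathbf{h}_i(\mathbf{v}_{<i})\big)} \tag{3}$$
where $g(\cdot)$ is an element-wise non-linear activation function, $\mathbf{W} \in \mathbb{R}^{H \times Q}$ and $\mathbf{V} \in \mathbb{R}^{Q \times H}$ are the connection parameter matrices, $\mathbf{b} \in \mathbb{R}^{Q}$ and $\mathbf{c} \in \mathbb{R}^{H}$ are bias parameter vectors, and $H$ and $Q$ are the number of hidden units (topics) and the vocabulary size, respectively.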
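As a rough illustration of this softmax architecture, the sketch below evaluates the log-probability of a short document as a sum of log-conditionals, each computed from a hidden layer that sees only the preceding words. The function names, the toy parameter values and the use of a rectified linear activation for the non-linearity are assumptions made for the example, and the hidden layer is deliberately recomputed from scratch for every position.

```python
import math


def hidden(prev_words, W, c):
    """Hidden layer from the previous words: g(c + sum of the matching columns of W)."""
    pre = list(c)
    for word in prev_words:
        for j in range(len(pre)):
            pre[j] += W[j][word]
    return [max(0.0, a) for a in pre]  # ReLU, an assumed choice for g


def softmax_conditional(h, V, b):
    """Probability of every vocabulary word being the next word, given h."""
    scores = [b_w + sum(x * y for x, y in zip(V_w, h)) for V_w, b_w in zip(V, b)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(v, W, c, V, b):
    """log p(v) as the sum of log-conditionals (chain rule), computed naively."""
    return sum(
        math.log(softmax_conditional(hidden(v[:i], W, c), V, b)[word])
        for i, word in enumerate(v)
    )


# Hypothetical toy sizes: 2 hidden units, 4 words; W is 2 x 4 and V is 4 x 2.
W = [[0.2, -0.1, 0.0, 0.3], [0.1, 0.4, -0.2, 0.0]]
c = [0.0, 0.1]
V = [[0.5, -0.3], [0.0, 0.2], [-0.4, 0.1], [0.3, 0.3]]
b = [0.0, 0.1, -0.1, 0.0]
print(log_likelihood([2, 0, 3], W, c, V, b))
```

Each call to `softmax_conditional` touches every word of the vocabulary, which is the per-word cost that the tree-structured output described next is meant to avoid.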
Computing the distribution of Equation 3 requires time linear in $Q$. In practice, this is too expensive, since it must be computed for each of the $D$ visual words $v_i$. To address this issue, Larochelle and Lauly propose to use a balanced binary tree to decompose the computation of the conditionals and obtain a complexity logarithmic in $Q$. This is achieved by randomly assigning each visual word to a different leaf in a binary tree. Given this tree, the probability of a word is modeled as the probability of reaching its associated leaf from the root. Larochelle and Lauly model each left/right transition probability in the binary tree using a set of binary logistic regressors taking the hidden layer as input. The probability of a given word can then be obtained by multiplying the probabilities of the left/right choices along the associated tree path.
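Specifically, let $\mathbf{l}(v_i)$ be the sequence of tree nodes on the path from the root to the leaf of $v_i$ and let $\boldsymbol{\pi}(v_i)$ be the sequence of binary left/right choices at the internal nodes along that path. For example, $l(v_i)_1$ will always be the root node of the binary tree, and $\pi(v_i)_1$ will be 0 if the word leaf is in the left subtree or 1 otherwise. Let $\mathbf{V} \in \mathbb{R}^{T \times H}$ now be the matrix containing the logistic regression weights and $\mathbf{b} \in \mathbb{R}^{T}$ be a vector containing the biases, where $T$ is the number of inner nodes in the binary tree and $H$ is the number of hidden units. The probability $p(v_i = w \mid \mathbf{v}_{<i})$ is now modeled as
$$p(v_i = w \mid \mathbf{v}_{<i}) = \prod_{m=1}^{|\boldsymbol{\pi}(w)|} p\big(\pi(w)_m \mid \mathbf{v}_{<i}\big) \tag{4}$$
where
$$p\big(\pi(w)_m = 1 \mid \mathbf{v}_{<i}\big) = \mathrm{sigm}\big(b_{l(w)_m} + \mathbf{V}_{l(w)_m,:}\,\mathbf{h}_i(\mathbf{v}_{<i})\big) \tag{5}$$
are the internal node logistic regression outputs and $\mathrm{sigm}(x) = 1/(1+\exp(-x))$ is the sigmoid function. By using a balanced tree, we are guaranteed that computing Equation 4 involves only $O(\log Q)$ logistic regression outputs. One could attempt to optimize the organization of the words within the tree, but a random assignment of the words to leaves works well in practice.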
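The tree decomposition can be illustrated with a small sketch. It assumes a vocabulary whose size is a power of two, a heap-style layout of the tree (which is one convenient way of getting a balanced tree, not necessarily the authors' layout), an identity assignment of words to leaves instead of a random one, and the convention that a choice of 1 means "go right". All names are hypothetical.

```python
import math


def tree_path(word, vocab_size):
    """Return (nodes, choices) on the path from the root to the leaf of `word`.

    Assumed layout: internal nodes are numbered 0 .. vocab_size - 2 in heap
    order and the leaf of word w sits at heap index vocab_size - 1 + w.
    """
    index = vocab_size - 1 + word
    nodes, choices = [], []
    while index > 0:
        parent = (index - 1) // 2
        nodes.append(parent)
        choices.append(0 if index == 2 * parent + 1 else 1)  # 0 = left, 1 = right
        index = parent
    return nodes[::-1], choices[::-1]


def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_probability(word, h, V, b):
    """Probability of `word` as a product of left/right decisions along its path."""
    vocab_size = len(b) + 1  # a full binary tree has one more leaf than inner nodes
    prob = 1.0
    for node, choice in zip(*tree_path(word, vocab_size)):
        p_right = sigm(b[node] + sum(v * x for v, x in zip(V[node], h)))
        prob *= p_right if choice == 1 else 1.0 - p_right
    return prob


# Hypothetical toy model: 8 words, hence 7 inner nodes, and 2 hidden units.
h = [0.3, -0.2]
V = [[0.1 * (n + 1), -0.05 * n] for n in range(7)]
b = [0.01 * n for n in range(7)]
print(tree_path(5, 8))
print(sum(word_probability(w, h, V, b) for w in range(8)))
```

Only three logistic regressions are evaluated per word here (the depth of the tree for 8 leaves), and the leaf probabilities still sum to one by construction.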
To learn the parameters $\theta = \{\mathbf{W}, \mathbf{V}, \mathbf{b}, \mathbf{c}\}$ of DocNADE, we simply optimize the average negative log-likelihood of the training set documents using stochastic gradient descent.
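Equations 4 and 5 indicate that the conditional probability of each word $v_i$ requires computing the position-dependent hidden layer $\mathbf{h}_i(\mathbf{v}_{<i})$, which extracts a representation out of the bag of previous visual words $\mathbf{v}_{<i}$. Since computing $\mathbf{h}_i(\mathbf{v}_{<i})$ is in $O(HD)$ on average, and there are $D$ hidden layers to compute, a naive procedure for computing all hidden layers would be in $O(HD^2)$.
However, noticing that
$$\mathbf{h}_{i+1}(\mathbf{v}_{<i+1}) = g\Big(\mathbf{c} + \sum_{k<i+1} \mathbf{W}_{:,v_k}\Big) = g\Big(\mathbf{W}_{:,v_i} + \mathbf{c} + \sum_{k<i} \mathbf{W}_{:,v_k}\Big)$$
and exploiting the fact that the weight matrix $\mathbf{W}$ is the same across all conditionals, the linear transformation $\mathbf{c} + \sum_{k<i} \mathbf{W}_{:,v_k}$ can be reused from the computation of the previous hidden layer $\mathbf{h}_i(\mathbf{v}_{<i})$ to compute $\mathbf{h}_{i+1}(\mathbf{v}_{<i+1})$. With this procedure, computing all hidden layers sequentially from $i = 1$ to $i = D$ becomes in $O(HD)$.
Finally, since the computational complexity of each of the $O(\log Q)$ logistic regressions in Equation 4 is $O(H)$, the total complexity of computing $p(\mathbf{v})$ is $O(\log(Q)\,HD)$. In practice, the length of the document $D$ and the number of hidden units $H$ tend to be small, while $\log(Q)$ will be small even for large vocabularies. Thus DocNADE can be used and trained efficiently.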
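The reuse of the linear transformation across positions can be sketched as a running pre-activation that is updated with one column of the weight matrix per word. The function name, the toy parameters and the rectified linear activation are assumptions for the example.

```python
def relu(a):
    return a if a > 0.0 else 0.0


def all_hidden_layers(v, W, c):
    """Hidden layers for every position of v, where layer i sees only v[:i].

    Costs one column addition per word, instead of re-summing all previous
    columns at every position.
    """
    pre = list(c)  # running pre-activation: bias plus columns of earlier words
    layers = []
    for word in v:
        layers.append([relu(a) for a in pre])  # uses the words before this one
        for j in range(len(pre)):
            pre[j] += W[j][word]  # fold the current word in for the next position
    return layers


# Hypothetical toy parameters: 2 hidden units, 3 words.
W = [[1.0, -2.0, 0.5], [0.0, 1.0, -1.0]]
c = [0.1, -0.1]
for layer in all_hidden_layers([0, 1, 2, 0], W, c):
    print(layer)
```

The first layer depends on the bias alone, since no word has been observed yet at the first position.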
Once the model is trained, a latent representation can be extracted from a new document $\mathbf{v}^*$ as follows:
$$\mathbf{h}_y(\mathbf{v}^*) = g\Big(\mathbf{c} + \sum_{i=1}^{D} \mathbf{W}_{:,v_i^*}\Big)$$
This representation could be fed to a standard classifier to perform any supervised computer vision task. The index $y$ is used to highlight that it is the representation used to predict the class label $y$ of the image.
In this section, we describe the approach of this paper, inspired by DocNADE, to learn jointly from multimodal data. Here, we will concentrate on the single layer version of our model and discuss its deep extension later, in Section 5.
First, we describe a supervised extension of DocNADE (SupDocNADE), which incorporates the class label modality into training to learn more discriminative hidden features for classification. Then we describe how we exploit the spatial position information of the visual words. Finally, we describe how to jointly model the text annotation modality with SupDocNADE.
It has been observed that learning image feature representations using unsupervised topic models such as LDA can perform worse than training a classifier directly on the visual words themselves, using an appropriate kernel such as a pyramid kernel . One reason is that the unsupervised topic features are trained to explain as much of the entire statistical structure of images as possible and might not model well the particular discriminative structure we are after in our computer vision task. This issue has been addressed in the literature by devising supervised variants of LDA, such as Supervised LDA or sLDA . DocNADE also being an unsupervised topic model, we propose here a supervised variant of DocNADE, SupDocNADE, in an attempt to make the learned image representation more discriminative for the purpose of image classification.
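Specifically, given an image $\mathbf{v} = [v_1, v_2, \ldots, v_D]$ and its class label $y \in \{1, \ldots, C\}$, SupDocNADE models the full joint distribution as
$$p(\mathbf{v}, y) = p(y \mid \mathbf{v}) \prod_{i=1}^{D} p(v_i \mid \mathbf{v}_{<i})$$
As in DocNADE, each conditional is modeled by a neural network. We use the same architecture for $p(v_i \mid \mathbf{v}_{<i})$ as in regular DocNADE. We now only need to define the model for $p(y \mid \mathbf{v})$.
Since $\mathbf{h}_y(\mathbf{v})$ is the image representation that we’ll use to perform classification, we propose to model $p(y \mid \mathbf{v})$ as a multiclass logistic regression output computed from $\mathbf{h}_y(\mathbf{v})$:
$$p(y \mid \mathbf{v}) = \mathrm{softmax}\big(\mathbf{d} + \mathbf{U}\,\mathbf{h}_y(\mathbf{v})\big)_y$$
where $\mathrm{softmax}(\mathbf{a})_i = \exp(a_i) / \sum_{j=1}^{C} \exp(a_j)$, $\mathbf{d} \in \mathbb{R}^{C}$ is the bias parameter vector in the supervised layer and $\mathbf{U} \in \mathbb{R}^{C \times H}$ is the connection matrix between the hidden layer and the class label.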
Put differently, $p(y \mid \mathbf{v})$ is modeled as a regular multiclass neural network, taking as input the bag of visual words $\mathbf{v}$. The crucial difference with a regular neural network, however, is that some of its parameters (namely the hidden unit parameters $\mathbf{W}$ and $\mathbf{c}$) are also used to model the visual word conditionals $p(v_i \mid \mathbf{v}_{<i})$.
Maximum likelihood training of this model is performed by minimizing the negative log-likelihood
$$-\log p(\mathbf{v}, y) = -\log p(y \mid \mathbf{v}) + \sum_{i=1}^{D} -\log p(v_i \mid \mathbf{v}_{<i})$$
averaged over all training images. This is known as generative learning. The first term is a purely discriminative term, while the second is unsupervised and can be understood as a regularizer that encourages a solution which also explains the unsupervised statistical structure within the visual words. In practice, this regularizer can bias the solution too strongly away from a more discriminative solution that generalizes well. Hence, similarly to previous work on hybrid generative/discriminative learning, we propose instead to weight the importance of the generative term
$$L(\mathbf{v}, y; \theta) = -\log p(y \mid \mathbf{v}) + \lambda \sum_{i=1}^{D} -\log p(v_i \mid \mathbf{v}_{<i}) \tag{12}$$
where $\lambda$ is treated as a regularization hyper-parameter.
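The pieces of this hybrid objective can be sketched as follows: a document representation built from all the visual words, a multiclass logistic regression on top of it, and a loss that adds the generative term scaled by a weight. The function names, the toy parameters and the rectified linear activation are assumptions; the per-word log-conditionals are taken as given, since they can come from either the softmax or the tree output model.

```python
import math


def document_representation(v, W, c):
    """Representation from all words of v: ReLU(c + sum of the matching columns of W)."""
    pre = list(c)
    for word in v:
        for j in range(len(pre)):
            pre[j] += W[j][word]
    return [max(0.0, a) for a in pre]


def class_log_probs(h, U, d):
    """Log of the multiclass logistic regression output computed from h."""
    scores = [d_k + sum(u * x for u, x in zip(U_k, h)) for U_k, d_k in zip(U, d)]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_z for s in scores]


def hybrid_loss(log_p_label, word_log_probs, lam):
    """Discriminative term plus lam times the generative (unsupervised) term."""
    return -log_p_label + lam * -sum(word_log_probs)


# Hypothetical toy example: 2 hidden units, 3 words, 3 classes, true label 1.
h = document_representation([0, 2, 2], [[1.0, 0.0, 0.5], [-1.0, 0.0, 0.2]], [0.0, 0.0])
log_probs = class_log_probs(h, [[0.1, 0.0], [0.0, 0.3], [-0.2, 0.1]], [0.0, 0.0, 0.0])
print(hybrid_loss(log_probs[1], [math.log(0.25)] * 3, lam=0.1))
```

Setting `lam` to 0 recovers a purely discriminative loss, and setting it to 1 recovers the full negative log-likelihood.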
Optimizing the training set average of Equation 12 is performed by stochastic gradient descent, using backpropagation to compute the parameter derivatives. As in regular DocNADE, computation of the training objective and its gradient requires that we define an ordering of the visual words. Though we could have defined an arbitrary path across the image to order the words (e.g. from left to right, top to bottom in the image), we follow Larochelle and Lauly and randomly permute the words before every stochastic gradient update. The implication is that the model is effectively trained to be a good inference model of any conditional $p(v_i \mid \mathbf{v}_{<i})$, for any ordering of the words in $\mathbf{v}$. This again helps fight overfitting and better regularizes our model. One could thus think of SupDocNADE as learning from a sequence of random fixations performed in a visual scene.
In our experiments, we used the rectified linear function as the activation function
$$g(a) = \max(0, a) = \begin{cases} a & \text{if } a > 0 \\ 0 & \text{otherwise} \end{cases}$$
which often outperforms other activation functions and has been shown to work well for image data. Since this is a piece-wise linear function, the (sub-)gradient with respect to its input, needed by backpropagation to compute the parameter gradients, is simply
$$\frac{\partial g(a)}{\partial a} = 1_{a > 0}$$
where $1_{P}$ is 1 if $P$ is true and 0 otherwise.
Spatial information plays an important role for understanding an image. For example, the sky will often appear on the top part of the image, while a car will most often appear at the bottom. A lot of previous work has exploited this intuition successfully. For example, in the seminal work on spatial pyramids , it is shown that extracting different visual word histograms over distinct regions instead of a single image-wide histogram can yield substantial gains in performance.
We follow a similar approach, whereby we model both the presence of the visual words and the identity of the region they appear in. Specifically, let’s assume the image is divided into several distinct regions $\mathcal{R} = \{R_1, R_2, \ldots, R_M\}$, where $M$ is the number of regions. The image can now be represented as
$$\mathbf{v} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_D] = [(v_1, r_1), (v_2, r_2), \ldots, (v_D, r_D)]$$
where $r_i \in \mathcal{R}$ is the region from which the visual word $v_i$ was extracted. To model the joint distribution over these visual words, we decompose it as $p(\mathbf{v}) = \prod_{i} p\big((v_i, r_i) \mid \mathbf{v}_{<i}\big)$ and treat each possible visual word/region pair as a distinct word. One implication of this is that the binary tree of visual words must be larger so as to have a leaf for each possible visual word/region pair. Fortunately, since computations grow logarithmically with the size of the tree, this is not a problem and we can still deal with a large number of regions.
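Treating each visual word/region pair as a distinct word amounts to a joint indexing of the two. The sketch below shows one way to do this; the regular grid used to define regions, the function names and the sizes are assumptions for illustration, since the text only requires that the regions be distinct.

```python
def region_of(x, y, width, height, grid):
    """Region index of pixel (x, y) on an assumed grid x grid partition of the image."""
    col = min(int(x * grid / width), grid - 1)
    row = min(int(y * grid / height), grid - 1)
    return row * grid + col


def pair_index(word, region, num_regions):
    """Single index for a (visual word, region) pair."""
    return word * num_regions + region


def split_index(index, num_regions):
    """Inverse of pair_index: recover (visual word, region)."""
    return divmod(index, num_regions)


# A descriptor quantized to visual word 5, found near the bottom-left of a
# 100 x 100 image split into a 2 x 2 grid (4 regions).
region = region_of(10, 90, 100, 100, grid=2)
joint = pair_index(5, region, num_regions=4)
print(region, joint, split_index(joint, 4))  # 2 22 (5, 2)
```

With a vocabulary of visual words and a set of regions, the joint vocabulary has one entry per pair, which only adds a logarithmic cost when the tree output model is used.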
So far, we’ve described how to model the visual word and class label modalities. In this section, we now describe how we also model the annotation word modality with SupDocNADE.
Specifically, let $\mathcal{A}$ be the predefined vocabulary of all annotation words. We denote the annotation of a given image as $\mathbf{a} = [a_1, a_2, \ldots, a_L]$ where $a_i \in \mathcal{A}$, with $L$ being the number of words in the annotation. Thus, the image with its annotation can be represented as a mixed bag of visual and annotation words:
$$\mathbf{v}^{A} = [v_1, \ldots, v_D, a_1, \ldots, a_L]$$
To embed the annotation words into the SupDocNADE framework, we treat each annotation word the same way we deal with visual words. Specifically, we use a joint indexing of all visual and annotation words and use a larger binary word tree so as to augment it with leaves for the annotation words. By training SupDocNADE on this joint image/annotation representation $\mathbf{v}^{A}$, it can learn the relationship between the labels, the spatially-embedded visual words and the annotation words.
At test time, the annotation words are not given and we wish to predict them. To achieve this, we compute the document representation $\mathbf{h}_y(\mathbf{v})$ based only on the visual words and compute, for each possible annotation word $a \in \mathcal{A}$, the probability that it would be the next observed word, based on the tree decomposition as in Equation 4. In other words, we only compute the probability of paths that reach a leaf corresponding to an annotation word (not a visual word). We then rank the annotation words in $\mathcal{A}$ in decreasing order of their probability and select the top 5 words as our predicted annotation.
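The ranking step at test time can be sketched as below. The function name, the word indices and the probability values are hypothetical; the probabilities stand for the model's next-word probabilities computed from the visual words only, restricted to the annotation part of the joint vocabulary.

```python
def predict_annotation(next_word_prob, annotation_ids, k=5):
    """Return the k annotation words with the highest next-word probability."""
    ranked = sorted(annotation_ids, key=lambda w: next_word_prob[w], reverse=True)
    return ranked[:k]


# Hypothetical joint vocabulary in which indices 10..16 are annotation words.
next_word_prob = {10: 0.02, 11: 0.30, 12: 0.10, 13: 0.25, 14: 0.05, 15: 0.20, 16: 0.08}
print(predict_annotation(next_word_prob, range(10, 17)))  # [11, 13, 15, 12, 16]
```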
Although SupDocNADE has achieved better performance than the other topic models in our previous work , the lack of an efficient deep formulation of SupDocNADE reduces its capability of modeling multimodal data, especially compared with other models based on deep neural network [13, 12].
Recently, Uria et al.  proposed an efficient extension of the original NADE model  for binary vector observations, from which DocNADE was derived. We take inspiration from Uria et al.  and propose SupDeepDocNADE, i.e. a supervised deep autoregressive neural topic model for multimodal data modeling.
In this section, we introduce the deep extension of DocNADE (DeepDocNADE) and then describe how to incorporate supervised information into its training. We also discuss how to deal with the imbalance between the number of visual words and annotation words, in order to obtain good performance. Before we start the discussion, we note that the notation $\mathbf{v}$, which denotes the words of an image, includes both the visual words and the annotation words of an image in the following sections, as discussed in Section 4.3.
We first revisit the training procedure for DocNADE. We will concentrate on the unsupervised version of DocNADE for now and discuss the supervised case later.
In Section 4.1 we mentioned that words are randomly permuted before every stochastic gradient update, to make DocNADE be a good inference model for any ordering of the words. As Uria et al.  notice, we can think of the use of many orderings as the instantiation of many different DocNADE models, one for each distinct ordering. From that point of view, by training a single set of parameters (connection matrices and biases) on all these orderings, we are effectively employing a parameter sharing strategy across these models and the training process can be interpreted as training a factorial number of DocNADE models simultaneously.
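We will now make the notion of ordering more explicit in our notation. Following Uria et al., we now denote as $p(\mathbf{v} \mid \theta, o)$ the joint distribution of the DocNADE model over the words of an image given the parameters $\theta$ and ordering $o$. We will also denote as $p(v_{o_d} \mid \mathbf{v}_{o_{<d}}, \theta, o)$ the conditional distribution described in Equation 3 or 4, where $\mathbf{v}_{o_{<d}}$ is the subvector of the previous $d-1$ words extracted from an ordered word vector $\mathbf{v}_o$, and $v_{o_d}$ is the $d$-th word of $\mathbf{v}_o$. Notice that the ordering is now treated explicitly as a random variable.
Thus, training DocNADE on stochastically sampled orderings corresponds, in expectation, to minimizing the negative log-likelihood across all possible orderings, for each training example $\mathbf{v}$:
$$L(\mathbf{v}; \theta) = \mathop{\mathbb{E}}_{o \in \mathcal{O}} -\log p(\mathbf{v} \mid \theta, o) = \mathop{\mathbb{E}}_{o \in \mathcal{O}} \sum_{d=1}^{D} -\log p(v_{o_d} \mid \mathbf{v}_{o_{<d}}, \theta, o) \tag{17}$$
where $\mathcal{O}$ is the set of all orderings.
By moving the expectation over orderings, $\mathbb{E}_{o \in \mathcal{O}}$, inside the summation over the conditionals, the expectation can be split into three parts (see Footnote 2): one over $o_{<d}$, standing for the first $d-1$ indices in the ordering $o$; one over $o_d$, which is the $d$-th index of the ordering $o$; and one over $o_{>d}$, standing for the remaining indices of the ordering.
Footnote 2: The split is done in a modality-agnostic way, i.e. the visual words and annotation words are mixed together and are treated equally when training the model.
Hence, the loss function can be rewritten as:
$$L(\mathbf{v}; \theta) = \sum_{d=1}^{D} \mathop{\mathbb{E}}_{o_{<d}} \mathop{\mathbb{E}}_{o_d} \mathop{\mathbb{E}}_{o_{>d}} -\log p(v_{o_d} \mid \mathbf{v}_{o_{<d}}, \theta, o_{<d}, o_d, o_{>d}) \tag{19}$$
Noting that the value of each conditional does not depend on $o_{>d}$, Equation 19 can then be simplified as:
$$L(\mathbf{v}; \theta) = \sum_{d=1}^{D} \mathop{\mathbb{E}}_{o_{<d}} \mathop{\mathbb{E}}_{o_d} -\log p(v_{o_d} \mid \mathbf{v}_{o_{<d}}, \theta, o_{<d}, o_d) \tag{20}$$
In practice, Equation 20 still sums over a number of terms too large to be evaluated exhaustively. For training, we thus use a stochastic estimation and replace the expectations/sums over $d$ and $o_{<d}$ with samples. On the other hand, the innermost expectation over $o_d$ can be obtained cheaply. Indeed, for a given value of $d$ and $o_{<d}$, all terms require the computation of the same hidden layer representation $\mathbf{h}(\mathbf{v}_{o_{<d}})$ from the subvector $\mathbf{v}_{o_{<d}}$. Therefore, $L(\mathbf{v}; \theta)$ can be estimated by:
$$\hat{L}(\mathbf{v}; \theta) = \frac{D}{D-d+1} \sum_{o_d} -\log p(v_{o_d} \mid \mathbf{v}_{o_{<d}}, \theta, o_{<d}, o_d) \tag{21}$$
where $D$ is the number of words (including both visual and annotation words) in $\mathbf{v}$. In words, Equation 21 measures the ability of the model to predict, from a fixed and random context of words $\mathbf{v}_{o_{<d}}$, any of the remaining words in the image/annotation.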
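The estimator can be sketched as follows, assuming the usual order-agnostic scaling in which the summed loss over the remaining words is multiplied by the total number of words divided by the number of remaining words. The function names are hypothetical and `cond_log_prob` stands for any model that maps a context to one vector of log-probabilities over the vocabulary, computed once and shared by all remaining words.

```python
import math


def order_agnostic_loss(context, remaining, cond_log_prob):
    """Stochastic estimate of the negative log-likelihood from one random split.

    context:   the first d - 1 words of a shuffled document
    remaining: the other D - d + 1 words, all predicted from the same context
    """
    total_words = len(context) + len(remaining)
    log_p = cond_log_prob(context)  # one forward pass, shared by every target
    return total_words / len(remaining) * sum(-log_p[w] for w in remaining)


def uniform_model(context):
    """Hypothetical stand-in model: uniform over a vocabulary of 4 words."""
    return [math.log(0.25)] * 4


loss = order_agnostic_loss([0], [1, 2, 3], uniform_model)
print(loss, 4 * math.log(4.0))
```

For the uniform stand-in model the estimate equals the exact negative log-likelihood of a 4-word document, whichever split is drawn, which is a quick sanity check on the scaling factor.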
From this, training of DocNADE can be performed by stochastic gradient descent. For a given training example $\mathbf{v}$, a training update is performed as follows (see Footnote 3): a shuffled version of $\mathbf{v}$ is generated and split at a randomly sampled position $d$ into a context $\mathbf{v}_{o_{<d}}$ and the remaining words $\mathbf{v}_{o_{\geq d}}$; the conditionals of Equation 21 are computed from the hidden representation of the context; and the parameters are updated with a gradient step on the loss of Equation 21.
Footnote 3: In experiments, both visual words and annotation words are represented in Bag of Words (BoW) fashion. As shown in Section 5.2, the training procedure amounts to generating a word vector from the BoW, shuffling the word vector and splitting it, and then regenerating the histograms $\mathbf{x}(\mathbf{v}_{o_{<d}})$ and $\mathbf{x}(\mathbf{v}_{o_{\geq d}})$, which is inefficient for processing samples in a mini-batch fashion. Hence, in practice, we split the original histogram directly by uniformly sampling, for each individual word, how many of its occurrences are put on the left of the split (the others are put on the right of the split). This is not equivalent to the procedure described in this paper, but it works well in practice.
It should be noted that, since the number of words $D$ in an image/annotation pair can vary across examples, the value of $D$ will vary between updates, unlike in Uria et al., which models binary vectors of fixed size.
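The two ways of producing a context/target split mentioned here can be sketched side by side: shuffling the word vector and cutting it at a random position, and the cheaper approximation that splits each histogram count independently. Function names are hypothetical, and drawing the cut position uniformly from 1 to the number of words is an assumption.

```python
import random


def shuffle_and_split(words, rng):
    """Shuffle the word vector and cut it at a random position.

    Returns (context, remaining); remaining always holds at least one word.
    """
    shuffled = list(words)
    rng.shuffle(shuffled)
    d = rng.randint(1, len(shuffled))  # assumed uniform over 1 .. number of words
    return shuffled[: d - 1], shuffled[d - 1:]


def split_histogram(histogram, rng):
    """Approximate split applied directly to bag-of-words counts.

    For every word, a uniformly sampled number of its occurrences goes to the
    left (context) histogram and the rest goes to the right (target) one.
    """
    left = [rng.randint(0, count) for count in histogram]
    right = [count - kept for count, kept in zip(histogram, left)]
    return left, right


rng = random.Random(0)
print(shuffle_and_split([4, 4, 7, 1, 9], rng))
print(split_histogram([3, 0, 5, 1], rng))
```

The histogram version avoids materializing and shuffling the word vector, which is what makes it convenient for mini-batches, at the price of a different distribution over splits.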
We can contrast this procedure with the one described in Section 4.1, which prescribed a stochastic estimation with respect to the possible orderings of the words and an exhaustive sum in predicting all the words in the sequence. Here, we have the opposite: it is stochastic by predicting a subset of the words but is (partially) exhaustive by implicitly summing the gradient contributions over several orderings sharing the same permutation up to position $d$.
As shown in Section 5.1, training of DocNADE can be performed by randomly splitting the words into two parts, $\mathbf{v}_{o_{<d}}$ and $\mathbf{v}_{o_{\geq d}}$, and applying stochastic gradient descent on the loss function of Equation 21. Thus, the training procedure now corresponds to a neural network, with $\mathbf{v}_{o_{<d}}$ being the input and $\mathbf{v}_{o_{\geq d}}$ as the output’s target. The advantage of this approach is that DocNADE can more easily be extended to a deep version this way, which we will refer to as DeepDocNADE.
Indeed, as mentioned in the previous section, all conditionals in the summation of Equation 21 require the computation of a single hidden layer representation:
$$\mathbf{h}^{(1)}(\mathbf{v}_{o_{<d}}) = g\Big(\mathbf{c}^{(1)} + \sum_{k<d} \mathbf{W}^{(1)}_{:,v_{o_k}}\Big) = g\big(\mathbf{c}^{(1)} + \mathbf{W}^{(1)}\,\mathbf{x}(\mathbf{v}_{o_{<d}})\big)$$
where $\mathbf{x}(\mathbf{v}_{o_{<d}})$ is the histogram vector representation of the word sequence $\mathbf{v}_{o_{<d}}$ and where the exponent $(1)$ is used to index the first hidden layer and its parameters.
So, unlike in the original training procedure for DocNADE, a training update now requires the computation of a single hidden layer, instead of $D$ hidden layers. This way, adding more hidden layers only has an additive, instead of multiplicative, effect on the complexity of each training update. Hidden layers are added as in regular deep feedforward neural networks, as follows:
$$\mathbf{h}^{(n)} = g\big(\mathbf{c}^{(n)} + \mathbf{W}^{(n)}\,\mathbf{h}^{(n-1)}\big), \quad n = 2, \ldots, N$$
where $\mathbf{W}^{(n)}$ and $\mathbf{c}^{(n)}$ are the connection matrix and bias for hidden layer $n$, $n = 1, \ldots, N$, where $N$ is the number of hidden layers.
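A minimal sketch of this forward pass follows: the first layer is applied to the histogram of the context words and every further layer to the previous hidden layer. The function name, the toy weights and the rectified linear activation are assumptions.

```python
def forward(context_histogram, weights, biases):
    """Stack of hidden layers; the first one takes the histogram of context words."""
    h = context_histogram
    for W, c in zip(weights, biases):
        h = [max(0.0, c_j + sum(w * x for w, x in zip(W_j, h)))
             for W_j, c_j in zip(W, c)]
    return h


# Hypothetical toy network: vocabulary of 3 words, then 2 and 1 hidden units.
weights = [
    [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]],  # first layer, 2 x 3
    [[1.0, 1.0]],                         # second layer, 1 x 2
]
biases = [[0.0, 0.0], [-0.5]]
print(forward([2, 1, 1], weights, biases))  # [2.5]
```

Because the input is a single histogram, one such pass per training update suffices regardless of the document length, and each added layer contributes one extra matrix product.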
To compute the conditional $p(v_{o_d} \mid \mathbf{v}_{o_{<d}}, \theta, o_{<d}, o_d)$ in Equation 21 after obtaining the hidden representation $\mathbf{h}^{(N)}$, the binary tree introduced in Section 3 could be used for an efficient implementation. However, in cases where the histogram of future words $\mathbf{x}(\mathbf{v}_{o_{\geq d}})$ is not sparse, the binary tree output model might not be the most efficient approach. For example, suppose $\mathbf{x}(\mathbf{v}_{o_{\geq d}})$ is full (has no zero entries) and the vocabulary size is $Q$. The computation of Equation 21 via the binary tree is then in $O(Q \log Q)$, since it has to compute $O(\log Q)$ logistic regressions for each of the $Q$ words in $\mathbf{v}_{o_{\geq d}}$. In this specific scenario, however, going back to a softmax model of the conditionals is preferable. Indeed, since all conditionals in Equation 21 share the same hidden representation $\mathbf{h}^{(N)}$, the normalization term in the softmax is the same for all future words, and the computation is only in $O(Q)$. Another advantage of the softmax over the binary tree is that the softmax is more amenable to an efficient implementation on the GPU, which will also speed up the training process.
In the end, for the experiments with the deep extension of DocNADE of this paper, we opted for the softmax model as we’ve found it to be more efficient. We emphasize however that the binary tree is still the most efficient option for the loss function of Equation 12 or when the histogram of future words is sparse.
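The saving offered by the softmax output can be illustrated with a sketch in which the normalization term is computed once and then reused for every future word, with the histogram of future words acting as per-word multiplicities. The function name and the toy parameters are assumptions.

```python
import math


def future_words_nll(h, V, b, future_histogram):
    """Summed negative log-probability of all future words under one softmax.

    The log-normalizer is computed a single time because every future word is
    predicted from the same hidden representation h.
    """
    scores = [b_w + sum(v * x for v, x in zip(V_w, h)) for V_w, b_w in zip(V, b)]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return sum(count * (log_z - s) for s, count in zip(scores, future_histogram))


# Hypothetical toy output layer: vocabulary of 3 words, 2 hidden units.
h = [0.5, 1.0]
V = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]
b = [0.0, 0.1, -0.1]
print(future_words_nll(h, V, b, [2, 0, 1]))
```

Multiplying this sum by the scaling factor of the stochastic estimator would give the per-example loss for one sampled split.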
Deep Document NADE can also be extended to a supervised variant, which is referred to as SupDeepDocNADE, following the formulation in Section 4.1.
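Specifically, to add the supervised information into DeepDocNADE, the negative log-likelihood function in Equation 17 could be extended as follows:
$$L(\mathbf{v}, y; \theta) = \mathop{\mathbb{E}}_{o \in \mathcal{O}} -\log p(\mathbf{v}, y \mid \theta, o) = \mathop{\mathbb{E}}_{o \in \mathcal{O}} \Big[ -\log p(y \mid \mathbf{v}, \theta) - \log p(\mathbf{v} \mid \theta, o) \Big] \tag{26}$$
Since $p(y \mid \mathbf{v}, \theta)$ is independent of $o$, Equation 26 can be rewritten as:
$$L(\mathbf{v}, y; \theta) = -\log p(y \mid \mathbf{v}, \theta) + \mathop{\mathbb{E}}_{o \in \mathcal{O}} -\log p(\mathbf{v} \mid \theta, o)$$
Then $L(\mathbf{v}, y; \theta)$ can be approximated by sampling $d$ and $o_{<d}$ as follows:
$$\hat{L}(\mathbf{v}, y; \theta) = -\log p(y \mid \mathbf{v}, \theta) + \frac{D}{D-d+1} \sum_{o_d} -\log p(v_{o_d} \mid \mathbf{v}_{o_{<d}}, \theta, o_{<d}, o_d)$$
The first term is supervised, while the second term is unsupervised and can be interpreted as a regularizer. Thus, we can also weight the importance of the unsupervised part by a hyperparameter $\lambda$ and obtain a hybrid cost function:
$$\hat{L}(\mathbf{v}, y; \theta) = -\log p(y \mid \mathbf{v}, \theta) + \lambda \frac{D}{D-d+1} \sum_{o_d} -\log p(v_{o_d} \mid \mathbf{v}_{o_{<d}}, \theta, o_{<d}, o_d)$$
This hybrid cost function can then be used as the per-example loss and optimized over the training set using stochastic gradient descent.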
As mentioned in Section 4.3, the annotation words can be embedded into the framework of SupDocNADE by treating them the same way we deal with visual words. In practice, however, the number of visual words can be much larger than the number of annotation words. For example, in the MIR Flickr data set, with the experimental setup of Srivastava and Salakhutdinov, the average number of visual words per image is much larger than the average number of annotation words per image. This imbalance between visual words and annotation words might cause problems. For example, the contribution of the annotation words to the hidden representation can be so small that it is effectively ignored compared with the contribution of the huge amount of visual words, and the gradients coming from the annotation words might also be too small to have any meaningful effect on increasing the conditional probabilities of the annotation words.
To deal with this problem, we propose to weight the annotation words in the histograms $\mathbf{x}(\mathbf{v}_{o_{<d}})$ and $\mathbf{x}(\mathbf{v}_{o_{\geq d}})$. More specifically, let $\boldsymbol{\omega}(\rho)$ be a vector containing $Q$ components, where $Q$ is the vocabulary size (including both visual and annotation words), each component corresponding to a word (either visual or annotation). The components corresponding to the visual words are set to $1$ and the components corresponding to the annotation words are set to $\rho$. Then the new histograms of $\mathbf{v}_{o_{<d}}$ and $\mathbf{v}_{o_{\geq d}}$ are computed as
$$\mathbf{x}^{new}(\mathbf{v}_{o_{<d}}) = \mathbf{x}(\mathbf{v}_{o_{<d}}) \odot \boldsymbol{\omega}(\rho), \qquad \mathbf{x}^{new}(\mathbf{v}_{o_{\geq d}}) = \mathbf{x}(\mathbf{v}_{o_{\geq d}}) \odot \boldsymbol{\omega}(\rho)$$
where $\odot$ is element-wise multiplication.
Moreover, the hybrid cost function of Section 5.3 is rewritten as:
$$\hat{L}(\mathbf{v}, y; \theta) = -\log p(y \mid \mathbf{v}, \theta) + \lambda \frac{D}{D-d+1} \sum_{o_d} -\omega_{\rho}(o_d) \log p\big(v_{o_d} \mid \mathbf{x}^{new}(\mathbf{v}_{o_{<d}}), \theta, o_{<d}, o_d\big)$$
where $p\big(v_{o_d} \mid \mathbf{x}^{new}(\mathbf{v}_{o_{<d}}), \theta, o_{<d}, o_d\big)$ is a conditional probability obtained by replacing $\mathbf{x}(\mathbf{v}_{o_{<d}})$ with $\mathbf{x}^{new}(\mathbf{v}_{o_{<d}})$ in Equation 23, and $\omega_{\rho}(o_d)$ is a function that assigns weight $\rho$ if $v_{o_d}$ is an annotation word, and $1$ otherwise.
By weighting annotation words in the histogram, the model will pay more attention to the annotation words, reducing the problem caused by the imbalance between visual and annotation words. In practice, the weight $\rho$ is a hyper-parameter and can be selected by cross-validation. As we’ll see in Section 6.2.4, weighting annotation words more heavily can significantly improve the performance.
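The reweighting of the histograms can be sketched as an element-wise product with a weight vector that leaves visual words untouched and scales annotation words. The function name, the boolean mask marking annotation words and the weight value are assumptions for the example.

```python
def weighted_histogram(histogram, is_annotation, weight):
    """Scale the counts of annotation words by `weight`; visual words keep weight 1."""
    return [count * (weight if flag else 1.0)
            for count, flag in zip(histogram, is_annotation)]


# Hypothetical joint vocabulary: two visual words followed by two annotation words.
histogram = [3, 0, 2, 1]
is_annotation = [False, False, True, True]
print(weighted_histogram(histogram, is_annotation, weight=10.0))  # [3.0, 0.0, 20.0, 10.0]
```

The same weights would be applied to the per-word terms of the loss, so that errors on annotation words are not drowned out by the far more numerous visual words.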
Besides the spatial information and annotation which are embedded into the framework of DocNADE in Section 4.2 and Section 4.3, bottom-up global features, such as Gist  and MPEG-7 descriptors , can also play an important role in multimodal data modeling . Global features can, among other things, complement the local information extracted from patch-based visual words. In this section, we describe how to embed such features into the framework of our model.
Specifically, let be the global feature vector extracted from an image, where is the length of the global feature vector. One possibility for embedding into the model could be to condition the hidden representation on the global feature as follows:
where is a connection matrix specific to the global features. This can be understood as a hidden layer whose hidden unit biases are conditioned on the image’s global features vector . Thus, the whole model is conditioned not only on previous words but also on the global features .
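As an illustration of conditioning the hidden unit biases on global features, the following sketch computes a hidden layer from a bias vector, the word embeddings of the observed words, and a linear projection of the global feature vector. The names `c`, `W`, `P` and `f`, the rectified linear activation and the toy values are assumptions made for this example; the paper's exact equation did not survive in this copy of the text.

```python
def hidden_with_global_features(c, W, word_ids, P, f):
    """Return ReLU(c + sum_k W[:, v_k] + P f) as a list.

    c: bias vector (length H)
    W: H x K word connection matrix (list of rows)
    word_ids: indices of the words observed so far
    P: H x N_f connection matrix for the global features
    f: global feature vector (length N_f)
    """
    pre_activation = list(c)
    for j in range(len(c)):
        # Contribution of the previously observed words.
        pre_activation[j] += sum(W[j][v] for v in word_ids)
        # Global features act as an image-specific shift of the bias.
        pre_activation[j] += sum(P[j][n] * f[n] for n in range(len(f)))
    return [max(0.0, a) for a in pre_activation]


c = [0.0, -1.0]
W = [[0.5, 0.1, 0.0],
     [0.2, 0.0, 0.3]]
P = [[1.0, 0.0],
     [0.0, -2.0]]
h = hidden_with_global_features(c, W, word_ids=[0, 2, 2], P=P, f=[0.25, 0.5])
```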
In this section, we compare the performance of our model with that of other models for multimodal data modeling. Specifically, we first test the ability of the single hidden layer SupDocNADE to learn from multimodal data on two real-world data sets which are widely used in the research on other topic models. Then we test the performance of SupDeepDocNADE on the large-scale multimedia information retrieval (MIR) Flickr data set and show that SupDeepDocNADE achieves state-of-the-art performance. The code to download the data sets and for SupDocNADE and SupDeepDocNADE is available at https://sites.google.com/site/zhengyin1126/home/supdeepdocnade.
To test the ability of the single hidden layer SupDocNADE to learn from multimodal data, we measured its performance under simultaneous image classification and annotation tasks. We tested our model on 2 real-world data sets: a subset of the LabelMe data set  and the UIUC-Sports data set . LabelMe and UIUC-Sports come with annotations and are popular classification and annotation benchmarks. We performed extensive quantitative comparisons of SupDocNADE with the original DocNADE model and supervised LDA (sLDA) [10, 9] (see footnote 4). We also provide some comparisons with MMLDA  and a Spatial Pyramid Matching (SPM) approach .
Footnote 4: We mention that  has shown that sLDA performs better than Corr-LDA. Moreover,  found that Multimodal LDA  did not improve on the performance of Corr-LDA. Finally, sLDA distinguishes itself from the other models in the fact that it also supports the class label modality and has code available online. Hence, we compare directly with sLDA only.
Following Wang et al. , we constructed our LabelMe data set using the online tool to obtain images of size pixels from the following 8 classes: highway, inside city, coast, forest, tall building, street, open country and mountain. For each class, 200 images were randomly selected and split evenly into the training and test sets, yielding a total of 1600 images.
The UIUC-Sports data set contains 1792 images, classified into 8 classes: badminton (313 images), bocce (137 images), croquet (330 images), polo (183 images), rockclimbing (194 images), rowing (255 images), sailing (190 images), snowboarding (190 images). Following previous work, the maximum side of each image was resized to 400 pixels, while maintaining the aspect ratio. We randomly split the images of each class evenly into training and test sets. For both LabelMe and UIUC-Sports data sets, we removed the annotation words occurring fewer than 3 times, as in Wang et al. .
Following Wang et al. , 128 dimensional, densely extracted SIFT features were used to extract the visual words. The step and patch size of the dense SIFT extraction were set to 8 and 16, respectively. The dense SIFT features from the training set were quantized into 240 clusters using k-means, to construct our visual word vocabulary. We divided each image into a grid to extract the spatial position information, as described in Section 4.2. This produced different visual word/region pairs.
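One possible way to turn a visual word and its patch position into a single word/region index is sketched below. The grid size is not recoverable from this copy of the text, so the 2 x 2 grid here is purely hypothetical, as are the function names and the particular index encoding (region index times vocabulary size plus visual word index).

```python
def region_index(x, y, width, height, grid):
    """Map a patch centre (x, y) to a cell of a grid x grid partition."""
    col = min(int(x * grid / width), grid - 1)
    row = min(int(y * grid / height), grid - 1)
    return row * grid + col


def word_region_pair(visual_word, x, y, width, height, grid, vocab_size):
    """Encode (visual word, region) as one index in an enlarged vocabulary."""
    return region_index(x, y, width, height, grid) * vocab_size + visual_word


VOCAB_SIZE = 240   # number of k-means clusters, as stated in the text
GRID = 2           # hypothetical grid size, chosen only for illustration

pair = word_region_pair(5, x=300, y=50, width=400, height=200,
                        grid=GRID, vocab_size=VOCAB_SIZE)
n_pairs = VOCAB_SIZE * GRID * GRID
```

Under this encoding, the number of distinct word/region pairs is the vocabulary size multiplied by the number of grid cells.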
We use classification accuracy to evaluate the performance of image classification and the average F-measure of the top 5 predicted annotations to evaluate the annotation performance, as in previous work. The F-measure of an image is defined as
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
where recall is the percentage of correctly predicted annotations out of all ground-truth annotations for an image, while the precision is the percentage of correctly predicted annotations out of all predicted annotations (see footnote 5). We used 5 random train/test splits to estimate the average accuracy and F-measure.
Footnote 5: When there are repeated words in the ground-truth annotations, the repeated terms were removed to calculate the F-measure.
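The per-image F-measure can be sketched as follows, using the standard harmonic mean of precision and recall and removing repeated ground-truth terms as described. The function name and the toy annotations are illustrative only.

```python
def f_measure(predicted, ground_truth):
    """F-measure of one image's predicted annotations.

    Repeated words are removed from both lists before scoring.
    """
    predicted_set = set(predicted)
    truth_set = set(ground_truth)
    correct = len(predicted_set & truth_set)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_set)
    recall = correct / len(truth_set)
    return 2 * precision * recall / (precision + recall)


# Top 5 predictions against a ground truth containing a repeated word.
score = f_measure(["sky", "building", "car", "tree", "road"],
                  ["sky", "building", "window", "sky"])
```

Here 2 of the 5 predictions are correct and 2 of the 3 distinct ground-truth words are recovered.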
Image classification with SupDocNADE is performed by feeding the learned document representations to an RBF kernel SVM. In our experiments, all hyper-parameters (learning rate, unsupervised learning weight in SupDocNADE, and in RBF kernel SVM) were chosen by cross-validation. We emphasize that, again following Wang et al. , the annotation words are not available at test time and all methods predict an image’s class based solely on its bag of visual words.
In this section, we describe our quantitative comparison between SupDocNADE, DocNADE and sLDA. We used the implementation of sLDA available at http://www.cs.cmu.edu/~chongw/slda/ in our comparison, to which we fed the same visual (with spatial regions) and annotation words as for DocNADE and SupDocNADE.
Classification performance comparison on LabelMe (even) and UIUC-Sports (odd). On the left, we compare the classification performance of SupDocNADE, DocNADE and sLDA. On the right, we compare the performance between different variants of SupDocNADE. The “varies” means the unsupervised weight in Equation 12 is chosen by cross-validation.
The classification results are illustrated in Figure 3. Similarly, we observe that SupDocNADE outperforms DocNADE and sLDA. Tuning the trade-off between generative and discriminative learning and exploiting position information is usually beneficial. There is just one exception, on LabelMe, with 200 hidden topic units, where using a grid slightly outperforms a grid.
As for image annotation, we computed the performance of our model with 200 topics. As shown in Table 1, SupDocNADE obtains an F-measure of and on the LabelMe and UIUC-Sports data sets respectively. This is slightly superior to regular DocNADE. Since code for performing image annotation using sLDA is not publicly available, we compare directly with the results found in the corresponding paper . Wang et al.  report F-measures of and for sLDA, which are below those of SupDocNADE by a large margin.
We also compare with MMLDA , which has been applied to image classification and annotation separately. The reported classification accuracy for MMLDA is lower than that of SupDocNADE, as shown in Table 1. The performance for annotation reported in Wang et al.  is better than SupDocNADE on LabelMe but worse on UIUC-Sports. We highlight that MMLDA did not deal with the class label and annotation word modalities jointly, the different modalities being treated separately.
The spatial pyramid approach of Lazebnik et al.  could also be adapted to perform both image classification and annotation. We used the code from Lazebnik et al.  to generate two-layer SPM representations with a vocabulary size of 240, which is the same configuration as used by the other models. For image classification, an SVM with Histogram Intersection Kernel (HIK) is adopted as the classifier, as in Lazebnik et al. . For annotation, we used a k-nearest neighbor (KNN) prediction of the annotation words for the test images. Specifically, the top 5 most frequent annotation words among the k nearest images (based on the SPM representation with HIK similarity) in the training set were selected as the prediction of a test image’s annotation words. The number k was selected by cross-validation, for each of the 5 random splits. As shown in Table 1, SPM achieves a classification accuracy of and for LabelMe and UIUC-Sports, which is lower than SupDocNADE. As for annotation, the F-measure of SPM is also lower than SupDocNADE, with and for LabelMe and UIUC-Sports, respectively.
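The KNN annotation baseline can be sketched as below. This is an illustrative reading of the description, not the code used in the experiments: the feature vectors stand in for SPM representations, each neighbor is assumed to vote once per distinct annotation word, and all names and toy data are hypothetical.

```python
from collections import Counter


def histogram_intersection(a, b):
    """Histogram intersection similarity between two non-negative vectors."""
    return sum(min(x, y) for x, y in zip(a, b))


def knn_annotate(test_feature, train_features, train_annotations, k, top_n=5):
    """Predict the top_n most frequent annotation words among the k
    training images most similar to the test image."""
    ranked = sorted(
        range(len(train_features)),
        key=lambda i: histogram_intersection(test_feature, train_features[i]),
        reverse=True,
    )
    votes = Counter()
    for i in ranked[:k]:
        # Assumption: each neighbor votes once per distinct word.
        votes.update(sorted(set(train_annotations[i])))
    return [word for word, _ in votes.most_common(top_n)]


train_features = [[3, 0, 1], [2, 1, 1], [0, 4, 0]]
train_annotations = [["sky", "sea"], ["sky", "boat"], ["forest"]]
prediction = knn_annotate([3, 1, 0], train_features, train_annotations,
                          k=2, top_n=1)
```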
Figure 4 illustrates examples of correct and incorrect predictions made by SupDocNADE on the LabelMe data set.
Since topic models are often used to interpret and explore the semantic structure of image data, we looked at how we could observe the structure learned by SupDocNADE.
We extracted the visual/annotation words that were most strongly associated with certain class labels within SupDocNADE as follows. Given a class label street, which corresponds to a column in matrix , we selected the top 3 topics (hidden units) having the largest connection weight in . Then, we averaged the columns of matrix corresponding to these 3 hidden topics and selected the visual/annotation words with the largest averaged connection weight. The results of this procedure for classes street, sailing, forest and highway are illustrated in Figure 5. To visualize the visual words, we show 16 image patches belonging to each visual word’s cluster, as extracted by k-means. The learned associations are intuitive: for example, the class street is associated with the annotation words “building”, “buildings”, “window”, “person walking” and “sky”, while the visual words showcase parts of buildings and windows.
We now test the performance of SupDeepDocNADE, the deep extension of SupDocNADE, on the large-scale MIR Flickr data set . MIR Flickr is a challenging benchmark for multimodal data modeling tasks. In this section, we will show that SupDeepDocNADE achieves state-of-the-art performance on the MIR Flickr data set over strong baselines: the DBM approach of Srivastava and Salakhutdinov , MDRNN , TagProp  and the multiple kernel learning approach of Verbeek et al. .
The MIR Flickr data set contains million real images that are collected from the image hosting website Flickr. The social tags of each image are also collected and used as annotations in our experiments. Among the million images, there are images that are labeled into classes, such as sky, bird, people, animals, car, etc., giving us a subset of labeled images. Each image in the labeled subset can have multiple class labels. In our experiments, we used images for training and images for testing. The remaining images do not have labels and thus were used for the unsupervised pretraining of SupDeepDocNADE (see next section). The most frequent tags are collected for the annotation vocabulary, following previous work [12, 13]. The average number of annotations for an image is . In the whole data set, images do not have annotations, out of which images are in the labeled subset.
In order to compare directly with the DBM approach of Srivastava and Salakhutdinov , we use the same experimental configuration. Specifically, the images in MIR Flickr are first rescaled to make the maximum side of each image be pixels, keeping the aspect ratio. Then, dimensional SIFT features are densely sampled on these images to extract the visual words. Following Srivastava and Salakhutdinov , we used different scales of patch size, which are pixels, respectively, and the patch step is fixed to pixels. The SIFT features from the unlabeled images were quantized into 2000 clusters, which are used as the visual word vocabulary. Thus, the image modality is represented by the bag of visual words representation using this vocabulary. As preliminary experiments suggested that spatial information (see Section 4.2) wasn’t useful on the Flickr data set, we opted for not using it here. Similarly, the text modality for SupDeepDocNADE is represented using the annotation vocabulary, which is built upon the most frequent tags, as mentioned in Section 6.2.1. The visual words and annotation words are combined together and treated as the input of SupDeepDocNADE.
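The word-extraction procedure above can be sketched as follows. The matrix symbols are not recoverable from this copy of the text, so the argument names `class_to_topic` (connection weights between one class and every hidden topic) and `topic_to_word` (connection weights between each hidden topic and every word) are hypothetical, as are the toy values.

```python
def top_words_for_class(class_to_topic, topic_to_word, n_topics=3, n_words=5):
    """Return indices of the words most associated with a class.

    class_to_topic: list of weights, one per hidden topic, for one class
    topic_to_word: list (one entry per topic) of per-word weight lists
    """
    # 1. Topics with the largest connection weight to the class.
    top_topics = sorted(range(len(class_to_topic)),
                        key=lambda t: class_to_topic[t],
                        reverse=True)[:n_topics]
    # 2. Average the word weights of those topics.
    vocab_size = len(topic_to_word[0])
    averaged = [sum(topic_to_word[t][w] for t in top_topics) / len(top_topics)
                for w in range(vocab_size)]
    # 3. Words with the largest averaged weight.
    return sorted(range(vocab_size), key=lambda w: averaged[w],
                  reverse=True)[:n_words]


class_to_topic = [0.1, 0.9, 0.8, -0.5]
topic_to_word = [[9, 9, 9], [1, 0, 5], [1, 4, 1], [0, 0, 0]]
words = top_words_for_class(class_to_topic, topic_to_word,
                            n_topics=2, n_words=2)
```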
As for the global features (Section 5.5), a combination of Gist  and MPEG-7 descriptors (EHD, HTD, CSD, CLD, SCD) is adopted in our experiments, as in Srivastava and Salakhutdinov . The length of the global features is .
We used a hidden layers architecture in our experiments, with the size of each hidden layer being . Note that the DBM [12, 13] also uses hidden layers with hidden units for each layer, thus our comparison with the DBM is fair. The activation function for the hidden units is the rectified linear function. We used a softmax output layer instead of a binary tree to compute the conditionals for SupDeepDocNADE, as discussed in Section 5.2.
For the prediction of class labels, since images in MIR Flickr could have multiple labels, we used a sigmoid output layer instead of the softmax to compute the probability that an image belongs to a specific class
where is the hidden representation of the top layer. As a result, the supervised cost part in Equation 5.3 is replaced by the cross entropy , where is the number of classes.
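A sigmoid output layer with a multi-label cross-entropy loss can be sketched as follows. The parameter names `U` (class connection matrix) and `d` (class biases) are assumptions for this example, and the loss is the standard sum of per-class binary cross-entropies, which may differ in normalization from the paper's equation.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def class_probabilities(h, U, d):
    """Independent per-class probabilities from the top hidden layer h."""
    return [sigmoid(d[c] + sum(U[c][j] * h[j] for j in range(len(h))))
            for c in range(len(d))]


def multilabel_cross_entropy(probabilities, labels):
    """Sum over classes of the binary cross-entropy (labels are 0 or 1)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probabilities, labels))


h = [0.0, 0.0]
U = [[1.0, -1.0], [0.5, 0.5]]
d = [0.0, 0.0]
probs = class_probabilities(h, U, d)
loss = multilabel_cross_entropy(probs, labels=[1, 0])
```

Unlike a softmax, the per-class probabilities need not sum to one, so an image can be assigned several labels at once.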
In all experiments, the unlabeled images are used for unsupervised pretraining. This is achieved by first training a DeepDocNADE model, without any output layer predicting class labels. The result of this training is then used to initialize the parameters of a SupDeepDocNADE model, which is finetuned on the labeled training set based on the loss of Equation 32.
Once training is finalized, the hidden representation from the top hidden layer after observing all words (both visual words and annotation words) of an image is fed to a linear SVM  to compute confidences of an image belonging to each class. The average precision (AP) for each class is obtained based on these confidences, where AP is the area under the precision-recall curve. After that, the mean average precision (MAP) over all classes is computed and used as the metric to measure the performance of the model. We used the same training/validation/test set splits on the labeled subset of MIR Flickr as Srivastava and Salakhutdinov  and report the average performance on the splits.
To initialize the connection matrices, we followed the recommendation of Glorot and Bengio  and used a uniform distribution:
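The AP and MAP metrics can be illustrated with the sketch below. It uses one common ranking-based estimate of the area under the precision-recall curve (the mean of the precision values at the rank of each positive example); this is an assumption for illustration and not necessarily the exact evaluation code used in the experiments.

```python
def average_precision(scores, labels):
    """Ranking-based AP: mean precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(scores_per_class, labels_per_class):
    """Average the per-class AP values over all classes."""
    aps = [average_precision(s, l)
           for s, l in zip(scores_per_class, labels_per_class)]
    return sum(aps) / len(aps)


confidences = [0.9, 0.8, 0.7, 0.6]          # same toy scores for both classes
ap_first = average_precision(confidences, [1, 0, 1, 0])
map_score = mean_average_precision([confidences, confidences],
                                   [[1, 0, 1, 0], [1, 1, 0, 0]])
```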
where is a connection matrix, , are the number of rows and columns of matrix , respectively, and is the uniform distribution. In practice, we’ve also found it useful to normalize the input histograms for each image, by rescaling them to have unit variance.
The hyper-parameters (learning rate, unsupervised weight , and the parameter for linear SVM, etc.) are chosen by cross-validation. To prevent overfitting, dropout  is adopted during training, with a dropout rate of for all hidden layers. We also maintained an exponentially decaying average of the parameter values throughout the gradient descent training procedure and used the averaged parameters at test time. This corresponds to Polyak averaging , but where the linear average is replaced by a weighting that puts more emphasis on recent parameter values. The annotation weight was fixed to , which is approximately the ratio of the average number of visual words to the average number of annotation words in the data set. We will investigate the impact of the annotation weight on the performance in Section 6.2.4.
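The initialization and input normalization can be sketched as follows. The bound used here is the widely used normalized initialization of Glorot and Bengio, sqrt(6) / sqrt(rows + columns); since the paper's own equation did not survive in this copy, treating it as identical is an assumption. Dividing each histogram by its standard deviation is likewise one plausible reading of "rescaling to unit variance".

```python
import math
import random


def glorot_uniform(n_rows, n_cols, rng):
    """Sample an n_rows x n_cols matrix from U[-b, b], b = sqrt(6 / (rows + cols))."""
    bound = math.sqrt(6.0) / math.sqrt(n_rows + n_cols)
    return [[rng.uniform(-bound, bound) for _ in range(n_cols)]
            for _ in range(n_rows)]


def rescale_to_unit_variance(histogram):
    """Divide a histogram by its standard deviation (no mean subtraction)."""
    n = len(histogram)
    mean = sum(histogram) / n
    variance = sum((x - mean) ** 2 for x in histogram) / n
    if variance == 0:
        return list(histogram)
    std = math.sqrt(variance)
    return [x / std for x in histogram]


W = glorot_uniform(4, 2, random.Random(0))   # bound is exactly 1 here
scaled = rescale_to_unit_variance([0, 4, 0, 4])
```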
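The exponentially decaying parameter average can be sketched in a few lines. The decay value and the flat list representation of the parameters are hypothetical choices made for illustration; the text does not state the decay rate that was used.

```python
def update_average(averaged, current, decay):
    """One step of an exponentially decaying average of parameter values.

    Recent parameters receive weight (1 - decay); older ones decay
    geometrically, unlike the uniform weighting of plain Polyak averaging.
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(averaged, current)]


averaged = [0.0, 0.0]
# Pretend the parameters take these values after successive gradient steps.
for params in ([1.0, 2.0], [1.0, 2.0]):
    averaged = update_average(averaged, params, decay=0.9)
```

At test time, the averaged values would be used in place of the raw parameters.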
|Multiple Kernel Learning SVMs |
|Multimodal DBM |
|SupDeepDocNADE (1 hidden layer, 625 epochs pretraining)|
|SupDeepDocNADE (2 hidden layers, 625 epochs pretraining)|
|SupDeepDocNADE (3 hidden layers, 625 epochs pretraining)|
|SupDeepDocNADE (2 hidden layers, 2325 epochs pretraining)|
|SupDeepDocNADE (3 hidden layers, 2325 epochs pretraining)|
|SupDeepDocNADE (2 hidden layers, 4125 epochs pretraining)|
|SupDeepDocNADE (3 hidden layers, 4125 epochs pretraining)|
Table 2 presents a comparison of the performance of SupDeepDocNADE with the DBM approach of Srivastava and Salakhutdinov  and MDRNN of Sohn et al.  as well as other strong baselines, in terms of MAP performance. We also provide the simple and popular TF-IDF baseline in Table 2 to make the comparison more complete. The TF-IDF baseline is conducted only on the bag-of-words representations of images without global features. We feed the TF-IDF representations to a linear SVM to obtain confidences of an image belonging to each class and then we compute the Mean AP, as for SupDeepDocNADE.
We can see that SupDeepDocNADE achieves the best performance among all methods. More specifically, we first pretrained the model for 625 epochs on the unlabeled data with 1, 2 and 3 hidden layers. The results illustrated in Table 2 show that SupDeepDocNADE outperforms the DBM baseline by a large margin. Moreover, we can see that SupDeepDocNADE with 2 and 3 hidden layers performs better than with only 1 hidden layer, with 625 epochs of pretraining. We then pretrained the model for more epochs on the unlabeled data (2325 epochs). As shown in Table 2, with more pretraining epochs, the deeper model (3 hidden layers) performs even better. This confirms that the use of a deep architecture is beneficial. When the number of pretraining epochs reaches 4125, the SupDeepDocNADE model with 3 hidden layers achieves a MAP of , which outperforms all the strong baselines and increases the performance gap with the 2-hidden-layers model.
From Table 2 we can also see that the performance of 2-layers SupDeepDocNADE does not improve as much as 3-layers SupDeepDocNADE when the number of pretraining epochs increases from to . Figure 6 shows the performance of SupDeepDocNADE w.r.t. the number of pretraining epochs. We can see from Figure 6 that with more epochs of pretraining, the performance of 3-layers SupDeepDocNADE increases faster than that of the 2-layers model, which indicates that the capacity of 3-layers SupDeepDocNADE is bigger than that of the 2-layers model and that this capacity could be leveraged by more pretraining. Figure 6 also suggests that the performance of SupDeepDocNADE could be even better than with more pretraining epochs.
Figure 7 illustrates some failed predictions of SupDeepDocNADE, where the reasons for failure are shown on the left side of each row. One of the reasons for failure is that the local texture/color is ambiguous or misleading. For example, in the first image of the top row, the blue color in the upper side of the wall misleads the model to predict “sky” with a confidence of . Another type of failure, which is shown in the middle row of Figure 7, is caused by images of an abstract illustration of the class. For instance, the model fails to recognize the bird, car and tree in the images of the middle row, respectively, as these images are merely abstract illustrations of these concepts. The third reason illustrated in the bottom row is that the class takes a small portion of the image, making it more likely to be ignored. For example, the female face on the stamp in the first image of the bottom row is too small to be recognized by the model. Note that we just illustrated some failed examples and there might be other kinds of failures. In practice, we also find that some images are not correctly labeled, which might also cause some failures.
Having established that SupDeepDocNADE achieves state-of-the-art performance on the MIR Flickr data set and also discussed some failed examples, we now explore some of its properties in more detail in the following sections.
In Section 5.4, we proposed to weight the annotation words differently to deal with the problem of imbalance in the number of visual and annotation words. In this part, we investigate the influence of the annotation weight on the performance. Specifically, we set the annotation weight to , and show the performance for each of the annotation weight values. Note that when the annotation weight equals , there is no compensation for the imbalance of visual words and annotation words. The other experimental configurations are the same as in Section 6.2.2.
Figure 8 shows the performance comparison between different annotation weights. As expected, SupDeepDocNADE performs extremely poorly when the annotation weight equals . When the annotation weight is increased, the performance gets better. Among all the chosen annotation weights, performs best, achieving a MAP of . The other annotation weights also achieve good performance compared with the DBM model : MAP of , and for annotation weight values of , and , respectively.
Since SupDeepDocNADE is used for multimodal data modeling, we illustrate here some results for multimodal data retrieval tasks.
More specifically, we show some qualitative results in two multimodal data retrieval scenarios: multimodal data query and generation of text from images.
Multimodal Data Query:
Given a query corresponding to an image/annotation pair, the task is to retrieve other similar pairs from a collection, using the hidden representation learned by SupDeepDocNADE. In this task, the cosine similarity is adopted as the similarity metric. In this experiment, each query corresponds to an individual test example and the collection corresponds to the rest of the test set. Figure 10 illustrates the retrieval results for the multimodal data query task, where we show the most similar images to the query input in the test set.
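The retrieval step can be sketched as a cosine-similarity ranking over hidden representations, with the query itself excluded from the collection. The function names and the toy two-dimensional representations are illustrative; in the experiments the representations would come from the trained model.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_index, representations, top_n):
    """Indices of the top_n items most similar to the query (query excluded)."""
    query = representations[query_index]
    candidates = [i for i in range(len(representations)) if i != query_index]
    candidates.sort(key=lambda i: cosine_similarity(query, representations[i]),
                    reverse=True)
    return candidates[:top_n]


# Toy hidden representations for a test set of four image/annotation pairs.
representations = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
nearest = retrieve(0, representations, top_n=2)
```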
In this paper, we proposed SupDocNADE, a supervised extension of DocNADE, which can learn jointly from visual words, annotations and class labels. Moreover, we proposed a deep extension of SupDocNADE which outperforms its shallow version and can be trained efficiently. Although both SupDocNADE and SupDeepDocNADE are the same in nature, SupDeepDocNADE differs from the single layer version in its training process. Specifically, the training process of SupDeepDocNADE is performed over a subset of the words by summing the gradients over several orderings sharing the same permutation up to a randomly selected position , while the single layer version does the opposite and exploits a single randomly selected ordering but updates all the conditionals on the words.
Like all topic models, our model is trained to model the distribution of the bag-of-word representations of images and can extract a meaningful representation from it. Unlike most topic models however, the image representation is not modeled as a latent random variable in a model, but instead as the hidden layer of a neural autoregressive network. A distinctive advantage of SupDocNADE is that it does not require any iterative, approximate inference procedure to compute an image’s representation. Our experiments confirm that SupDocNADE is a competitive approach for multimodal data modeling and SupDeepDocNADE achieves state-of-the-art performance on the challenging multimodal data benchmark MIR Flickr.
, “Multimodal semi-supervised learning for image classification,” in CVPR, 2010.
N. Srivastava and R. R. Salakhutdinov, “Discriminative transfer learning with tree-based priors,” in NIPS, 2013.
Journal of Machine Learning Research, 2011.