The representation of an image has a large impact on the ease and efficiency with which prediction can be performed. This has generated great interest in learning representations directly from data. Generative models for representation learning treat the desired representation as an unobserved latent variable [2, 3, 4]. Topic models, which are generally a group of generative models based on Latent Dirichlet Allocation (LDA), have been successfully applied to learning representations that are suitable for computer vision tasks [5, 6, 7]. A topic model learns a set of topics, which are distributions over words, and represents each document as a distribution over topics. In computer vision applications, a topic is a distribution over visual words, while a document is usually an image or a video. Due to its generative nature, the learned representation provides rich and highly interpretable information about the structure of the data. It is highly compact and, compared to other types of representation, can handle incomplete data to a high degree. Topic models have performed well in many applications. As in other latent-space probabilistic models, the topic distributions can easily be replaced with different distributions to suit the type of the input data. In this paper, we use an LDA model as our basic framework and apply an effective factorized representation learning scheme.
Modeling the essence of the information among all sources of information for a particular task has been shown to offer high interpretability and better performance [6, 8, 9, 10, 11, 12]. For example, in object classification, separating the key features of the object from the intra-class variations and background information is key to performance. The idea of factorized representation can be traced back to the early work of Tucker, 'An Inter-Battery Method of Factor Analysis'; hence, we name the model presented in this paper the Inter-Battery Topic Model (IBTM).
Imagine a scenario in which we want to visually represent "a cup of coffee", illustrated in Figure 1 (a). Apart from a cup of coffee, such images commonly contain additional information that is not correlated with this label, e.g., the rose and the table in the upper image and the coffee beans in the lower image. One can think of the information that is common among all images of this class, and thus correlated with the label, as the shared information. Images with a cup of coffee will share a set of "cup of coffee" topics between them. In addition, each image also contains information that can be found only in a small share of the other images. This information can be thought of as private. Since the shared, but not the private, information should be employed in the estimation task (e.g., classification), it is highly beneficial to use a factorized model which represents the information needed for the task (shared topics) separately from the information that is not task related (private topics).
A similar idea can be applied when two different modalities of the data are available. A common case is images as one modality and the captions of the images as another, as shown in Figure 1 (b). In this scenario, not all of the content in the image has corresponding caption words, and not every word in the caption has corresponding image patches. However, the important aspects of the scene or object depicted in the image are also described in the caption and, vice versa, the central aspects of the caption are those that correlate with what is seen in the image. Based on this idea, an ideal multi-modal representation should factorize out information that is present in both modalities (words describing central concepts, and image patches from the corresponding image areas) and represent it separately from information that is only present in one of the modalities (words not correlated with the image, and image patches in the background). Other examples of modality pairs include video and audio data captured at the same event, or optical flow and depth measurements extracted from a video stream.
To summarize, there is a strong need to model information in a factorized manner such that shared information and private information are represented separately. In our model, the shared part of the representation captures the aspects of the data that are essential for the prediction (e.g., classification) task, leading to better performance. Additionally, inspecting the factorized latent representation gives a better understanding of the structure of the data, which is helpful in the design of domain-specific modeling and data collection.
The main contribution of this paper is a generative model, IBTM, for factorized representation learning, which efficiently factorizes essential information for an estimation task from information that is not task related (Section 3). This results in a very effective latent representation that can be used for prediction tasks, such as classification. IBTM is a general framework, which is applicable to both single- and multi-modal data, and can easily be adapted to data with different noise levels. To infer the latent variables of the model, we derive an efficient variational inference algorithm for IBTM.
We evaluate our model in different experimental scenarios (Section 4). First, we test IBTM on a synthetic dataset to illustrate how the learning is performed. Then we apply IBTM to benchmark datasets in different scenarios to illustrate how different computer vision tasks benefit from IBTM. In a multi-modal setting, modality-specific information is factorized from cross-modality information (Sections 4.2.1.2 and 4.2.2.2). In a uni-modal setting, instance-specific information is factorized from class-specific information (Sections 4.2.1.1 and 4.2.2.1).
2 Related Work
With respect to the scope of this paper, we will summarize the related work mainly from two aspects: Topic Modeling and Factorized Models.
Latent Dirichlet Allocation (LDA) is the cornerstone of topic modeling. In computer vision tasks [5, 6, 7], topic modeling assumes that each visual document is generated by selecting different themes, where the themes are distributions over visual words. In correspondence with other works in representation learning, the themes can be interpreted as factors, components or dictionaries. The topic distribution for each document can be interpreted as factor weights or as a sparse and low-dimensional representation of the visual document. This has achieved promising results in different tasks and provided an intuitive understanding of the data structure. For computer vision tasks, topic modeling has been used for classification, either with supervision in the model [13, 14, 15, 16, 17] or by learning the topic representation in an unsupervised manner and applying standard classifiers such as softmax regression on the latent topic representation. Another interesting direction in computer vision is the multi-modal extension of topic models, which has been applied to tasks such as image annotation [11, 18, 19, 20], contextual action/object recognition and video tagging. Being a generative model, a topic model represents all information found in the data. However, for a specific task, only a portion of this information might be relevant. Extracting this information is essential for a good representation of the data. Hence, a model that describes the key information for the current task is beneficial.
The benefit of modeling the between-view variance separately from the within-view variance was first pointed out by Tucker. It was rediscovered in machine learning in recent years by Ek et al. Recent research in latent structure models has also shown that modeling information in a factorized manner is advantageous both for uni-modal scenarios [10, 12, 22], in which only one type of data is available, and for multi-modal scenarios [6, 9, 21], in which different views correspond to different modalities. For uni-modal scenarios, the special words topic model with a background distribution (SWB) is one of the first studies on factorized representation using topic models for information retrieval tasks. In addition to topics, SWB uses a word distribution for each document to represent document-specific information and a global word distribution for background information. As shown in the experiments, this text-specific factorization model is less suitable for computer vision tasks than IBTM. Works that apply such a factorized scheme to multi-modal topic modeling [6, 11] include the multi-modal factorized topic model and the Video Tags and Topics Model (VTT). The multi-modal factorized topic model, which is based on correlated topic models, only provides an implicit link between different modalities with hierarchical Dirichlet priors, since the factorization is enforced on the logistic normal prior, while VTT is designed only for its specific application.
In this paper, we present a general framework IBTM which models the topic structure in a factorized manner and can be applied to both uni- and multi-modal scenarios.
In this section, we first briefly review LDA, on which IBTM is based, and then present the modeling details and inference of IBTM. Finally, we describe how the latent representation can be used for classification tasks, with which we evaluate our approach.
3.1 Latent Dirichlet Allocation
LDA is a classical generative model that can capture the latent structure of discrete data, for example, a bag-of-words representation of documents. Figure 2 (a) shows the graphical representation of LDA. In LDA, the words (visual words) are assumed to be generated by sampling from a per-document topic distribution and a per-topic word distribution, each of which has a Dirichlet prior. The Dirichlet distribution is a natural choice as it is conjugate to the multinomial distribution.
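The generative process described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names, the symmetric hyper-parameters `alpha` and `sigma`, and their default values are assumptions made for the example, not settings taken from the paper.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_lda_corpus(n_docs, n_words, n_topics, vocab_size,
                        alpha=0.5, sigma=0.5, seed=0):
    """Sample a toy corpus from the LDA generative process.

    alpha, sigma: symmetric Dirichlet hyper-parameters (assumed names/values).
    Returns (topics, docs): topics[k] is a distribution over the vocabulary,
    docs[d] is a list of word indices.
    """
    rng = random.Random(seed)
    # Per-topic word distributions, shared by all documents.
    topics = [sample_dirichlet([sigma] * vocab_size, rng)
              for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Per-document topic distribution.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(n_words):
            # Topic assignment for this word, then the word itself.
            z = rng.choices(range(n_topics), weights=theta)[0]
            w = rng.choices(range(vocab_size), weights=topics[z])[0]
            words.append(w)
        docs.append(words)
    return topics, docs
```

Calling `generate_lda_corpus(5, 20, 3, 10)` returns three topic distributions over a ten-word vocabulary and five documents of twenty word indices each.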
3.2 Inter-Battery Topic Model
We propose IBTM, which models latent variables in a factorized manner for multi-view scenarios. First, we explain how to apply IBTM to a two view scenario, so that it can easily be compared to other models [7, 8, 18, 20]. We then present the more general IBTM, which can encode any number of views.
Two View IBTM.
The two view version of IBTM, shown in Figure 2 (c), is an LDA-based model in which each document contains two views and the words from both views are observed. The two views can represent different types of data, such as two modalities, for example, image and caption as in Figure 1 (b); or two different descriptors for the same data, for example, SIFT and SURF features of the same image. They can also be two instances of the same class, for example, the two cups of coffee in Figure 1 (a).
The key idea of IBTM is that topics are factorized. We do not force the topics from the two views to match completely, since each view commonly has view-specific information. Hence, in our model, a shared topic distribution between the two views of each document is separated from a private topic distribution for each view. As shown in Figure 2 (c), the model has a shared per-document topic distribution with a corresponding per-shared-topic word distribution for each view, and a private per-document topic distribution for each view with a corresponding private per-topic word distribution. To determine how much information is shared and how much is private, a partition parameter is used for each view. The topic assignment for each word in each view is then sampled from the shared and private topic distributions of that view, with the balance between them set by the partition parameter.
The whole IBTM is represented as:
Here, as in the graphical representation of IBTM in Figure 2 (b), the model is specified by the total number of documents; the number of words per document in the first and the second view; the number of shared topics K for both views; the numbers of private topics T and S for the first and the second view respectively; and the vocabulary sizes of the first and the second view.
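As a rough illustration of the two view generative process described in this section, the sketch below samples one document. It assumes that the topic assignment of each word is drawn from the concatenation of the shared topic distribution, weighted by the partition parameter, and the private topic distribution, weighted by one minus the partition parameter, with shared topics first. The variable names (`theta`, `kappa`, `rho`) and the hyper-parameter values are hypothetical choices for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_view(theta, private_size, shared_topics, private_topics,
                beta_params, n_words, alpha, rng):
    """Generate the words of one view of one document."""
    kappa = sample_dirichlet([alpha] * private_size, rng)  # private topic distribution
    rho = rng.betavariate(*beta_params)                    # partition parameter
    # Shared topics come first, private topics last.
    mixture = [rho * p for p in theta] + [(1.0 - rho) * p for p in kappa]
    word_dists = shared_topics + private_topics
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(mixture)), weights=mixture)[0]
        vocab = range(len(word_dists[z]))
        words.append(rng.choices(vocab, weights=word_dists[z])[0])
    return words, rho

def sample_document(K, T, S, topics, n_words=100, alpha=0.5,
                    beta_params=((8.0, 2.0), (2.0, 2.0)), rng=None):
    """Sample both views of one document.

    topics maps 'shared1', 'private1', 'shared2', 'private2' to lists of
    word distributions; the beta_params defaults are assumed values.
    """
    rng = rng or random.Random(0)
    theta = sample_dirichlet([alpha] * K, rng)  # shared by both views
    view1 = sample_view(theta, T, topics["shared1"], topics["private1"],
                        beta_params[0], n_words, alpha, rng)
    view2 = sample_view(theta, S, topics["shared2"], topics["private2"],
                        beta_params[1], n_words, alpha, rng)
    return view1, view2
```

A partition parameter close to one means that almost all words of that view are explained by shared topics, while a small value pushes most words into the private topics of the view.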
Mean Field Variational Inference.
Exact inference on this model is intractable due to the coupling between latent variables. Variational inference and sampling based methods are the two main groups of methods to perform approximate inference. Variational inference is known for its fast convergence and theoretical attractiveness. It can also be easily adapted to online requirements when facing big data or streaming data. Hence, in this paper, we use mean field variational inference for IBTM. The fully factorized variational distribution is assumed following the mean field manner:
For each term above, the variational distributions are as follows. The per-document topic distributions, one shared and one private for each view, are Dirichlet distributions. The per-word topic assignments are multinomial distributions: in the first view over K + T topics, such that the first K topics correspond to the shared topics and the last T topics correspond to the private topics; in the second view over K + S topics, such that the first K topics correspond to the shared topics and the last S topics correspond to the private topics. The per-document partition parameters are beta distributions. Finally, the per-topic word distributions, shared and private for each view, are Dirichlet distributions. All the variational distributions follow the same family of distributions as under the model assumption.
Applying Jensen's inequality to the log likelihood of the model, we get the evidence lower bound (ELBO):
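In generic form, and not as the model-specific expansion, the bound can be written as follows. Here $w$ stands for all observed words, $\Omega$ for all latent variables and $q$ for the variational distribution; these symbols are placeholders assumed for this sketch, conditioning on the hyper-parameters is omitted, and the integral is understood as a sum for discrete variables.

```latex
\log p(w)
  = \log \int q(\Omega)\,\frac{p(w,\Omega)}{q(\Omega)}\,d\Omega
  \;\ge\; \mathbb{E}_{q}\!\left[\log p(w,\Omega)\right]
        - \mathbb{E}_{q}\!\left[\log q(\Omega)\right]
```

The gap between the two sides is the Kullback-Leibler divergence from $q(\Omega)$ to the true posterior $p(\Omega \mid w)$, so maximizing the bound over $q$ moves the variational distribution toward the posterior.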
By maximizing the ELBO, we get the update equations for the variational parameters. Only the ones that differ from LDA are presented here and derivation details are presented in the supplementary material. The update equations for the per document topic variational distribution are:
The update equation for the topic assignment in the first view is, when :
and when (as ):
The update equations for the partition parameters are:
The update for the second view follows equivalently.
In the implementation, all global latent variables are initialized randomly except for the shared per-topic word distribution for the second modality, which is initialized uniformly. The exchangeability of the Dirichlet distribution leads to rotational symmetry in the inference; initializing only one of the shared per-topic word distributions randomly therefore increases the robustness of the model performance.
It is straightforward to generalize the two view IBTM to more views. The graphical representation of the generalized IBTM is shown in Figure 2 (b). With two views, the models in Figure 2 (b) and 2 (c) are identical. The inference procedure can be adapted easily, since the updates of both topic assignments and partition parameters for each view follow the same form. The only difference is the per-document shared topic variational distribution, which depends on the variational distributions of the topic assignments for every view.
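To make the multi-view dependence concrete, the sketch below shows one plausible form of the update for the shared per-document topic variational parameter. It is written by analogy with the standard LDA mean-field update, in which the Dirichlet parameter is the prior plus the expected topic counts, and it is an assumption of this example rather than the derivation given in the supplementary material. The names `alpha`, `phis` and `update_shared_gamma` are hypothetical.

```python
def update_shared_gamma(alpha, phis, K):
    """Accumulate expected shared-topic counts over all views.

    alpha: prior parameters of the shared topic distribution (length K).
    phis:  phis[m][n] is the variational topic-assignment vector of word n
           in view m; its first K entries belong to the shared topics.
    Returns the updated Dirichlet parameters (length K).
    """
    gamma = list(alpha)
    for phi_view in phis:            # every view contributes
        for phi_word in phi_view:    # every word in that view
            for k in range(K):       # only the shared part is used
                gamma[k] += phi_word[k]
    return gamma
```

With two views this reduces to summing the shared part of the assignments from both views, and with one view it matches the familiar LDA update restricted to the shared topics.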
Topic models provide a compact representation of the data. Both LDA and IBTM are unsupervised models and can be used for representation learning. The topic representation can be applied to different tasks, for example, image classification and image retrieval. With LDA, the whole topic representation is commonly employed for these tasks. With IBTM, we rely only on the shared topic space, which represents the essence of the information. For image classification, we can simply apply a Support Vector Machine or softmax regression, taking the shared topic representation as the input. In our experimental evaluation, softmax regression is used. Although there are different types of supervised topic models [13, 16] in which the class label is encoded as part of the model, it has been shown that the performance on computer vision classification tasks is similar for a supervised model and for an unsupervised model with an additional classifier. The minor improvement in performance commonly comes with a significant increase in computational cost. Hence, we keep IBTM as a general framework for representation learning in an unsupervised manner.
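The classification step can be sketched as plain softmax regression trained on shared topic proportions. The following is a minimal standard-library implementation using stochastic gradient descent; the learning rate, the number of epochs and the toy feature vectors are assumptions made for illustration, and a library classifier would normally be used in practice.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(weights, x):
    """Linear scores for every class; the last weight is the bias."""
    xb = list(x) + [1.0]
    return [sum(w_i * x_i for w_i, x_i in zip(w, xb)) for w in weights]

def train_softmax_regression(features, labels, n_classes, lr=0.5, epochs=200):
    """Fit softmax regression with per-sample gradient steps."""
    n_dims = len(features[0])
    weights = [[0.0] * (n_dims + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in zip(features, labels):
            xb = list(x) + [1.0]
            probs = softmax(class_scores(weights, x))
            for c in range(n_classes):
                # Gradient of the cross-entropy loss w.r.t. the class score.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(n_dims + 1):
                    weights[c][j] -= lr * grad * xb[j]
    return weights

def predict(weights, x):
    """Return the index of the highest-scoring class."""
    scores = class_scores(weights, x)
    return max(range(len(scores)), key=scores.__getitem__)
```

Here each feature vector would be the shared topic proportions of one document, so the classifier never sees the private topics.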
In the experiments, we first evaluate the inference scheme and demonstrate the model behavior in a controlled manner in Section 4.1. Then we use two benchmark datasets to evaluate the model behavior in real world scenarios in Section 4.2. For this purpose, we use the LabelMe natural scene data for natural scene classification [18, 24, 25] and the Leeds butterfly dataset for fine-grained categorization.
4.1 Inference Evaluation using Synthetic Data
To test the inference performance, we generate a set of synthetic data using the model, given different topic distributions and hyper-parameters. We generate 500 documents and each document has 100 words for each view. Given the generated data, a correct inference algorithm should be able to recover all the latent parameters. Figure 4 (a) shows the ground truth that we used for the per-topic word distributions and the estimation of these latent variables using variational inference as described in Section 3.2. All the topics are correctly recovered. Due to the exchangeability of the Dirichlet distribution, the estimation returns the topics in a different order, which is shown as row-wise exchanges in Figure 4 (b). Figure 4 shows the parameter recovery for the partition parameters of the two views, which are generated from beta distributions. In the example, the hyper-parameters of the beta distributions are chosen such that the first view is comparably clean, while the second view is noisier, with large variations in the noise level among the data. As Figure 4 shows, almost all the partition parameters are correctly recovered.
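Because the recovered topics come back in an arbitrary order, comparing them with the ground truth requires aligning the two sets first. The sketch below does this by exhaustive search over permutations with an L1 distance; both the distance and the search strategy are choices made for this example, and the exhaustive search is only practical for a handful of topics.

```python
from itertools import permutations

def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def align_topics(true_topics, estimated_topics):
    """Find the order of estimated topics that best matches the truth.

    Returns (perm, cost) such that estimated_topics[perm[k]] is matched
    to true_topics[k] and cost is the total L1 distance of the matching.
    """
    n = len(true_topics)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(l1_distance(true_topics[k], estimated_topics[perm[k]])
                   for k in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost
```

A total cost near zero after alignment indicates that every ground truth topic has a closely matching estimate, which is the row-wise exchange visible in the figure.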
4.2 Performance Evaluation using Real-World Data
In this section, model performance is evaluated on real-world data. We present two groups of experiments. The first uses the LabelMe natural scene dataset [18, 24] and the second uses the Leeds butterfly dataset for fine-grained classification. We first focus on the model behavior, investigating the distribution of topics and partition parameters. This provides insight into the data structure and the model behavior. Thereafter, we present the classification performance. In these experiments, the classification results are obtained by applying softmax regression to the topic representation. In all experimental settings, the hyper-parameters for the per-document topic distributions, the per-topic word distributions and the partition variables are kept fixed. We also perform experiments with different features, including off-the-shelf CNN features from different layers and traditional SIFT features. Here, we only present the results using off-the-shelf CNN conv5_1 features as an example. We use the pre-trained Oxford VGG 16-layer CNN for feature extraction. We create sliding windows at 3 scales with a step size of 32 pixels to extract features, in the same manner as in prior work, and use K-means clustering to create a codebook and represent each image as a bag of visual words. The vocabulary size is 1024. In general, the performance is robust when higher layers are used and when the vocabulary size is sufficient. More results using different features and different parameter settings are enclosed in the supplementary material.
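The bag-of-visual-words step can be illustrated as follows, assuming that a codebook has already been learned with K-means and that each local feature is a plain list of floats. The function names are hypothetical and the tiny two-dimensional codebook stands in for the 1024-word vocabulary over CNN features.

```python
from collections import Counter

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_codeword(feature, codebook):
    """Index of the codebook centre closest to the feature."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(feature, codebook[i]))

def bag_of_visual_words(features, codebook):
    """Count how many local features fall on each visual word."""
    counts = Counter(nearest_codeword(f, codebook) for f in features)
    return [counts.get(i, 0) for i in range(len(codebook))]
```

For example, with the codebook `[[0, 0], [1, 1], [5, 5]]`, the features `[[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]` give the histogram `[1, 2, 1]`, which is the word-count vector passed to the topic model.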
4.2.1 LabelMe Dataset.
We use the LabelMe Dataset as in [18, 25] for this group of the experiments. The LabelMe dataset contains 8 classes of images: highway, inside city, coast, forest, tall buildings, street, open country and mountain. For each class, 200 images are randomly selected, half of which are used for training, and half of which are used for testing. This results in 800 training and 800 testing images. We perform the experiment in two different scenarios: Image and Image, where only images are available; and Image and Annotation, where different modalities are available.
4.2.1.1 Image and Image.
In this experiment, we explore the scenario in which only one modality is available. We want to model the essential information that captures the within-class variations and explains away the instance-specific variations. Both views are bag-of-CNN Conv5_1 feature representations of the image data. For each document, two training images from the same class are randomly paired. This represents the scenario shown in the introductory Figure 1 (a). For the experimental results presented below, the numbers of shared and private topics are fixed. The performance is robust given a sufficient number of topics, and more results with different numbers of topics are presented in the supplement.
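The construction of two view documents from a single modality can be sketched as below. Each training image is used once as the first view and receives a randomly chosen partner from the same class as the second view. Whether an image may be paired with itself, and whether partners are drawn with or without replacement, is not specified above; this sketch shuffles the images within each class, which is an assumption of the example.

```python
import random
from collections import defaultdict

def pair_within_class(image_ids, labels, seed=0):
    """Build two view documents by pairing images of the same class.

    Returns a list of (first_view_id, second_view_id, label) tuples,
    one per input image.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, label in zip(image_ids, labels):
        by_class[label].append(image_id)
    documents = []
    for label, ids in by_class.items():
        partners = ids[:]          # copy so the original order is kept
        rng.shuffle(partners)      # random partner within the class
        for first, second in zip(ids, partners):
            documents.append((first, second, label))
    return documents
```

Since both views of a document always come from the same class, whatever they have in common is tied to the class, and what differs is specific to the individual images.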
Figure 7 shows the histograms of the partition parameters in this case. Figure 7 (a) and (b) appear similar. This matches intuition: since both views are images and they are randomly paired within the same class, the statistical features are expected to be the same for both views. Most partition parameters are larger than 0.8, which means that a large part of the information can be shared between images from the same class and that the CNN Conv5_1 features provide a good raw representation of the images. For image pairs with more variation that does not correlate with the image class, the partition parameters will be smaller. The ratio of essential information varies among images, which causes the partition parameters to vary among different images.
| DocNADE | SupDocNADE | Full SVM | PCA15 SVM | LDA15 | SWB15 | IBTM15 |
Figure 5 visualizes the document distribution in different topic representation spaces. Figure 5 (a) shows that documents from different classes are well separated in the space defined by the shared topic representation. Figure 5 (b) and (c) show that documents from different classes are more mixed in the private topic spaces. Thus, the private information is used to explain instance-specific features of a data point, but not class-specific features; these have been pushed into the shared space, as intended by the model. The variations in the private spaces are small due to the low noise ratio in the dataset. For classification where only images are available, IBTM uses only the shared representation. The classification results are summarized in Table 1. A standard LDA obtains better performance than PCA with the same number of dimensions. IBTM outperforms LDA with the same number of topics and can even obtain better results than using the full dimension (1024) of bag-of-Conv5_1 features together with a linear SVM. With SWB, the performance is unsatisfactory for such computer vision tasks due to the noisy properties of images. (We implemented SWB using Gibbs sampling following the description in the original paper, with the same parameter settings. A linear SVM is used for classification on the topic representation from SWB. More analysis using SWB is presented in the supplementary material of this paper.) The results show that IBTM is able to learn a factorized latent representation, which separates task-relevant variation in the data from variation that is less relevant for the task at hand, here classification.
4.2.1.2 Image and Annotation.
In this experiment, we explore the scenario in which two different modalities are available as different views. We use the bag-of-Conv5_1 representation of images as the first view and the image annotations as the second view. The word counts for the annotations are scaled with the annotated region. For each document, 79 Conv5_1 features are extracted from the image view, and the sum over the word histogram for each view is normalized to 100. The numbers of shared and private topics are fixed for the experimental results presented here. Figure 7 shows the histograms of the partition parameters for the two views respectively. Figure 7 (b) shows that the partition parameters are more concentrated around large values compared to Figure 7 (a), which indicates that a larger share of the annotation information is essential. This is consistent with the intuition about the relative noise levels in image and annotation data.
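The preprocessing of the annotation view can be sketched as follows. The sketch assumes one reading of the description above: each annotation word is weighted by the area of its annotated region, and the resulting histogram is rescaled so that it sums to 100. The input format, a list of `(word, region_area)` pairs, and the function name are hypothetical.

```python
def annotation_histogram(annotations, vocabulary, total=100.0):
    """Area-weighted word histogram, rescaled to sum to `total`.

    annotations: list of (word, region_area) pairs for one image.
    vocabulary:  ordered list of annotation words.
    """
    weights = {word: 0.0 for word in vocabulary}
    for word, area in annotations:
        if word in weights:          # ignore out-of-vocabulary words
            weights[word] += area
    mass = sum(weights.values())
    if mass == 0:
        return [0.0] * len(vocabulary)
    return [total * weights[word] / mass for word in vocabulary]
```

For instance, the annotations `[("sky", 0.5), ("tree", 0.25), ("sky", 0.25)]` over the vocabulary `["sky", "tree", "car"]` give `[75.0, 25.0, 0.0]`, so both views contribute histograms of the same total mass to the model.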
| Full SVM | PCA 15 | LDA15 | SWB 2V | IBTM15 1V | IBTM15 2V |
Figure 8 shows the distribution of documents using different topic representations. As in the previous experiment, documents from different classes are well separated in the shared topic representation and are more mixed in the private topic representations. Table 2 summarizes the classification performance. (The difference in Full SVM performance between Table 1 and Table 2 is due to different random data partitions.) IBTM outperforms the other methods even when only images are available for testing. When both modalities are available, the performance improves further. For reference, the ideal classification performance by humans on this dataset has also been reported in the literature.
4.2.2 Leeds Butterfly Dataset.
In this section, the Leeds butterfly dataset is used to evaluate the IBTM model on a fine-grained classification task. This dataset contains 10 classes of butterfly images collected from Google image search; both the original images with cluttered backgrounds and segmentation masks for the butterflies are provided. For each class, 55 to 100 images have been collected and there are 832 images in total. In this experiment, 30 images are randomly selected from each class for training and the remaining 532 images are used for testing. As above, we perform the experiment in two different scenarios: Image and Image, where only the natural images with cluttered backgrounds are available; and Image and Segmentation, where one modality is the natural image and the other modality is the segmented image.
4.2.2.1 Image and Image.
In this experiment, we use only the natural images to evaluate the model performance in the uni-modal scenario. The experimental setting is similar to that of Section 4.2.1.1, where two images from the same class are paired randomly. The numbers of shared and private topics are again fixed for the results presented here. The histograms in Figure 11 are similar to those for the previous dataset, but with smaller values, since natural images of butterflies contain more background information that is unrelated to the class of the butterfly, whereas in the LabelMe dataset almost the whole image contributes to the natural scene class.
Figure 9 visualizes the image distribution in the different topic representations, where the shared topic representation separates images from different classes better than the private ones. Table 3 summarizes the classification performance for this dataset. There, "II IBTM15" shows the result of IBTM using only natural images, which obtains the highest performance in this uni-modal setting with only 15 topics.
| NLD | Full SVM | PCA 15 | II SWB15 | IS SWB15 | LDA15 | II IBTM15 | IS IBTM 1V | IS IBTM 2V |
Note on NLD: Learning Models for Object Recognition from Natural Language Descriptions (NLD) trained a classification model based on text descriptors. At test time, visual information is used to extract attributes from each image to fit the text template. The experimental setting is different from our experiments. However, we include the result from the original paper for completeness.
4.2.2.2 Image and Segmented Image.
In this experimental setting, natural images and segmented images are used as two different views for training to demonstrate the multi-modal scenario. The segmented images are used as the first view and the natural images are used as the second view. Since the model is symmetric, the order of the views has no impact on the model. Figure 11 shows the histogram of the partition parameters. It is apparent that the partition parameters of the segmented images are more concentrated around large values. Thus, the model has learned that the segmented images contain more relevant information. This is consistent with human intuition. Figure 12 shows the topic distribution using shared and private latent representations, where the shared topic representations for different classes are naturally separated. Classification performance is summarized in Table 3. SWB performs better on this dataset than on the LabelMe dataset. The reason for this is probably that the visual words here are less noisy than in LabelMe. "IS IBTM 1V" denotes the performance of testing with only natural images and "IS IBTM 2V" shows the performance of testing with both natural images and their segmentation. We can see that IBTM performs better than the other methods even if only natural images are available for testing. With the segmentation, the performance is almost ideal.
In this paper, we proposed IBTM, a variant of the topic model with a factorized latent representation. It models shared and private information across different views, which has been shown to be beneficial for different computer vision tasks. Experimental results show that IBTM can effectively encode the task-relevant information. Using this representation, state-of-the-art results are achieved in different experimental scenarios.
In this paper, the focus lay on exploring the concept of factorized representations and the experiments were centered around two view scenarios. In future work, we plan to evaluate the performance of IBTM with any number of views and in different scenarios such as cue integration. Finally, efficient inference algorithms are key for probabilistic graphical models in general. In this paper, we used variational inference in a batch manner. In the future, more efficient and robust inference algorithms [29, 30] can be explored.
- [1] Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. PAMI 35(8) (2013) 1798–1828
- [2] Tipping, M.E., Bishop, C.M.: Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society 61(3) (1999) 611–622
- [3] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. JMLR 3 (2003) 993–1022
- [4] Lawrence, N.D.: Gaussian Process Latent Variable Models for visualisation of high dimensional data. In: NIPS. (2004) 329–336
- [5] Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: CVPR. Volume 2., IEEE (2005) 524–531
- [6] Hospedales, T.M., Gong, S.G., Xiang, T.: Learning tags from unsegmented videos of multiple human actions. In: ICDM. (2011)
- [7] Zhang, C., Song, D., Kjellstrom, H.: Contextual Modeling with Labeled Multi-LDA. In: IROS, Tokyo, Japan (2013)
- [8] Tucker, L.R.: An Inter-Battery Method of Factor Analysis. Psychometrika 23 (June 1958)
- [9] Damianou, A., Ek, C.H., Titsias, M., Lawrence, N.D.: Manifold Relevance Determination. In: ICML. (2012) 145–152
- [10] Zhang, C., Ek, C.H., Damianou, A., Kjellstrom, H.: Factorized Topic Models. In: ICLR. (2013)
- [11] Virtanen, S., Jia, Y., Klami, A., Darrell, T.: Factorized multi-modal topic model. arXiv preprint arXiv:1210.4920 (2012)
- [12] Zhang, C., Kjellstrom, H.: How to Supervise Topic Models. In: ECCV workshop on Graphical Models in Computer Vision. (2014)
- [13] Blei, D.M., McAuliffe, J.D.: Supervised Topic Models. arXiv preprint arXiv:1003.0783 (2010)
- [14] Zhang, C., Ek, C.H., Gratal, X., Pokorny, F.T., Kjellström, H.: Supervised hierarchical Dirichlet processes with variational inference. In: ICCV workshop on Inference for probabilistic graphical models. (2013)
- [15] Lacoste-Julien, S., Sha, F., Jordan, M.I.: DiscLDA: Discriminative learning for dimensionality reduction and classification. In: NIPS. (2008)
- [16] Zhu, J., Ahmed, A., Xing, E.P.: MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification. In: ICML. (2009)
- [17] Zhu, J., Chen, N., Perkins, H., Zhang, B.: Gibbs max-margin supervised topic models with fast sampling algorithms. In: ICML. (2013)
- [18] Wang, C., Blei, D., Fei-Fei, L.: Simultaneous image classification and annotation. In: CVPR, IEEE (2009) 1903–1910
- [19] Wang, Y., Mori, G.: Max-margin Latent Dirichlet Allocation for Image Classification and Annotations. In: BMVC. (2011) 7
- [20] Blei, D.M., Jordan, M.I.: Modeling annotated data. In: International Conference on Research and Development in Information Retrieval, ACM (2003) 127–134
- [21] Ek, C.H., Rihan, J., Torr, P., Rogez, G., Lawrence, N.: Ambiguity modeling in latent spaces. In: Machine learning for multimodal interaction. Springer (2008) 62–73
- [22] Chemudugunta, C., Smyth, P., Steyvers, M.: Modeling general and specific aspects of documents with a probabilistic topic model. In: NIPS. Volume 19. (2006) 241–248
- [23] Blei, D., Lafferty, J.: Correlated topic models. In: NIPS. Volume 18. (2006) 147
- [24] Li, L.J., Su, H., Lim, Y., Fei-Fei, L.: Objects as attributes for scene classification. In: Trends and Topics in Computer Vision. Springer (2012) 57–69
- [25] Zheng, Y., Zhang, Y.J., Larochelle, H.: Topic Modeling of Multimodal Data: An Autoregressive Approach. In: CVPR. (June 2014) 1370–1377
- [26] Wang, J., Markert, K., Everingham, M.: Learning models for object recognition from natural language descriptions. In: BMVC. (2009)
- [27] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
- [28] Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolutional activation features. In: ECCV, Springer (2014) 392–407
- [29] Minka, T.P.: Divergence measures and message passing. In: Microsoft Research Technical Report. (2005)
- [30] Hoffman, M.D., Blei, D.M.: Structured stochastic variational inference. In: AISTATS. (2015)