MultiDEC: Multi-Modal Clustering of Image-Caption Pairs

01/04/2019 ∙ by Sean Yang, et al. ∙ University of Washington

In this paper, we propose a method for clustering image-caption pairs by simultaneously learning image representations and text representations that are constrained to exhibit similar distributions. These image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce but free-text descriptions are common. MultiDEC initializes parameters with stacked autoencoders, then iteratively minimizes the Kullback-Leibler divergence between the distribution of the images (and text) and a combined joint target distribution. We regularize by penalizing non-uniform distributions across clusters. The representations that minimize this objective produce clusters that outperform both single-view and multi-view techniques on large benchmark image-caption datasets.




1 Introduction

In many science and engineering applications, images are equipped with free-text descriptions, but structured training labels are difficult to acquire. For example, the figures in the scientific literature are an important source of information [Sethi et al.2018, Lee et al.2017], but no training data exists to help models learn to recognize particular types of figures. These figures are, however, equipped with a caption describing the content or purpose of the figure, and these captions can be used as a source of (noisy) supervision.

Grechkin et al. used distant supervision and co-learning to jointly train an image classifier and a text classifier, and showed that this approach offered improved performance [Grechkin, Poon, and Howe2018]. However, this approach relied on an ontology as a source of class labels. No consensus on an ontology exists in specialized domains, and any ontology that does exist will change frequently, requiring re-training. Our goal is to perform unsupervised learning using only the image-text pairs as input.

A conventional approach is to cluster the images alone, ignoring the associated text. Unsupervised image clustering has received significant research attention in computer vision recently [Xie, Girshick, and Farhadi2016, Yang et al.2017, Yang, Parikh, and Batra2016]. However, as we will show, these single-view approaches fail to produce semantically meaningful clusters on benchmark datasets. Another conventional solution is to cluster the corresponding captions using NLP techniques, ignoring the content of the images. However, the free-text descriptions are not a reliable representation of the content of the image, resulting in incorrect assignments.

Current multi-modal image-text models focus on matching images and corresponding captions for information retrieval tasks [Karpathy and Fei-Fei2015, Dorfer et al.2018, Carvalho et al.2018], but there is less work on unsupervised learning for both images and text. Jin et al. [Jin et al.2015] solved a similar problem in which they used Canonical Correlation Analysis (CCA) to characterize correlations between image and text. However, the textual information for their model consisted of explicit tags rather than long-form free-text descriptions. Unlike tags, free-text descriptions are extremely noisy: they often contain significant irrelevant information, and may not even describe the content of the image.

We propose MultiDEC, a clustering algorithm for image-text pairs that considers both visual features and text features and simultaneously learns representations and cluster assignments for images. MultiDEC extends prior work on Deep Embedded Clustering (DEC) [Xie, Girshick, and Farhadi2016], which is a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping function from the data space to a lower-dimensional feature space in which it iteratively optimizes the Kullback-Leibler divergence between the embedded data distribution and a computed target distribution. DEC has shown success on clustering several benchmark datasets including both images and text (separately).

Despite its utility, in our experiments DEC may generate empty clusters or assign clusters to outlier data points, which is a common problem in clustering tasks [Dizaji et al.2017, Caron et al.2018]. We address the problem of empty clusters by introducing a regularization term to force the model to find a solution with a more balanced assignment.

We use a pair of DEC models to take data from image and text. Building on the target distribution in [Xie, Girshick, and Farhadi2016], we propose a joint distribution for both the embedded image features and the text features. MultiDEC simultaneously learns the image representation by iterating between computing the joint target distribution and minimizing the KL divergence between the embedded data distribution and the computed joint target distribution. We evaluate our method with four benchmark datasets and compare to both single-view and multi-view methods. Our method shows significant improvement over other algorithms on all large datasets.

In this paper, we make the following contributions:

  • We propose a novel model, MultiDEC, that considers semantic features from corresponding captions to simultaneously learn feature representations and cluster centroids for images. MultiDEC iterates between computing a joint target distribution from image and text and minimizing the regularized KL divergence between the soft assignments and the joint target distribution.

  • We run a battery of experiments to compare our method to multiple single-view and multi-view algorithms on four different datasets and demonstrate the superior performance of our model.

  • We conduct a qualitative analysis to show that MultiDEC separates the semantically and visually similar data points and is robust to noisy and missing text.

2 Related Work

We consider related work in both single-view and multi-modal methods.

Single-View Image Clustering

Image clustering with deep neural networks (DNNs) has received increasing interest in recent studies. Deep Embedded Clustering (DEC) [Xie, Girshick, and Farhadi2016] simultaneously learns cluster centroids and DNN parameters by iterating between computing an auxiliary target distribution and minimizing the Kullback-Leibler divergence to it. [Guo et al.2017] then proposed Improved Deep Embedded Clustering (IDEC), an enhanced version of DEC that preserves the local embedding structure by applying an under-complete autoencoder. DCN was presented by [Yang et al.2017]; the model simultaneously learns to cluster and to reduce dimensionality in a way that facilitates K-means cluster analysis. Dizaji et al. introduced DEPICT [Dizaji et al.2017], which consists of a multinomial logistic regression function stacked on top of a multilayer convolutional autoencoder and minimizes a regularized KL divergence to map data into a discriminative embedding subspace and predict cluster assignments. A new clustering framework, JULE, was demonstrated by [Yang, Parikh, and Batra2016]; it recurrently updates cluster results with agglomerative clustering and image representations with convolutional neural networks. JULE, however, does not scale, owing to the high computational cost of affinity calculation and the large number of hyperparameters that must be tuned, which is impractical in real-world clustering settings. Unlike our model, these models do not consider the semantic components of the images and are mostly tested on toy datasets.

Multi-View Image-Text Learning

Joint image-text embedding models have become increasingly popular in applications including image captioning [Mao et al.2015, Xu et al.2015, Karpathy and Fei-Fei2015], question answering [Antol et al.2015], and information retrieval [Karpathy and Fei-Fei2015, Dorfer et al.2018, Carvalho et al.2018]. DeViSE [Frome et al.2013] is the first method to generate visual-semantic embeddings that linearly transform a visual embedding from a pre-trained deep neural network into the embedding space of a textual representation. The method begins with a pre-trained language model, then optimizes the visual-semantic model with a combination of dot-product similarity and hinge rank loss as the loss function. After DeViSE, several visual-semantic models have been developed by optimizing a bi-directional pairwise ranking loss [Kiros, Salakhutdinov, and Zemel2014, Wang, Li, and Lazebnik2016] or a maximum mean discrepancy loss [Tsai, Huang, and Salakhutdinov2017]. Maximizing CCA (Canonical Correlation Analysis) [Hardoon, Szedmak, and Shawe-Taylor2004] is also a common way to acquire cross-modal representations. Yan et al. [Yan and Mikolajczyk2015] address the problem of matching images and text in a joint latent space learned with deep canonical correlation analysis. Dorfer et al. [Dorfer et al.2018] develop a canonical correlation analysis layer and then apply a pairwise ranking loss to learn a common representation of image and text for information retrieval tasks. However, most image-text multi-modal studies focus on matching image and text. Few methods study the problem of unsupervised clustering of image-text pairs.

Jin et al. addressed a related problem where they aim to cluster images by integrating the multimodal feature generation with the Locality Linear Coding (LLC) and co-occurrence association network, multimodal feature fusion with CCA, and accelerated hierarchical k-means clustering [Jin et al.2015]. However, the text data they handled are tags instead of longer, noisy, and unreliable free-text descriptions as we do in MultiDEC. Grechkin et al. proposed EZLearn [Grechkin, Poon, and Howe2018], a co-training framework which takes image-text data and an ontology to classify images using labels from the ontology. This model requires prior knowledge of the data in order to derive an ontology; this prior knowledge is not always available, and can significantly bias the results toward the clusters implied by the ontology.

3 Method

Figure 1: Overview of our method. MultiDEC includes two phases: parameter initialization and clustering. DNN parameters and centroids are initialized using a stacked reconstructing autoencoder and K-means on the embedded data points. In the clustering phase, parameters and centroids are updated by iterating between computing a joint target distribution and minimizing KL divergence. This figure is best viewed in color.

Figure 1 shows an overview of our method. MultiDEC clusters data by simultaneously learning DNN parameters $\theta$ and $\theta'$ that map data points $X$ (images) and $X'$ (text) to $Z$ and $Z'$, and a set of centroids $\mu$ and $\mu'$ in the latent spaces $Z$ and $Z'$, respectively. Following [Xie, Girshick, and Farhadi2016], our method includes two phases: (1) Parameter initialization. We initialize $\theta$ and $\theta'$ with two stacked autoencoders to learn meaningful representations from both views and apply K-means to obtain initial centroids. (2) Clustering. MultiDEC updates the DNN parameters and centroids by iterating between computing a joint auxiliary target distribution and minimizing the Kullback-Leibler (KL) divergence to the calculated target distribution. Each phase is described in detail below.

Parameter Initialization

We initialize the DNN parameters with two stacked autoencoders (SAE). Stacked autoencoders have been shown to generate semantically meaningful representations in several studies (cf. [Vincent et al.2010, Le2013, Xie, Girshick, and Farhadi2016]). We use a symmetric stacked autoencoder to learn the initial DNN parameters for each view by minimizing the mean squared error between the output and the input. After training the autoencoders, we discard the decoders, pass data $X$ and $X'$ through the trained encoders, and apply K-means to the embeddings in $Z$ and $Z'$ to obtain initial centroids $\mu$ and $\mu'$.

With the initialization of DNN parameters and centroids, MultiDEC updates the parameters and the centroids by iterating between computing a joint target distribution and minimizing a (regularized) KL divergence of both data views to it. In the first step, we compute soft assignments (i.e., a distribution over cluster assignments) for both views. The process stops when convergence is achieved.

Soft Assignment

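Following Xie et al. [Xie, Girshick, and Farhadi2016], we model the probability of data point $i$ being assigned to cluster $j$ using the Student's t-distribution [Maaten and Hinton2008], producing a distribution ($q_{ij}$ for images and $r_{ij}$ for text):

$$q_{ij} = \frac{\left(1 + \|z_i - \mu_j\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_{j'} \left(1 + \|z_i - \mu_{j'}\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}} \tag{1}$$

$$r_{ij} = \frac{\left(1 + \|z'_i - \mu'_j\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_{j'} \left(1 + \|z'_i - \mu'_{j'}\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}} \tag{2}$$

where $q_{ij}$ and $r_{ij}$ are the soft assignments for the image view and the text view, respectively, and $\alpha$ is the number of degrees of freedom of the Student's t-distribution. $z_i$ is the embedding in latent space $Z$ of data point $x_i$, which can be described as $z_i = f_{\theta}(x_i)$. $z'_i$ is the embedding in latent space $Z'$ of data point $x'_i$, which can be described as $z'_i = f_{\theta'}(x'_i)$. Following Xie et al., we set $\alpha$ to 1 because we are not able to tune it in an unsupervised manner.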
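The soft assignment described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `soft_assign`, the toy embeddings, and the toy centroids are all assumptions introduced here, and the same function would be applied to each view separately.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Student's t soft assignment of embedded points to centroids.

    z: (N, d) embedded points; centroids: (K, d) cluster centers.
    Returns an (N, K) array whose rows sum to 1.
    """
    # Squared Euclidean distance between every point and every centroid.
    sq_dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Unnormalized Student's t kernel with alpha degrees of freedom.
    kernel = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize over clusters so each row is a probability distribution.
    return kernel / kernel.sum(axis=1, keepdims=True)

# Toy data (hypothetical): 6 image embeddings and 3 centroids in 4-d.
rng = np.random.default_rng(0)
z_img = rng.normal(size=(6, 4))
mu = rng.normal(size=(3, 4))
q = soft_assign(z_img, mu)  # image-view soft assignments
```

With `alpha=1.0` the kernel reduces to `1 / (1 + sq_dist)`, which matches the setting of one degree of freedom used in the text.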

Aligning Image Clusters and Text Clusters

After calculating the soft assignments for both views, we need to align the two sets of clusters. Each data point is first assigned to its highest-probability cluster in each view (i.e., image $x_i$ is assigned to cluster $\arg\max_j q_{ij}$, and likewise for text using $r_{ij}$). Next, to align image clusters and text clusters, we use the Hungarian algorithm to find the minimum-cost assignment. We create a confusion matrix $C$ where an entry $C_{jk}$ represents the number of data points assigned to the $j$-th image cluster and the $k$-th text cluster. We then subtract each entry from the maximum value of the matrix to obtain the "cost," so that frequently co-occurring cluster pairs have low cost. The Hungarian algorithm is then applied to the cost matrix.
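The alignment step can be sketched with SciPy's `linear_sum_assignment`, which solves the same minimum-cost assignment problem as the Hungarian algorithm. The function name `align_text_to_image` and the toy soft assignments are assumptions made for illustration; the cost is taken as the matrix maximum minus each confusion count.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_text_to_image(q, r):
    """Return perm such that text cluster perm[j] matches image cluster j."""
    n_clusters = q.shape[1]
    img_labels = q.argmax(axis=1)  # hard assignment per image
    txt_labels = r.argmax(axis=1)  # hard assignment per caption
    # confusion[j, k] counts points in image cluster j and text cluster k.
    confusion = np.zeros((n_clusters, n_clusters), dtype=int)
    np.add.at(confusion, (img_labels, txt_labels), 1)
    # Large overlap should mean low cost.
    cost = confusion.max() - confusion
    row_ind, col_ind = linear_sum_assignment(cost)
    perm = np.empty(n_clusters, dtype=int)
    perm[row_ind] = col_ind
    return perm

# Toy example: the text view uses the same clusters under different indices.
q = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1], [0.2, 0.7, 0.1],
              [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]])
r = q[:, [2, 0, 1]]
perm = align_text_to_image(q, r)
r_aligned = r[:, perm]  # column j now corresponds to image cluster j
```

After reordering the columns of `r`, both views index the same cluster with the same column, which is what a joint target distribution over the two views requires.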

KL Divergence Minimization

Xie et al. trained DEC by minimizing the KL divergence between the soft assignment $q_{ij}$ and a target distribution $p_{ij}$ (presented below in Eq. 8), with the goal of purifying the clusters by learning from high-confidence assignments:

$$L = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{3}$$

DEC does not address the issue of trivial solutions and empty clusters, which occur frequently in clustering problems [Dizaji et al.2017, Caron et al.2018]. Dizaji et al. [Dizaji et al.2017] used a regularization term to penalize non-uniform cluster assignments. Following this concept, we define a target label distribution by averaging the joint target distribution over all $N$ data points:

$$h_j = P(y = j) = \frac{1}{N} \sum_i p_{ij} \tag{4}$$

where $h_j$ can be interpreted as the prior frequency of cluster $j$ in the joint target distribution. To impose a preference for balanced assignments, we add a term representing the KL divergence from a uniform distribution $u$. The regularized KL divergence is computed as

$$L = KL(P \,\|\, Q) + KL(h \,\|\, u) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} + \sum_j h_j \log \frac{h_j}{u_j} \tag{5}$$

where the first term minimizes the dissimilarity between the soft assignment and the joint target distribution, and the second term forces the model to prefer a balanced assignment. The uniform distribution can be replaced with another distribution if there is prior knowledge of the cluster frequencies.

MultiDEC is trained by matching the image distribution $q_{ij}$ to the joint distribution $p_{ij}$, and similarly for the text distribution $r_{ij}$:

$$L_{image} = KL(P \,\|\, Q) + KL(h \,\|\, u) \tag{6}$$

$$L_{text} = KL(P \,\|\, R) + KL(h \,\|\, u) \tag{7}$$
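A small numerical sketch of the regularized objective follows. It assumes the loss is the plain sum of the assignment KL term and the balance term against a uniform prior; the relative weighting of the two terms, the function name `regularized_kl`, and the toy distributions are assumptions for illustration, and the rows of `p` are assumed to sum to 1.

```python
import numpy as np

def regularized_kl(p, q, u=None, eps=1e-12):
    """KL(P || Q) plus KL(h || u), where h is the mean of p over data points."""
    n_clusters = p.shape[1]
    if u is None:
        # Uniform prior over clusters; replace if cluster frequencies are known.
        u = np.full(n_clusters, 1.0 / n_clusters)
    # First term: mismatch between target p and soft assignment q.
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps)))
    # Second term: penalty on unbalanced average cluster usage.
    h = p.mean(axis=0)
    kl_hu = np.sum(h * np.log((h + eps) / u))
    return kl_pq + kl_hu

# Balanced versus unbalanced targets, with q equal to p in both cases
# so that only the balance term contributes.
balanced = np.array([[0.9, 0.1], [0.1, 0.9]])
unbalanced = np.array([[0.9, 0.1], [0.9, 0.1]])
loss_balanced = regularized_kl(balanced, balanced)
loss_unbalanced = regularized_kl(unbalanced, unbalanced)
```

The same function would be evaluated once with the image soft assignments and once with the text soft assignments, each against the shared target.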


At this point, we have presented half of the iteration: how MultiDEC generates the soft assignments, and the objective function of the image and text models. Next, we compute a joint target distribution from both views.

| Dataset | # Points | # Categories | Average # words | % of largest class | % of smallest class |
|---|---|---|---|---|---|
| Coco-cross | 7429 | 10 | 50.5 | 23.2% | 1.6% |
| Coco-all | 23189 | 43 | 50.4 | 7.4% | 0.4% |
| Pascal | 1000 | 20 | 48.9 | 5.0% | 5.0% |
| RGB-D | 1449 | 13 | 38.5 | 26.4% | 1.7% |

Table 1: Dataset statistics.

Joint Target Distribution

Xie et al. propose a target distribution [Xie, Girshick, and Farhadi2016] which aims to improve cluster purity and to emphasize data points with high assignment confidence:

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}} \tag{8}$$

where $f_j = \sum_i q_{ij}$ is the soft cluster frequency.

To fit the model to the multi-view problem setting, we propose a joint target distribution:

$$p_{ij} = \frac{1}{2} \left( \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}} + \frac{r_{ij}^2 / g_j}{\sum_{j'} r_{ij'}^2 / g_{j'}} \right) \tag{9}$$

where $f_j = \sum_i q_{ij}$ and $g_j = \sum_i r_{ij}$ are the soft cluster frequencies for the image view and the text view, respectively. With this joint target distribution, MultiDEC is able to take both sources of information into account during training.

Some images do not have associated text; we want the model to be robust to this situation. Missing text causes the second term in Equation (9) to be 0, so the data points with text have a higher value of $p_{ij}$ and contribute a larger gradient to the model. We will discuss this issue in more detail in Section 5.
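The joint target can be sketched as the average of two DEC-style sharpened distributions, with the text term dropped for images that have no caption. The names `sharpen`, `joint_target`, and `has_text` are hypothetical, and computing the text cluster frequencies only over captioned points is an assumption of this sketch rather than something the text specifies. The text assignments are assumed to be already aligned to the image clusters.

```python
import numpy as np

def sharpen(q):
    """DEC-style target: square, divide by soft cluster frequency, renormalize."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def joint_target(q, r, has_text=None):
    """Average the sharpened image and text assignments.

    q, r: (N, K) soft assignments for image and (aligned) text views.
    has_text: boolean (N,) mask, False where the caption is missing.
    """
    if has_text is None:
        has_text = np.ones(q.shape[0], dtype=bool)
    text_term = np.zeros_like(q)
    # Rows without text keep a zero text term (assumption of this sketch).
    text_term[has_text] = sharpen(r[has_text])
    return 0.5 * (sharpen(q) + text_term)

# Toy example: the third image has no caption.
q = np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]])
r = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
has_text = np.array([True, True, False])
p = joint_target(q, r, has_text)
```

Rows with both views sum to 1, while the row for the caption-less image sums to 0.5, which is the smaller target mass the text describes for data points with missing text.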

4 Experiments

We evaluate our method with four datasets and compare to several single-view and multi-view algorithms. In these experiments, we aim to verify the effectiveness of MultiDEC on real datasets and to validate that MultiDEC outperforms single-view methods and state-of-the-art multi-view methods.

4.1 Datasets

To evaluate our method, we use datasets that have images with corresponding captions as well as ground-truth labels to define the clusters. Our proposed model is tested with four datasets from three different sources and compared against several single-view and multi-view algorithms. We summarize the dataset statistics in Table 1.

  • Coco-cross [Lin et al.2014]: MSCOCO is a large-scale object detection, segmentation, and captioning dataset. There are five sentences of captions per image. There are 80 object categories and all these categories belong to 10 supercategories. Every image includes at least one object. We discard images containing multiple objects. For this subset, we pick the category with the largest number of images in each supercategory, which are stop sign, airplane, suitcase, pizza, cell phone, person, giraffe, kite, toilet, and clock. There are 7,429 data points from these 10 categories in total.

  • Coco-all [Lin et al.2014]: For this subset of MSCOCO, similar to Coco-cross, we remove images with more than one object, while we keep all categories that include more than 100 images. We are able to compile a dataset with 23,189 images from 43 categories.

  • Pascal [Rashtchian et al.2010]: This dataset contains 1,000 images in 20 categories, with 50 images per category. Every image comes with a five-sentence caption.

  • SentencesNYUv2 (RGB-D) [Nathan Silberman and Fergus2012, Kong et al.2014]: This dataset includes 1,449 images of 13 indoor scene categories. Every image is captioned with a paragraph that describes the content of the image. Compared to the Coco and Pascal datasets, the captions in this dataset are less specific to the categories and significantly less reliable as a source of information.

| Method | Coco-cross (ACC / NMI / ARI) | Coco-all (ACC / NMI / ARI) | Pascal (ACC / NMI / ARI) | RGB-D (ACC / NMI / ARI) |
|---|---|---|---|---|
| Single-view (Image) | | | | |
| ResNet-50 + KM | 0.647 / 0.712 / 0.558 | 0.519 / 0.614 / 0.442 | 0.486 / 0.516 / 0.307 | 0.353 / 0.289 / 0.161 |
| ResNet-50 + DEC | 0.649 / 0.629 / 0.670 | 0.472 / 0.701 / 0.429 | 0.418 / 0.564 / 0.311 | 0.421 / 0.352 / 0.236 |
| Single-view (Text) | | | | |
| Doc2Vec + KM | 0.720 / 0.852 / 0.737 | 0.613 / 0.807 / 0.589 | 0.544 / 0.602 / 0.398 | 0.438 / 0.384 / 0.279 |
| Doc2Vec + DEC | 0.720 / 0.843 / 0.729 | 0.557 / 0.738 / 0.501 | 0.295 / 0.294 / 0.120 | 0.429 / 0.383 / 0.287 |
| Multi-view | | | | |
| VSE + KM | 0.665 / 0.736 / 0.607 | 0.520 / 0.628 / 0.430 | 0.479 / 0.508 / 0.300 | 0.388 / 0.318 / 0.194 |
| CCA + KM | 0.712 / 0.822 / 0.703 | 0.645 / 0.817 / 0.603 | 0.442 / 0.485 / 0.238 | 0.388 / 0.310 / 0.186 |
| CCAL- + KM | 0.699 / 0.806 / 0.689 | 0.641 / 0.812 / 0.587 | 0.446 / 0.489 / 0.224 | 0.404 / 0.316 / 0.196 |
| MultiDEC (ours) | 0.852 / 0.885 / 0.878 | 0.668 / 0.801 / 0.666 | 0.499 / 0.587 / 0.377 | 0.521 / 0.442 / 0.378 |

Table 2: Clustering performance of several single-view and multi-view algorithms on four datasets. The results reported are the average of 10 runs. MultiDEC outperforms the compared methods on three datasets by a large margin. The weak performance of the DNN models on the Pascal dataset might be caused by an insufficient amount of data.

We use ResNet-50 [He et al.2016], pretrained on the 1.2M-image ImageNet corpus [Deng et al.2009], to extract 2048-dimensional image features, and doc2vec [Le and Mikolov2014], pre-trained on Wikipedia via skip-gram, to embed captions and obtain text features. Recent studies have shown that image features embedded by ImageNet-pretrained models improve general image clustering tasks, and that ResNet-50 features are superior to representations extracted from other state-of-the-art CNN architectures [Guérin et al.2017, Guérin and Boots2018]. Doc2vec has also been shown to produce effective representations for long text paragraphs [Lau and Baldwin2016].

4.2 Competitive Methods

We compare our method to a variety of single-view and multi-view methods.

Single-view methods

We run two single-view methods to serve as baseline comparison.

  • K-means (KM) [Lloyd1982]: We apply K-means to both the image ResNet-50 features and the text doc2vec features to provide a general assessment of the input features. To reduce the high dimensionality that can cause clustering algorithms to produce empty clusters [Bellman1961, Steinbach, Ertöz, and Kumar2004], we reduce the 2048-d ResNet-50 image features to 50-d with Principal Component Analysis (PCA).

  • Deep Embedded Clustering (DEC) [Xie, Girshick, and Farhadi2016]: DEC simultaneously learns feature representations and cluster assignments of the data by minimizing the KL divergence between the data distribution and an auxiliary target distribution. We then apply K-means to the output. We show results for both text and image inputs.
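The dimensionality-reduction step used in the K-means baseline above can be sketched with an SVD-based PCA in NumPy. This is a generic illustration under stated assumptions: the function name `pca_reduce` is hypothetical, and random vectors stand in for real ResNet-50 features. K-means would then be run on the reduced features.

```python
import numpy as np

def pca_reduce(features, n_components=50):
    """Project features onto their top principal components."""
    # PCA operates on mean-centered data.
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Stand-in for 2048-d image features of 200 images (random, for illustration).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2048))
reduced = pca_reduce(features, n_components=50)
```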

Multi-view methods

We evaluate three state-of-the-art multi-view methods. Current methods for matching image and text models are based on minimizing ranking loss or maximizing CCA between text and image.

  • Unifying Visual-Semantic Embeddings (VSE) + K-means [Kiros, Salakhutdinov, and Zemel2014]: Kiros et al. proposed a pipeline which unifies joint image-text embedding models by minimizing a pairwise ranking loss. The joint embedding is a 1024-dimensional space, so we apply PCA to reduce the embedding to 50-d. K-means is then applied to the reduced embedding to obtain the cluster centroids.

  • Deep Canonical Correlation Analysis (DCCA) + K-means [Andrew et al.2013, Wang et al.2015]: This method learns complex nonlinear transformations of two views of data via maximizing the regularized total correlation. The cluster assignments are obtained with K-means on the joint representations.

  • Canonical Correlation Layer Optimized with Pairwise Ranking Loss (CCAL-) + K-means [Dorfer et al.2018]: This model includes a Canonical Correlation Analysis (CCA) layer which combines a pairwise ranking loss with the optimal projections of CCA. K-means is then applied to the learned embeddings.

4.3 Evaluation Metrics

All experiments are evaluated with three standard clustering metrics: clustering accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI). For all metrics, higher numbers indicate better performance.
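Clustering accuracy is commonly computed by finding the best one-to-one mapping between cluster indices and ground-truth labels and then measuring ordinary accuracy under that mapping. The sketch below assumes that definition, which the text does not spell out; the function name `clustering_accuracy` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of clusters to labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    # counts[c, k] is the number of points in cluster c with true label k.
    counts = np.zeros((n, n), dtype=int)
    np.add.at(counts, (y_pred, y_true), 1)
    # Turn the maximum-overlap matching into a minimum-cost assignment.
    rows, cols = linear_sum_assignment(counts.max() - counts)
    return counts[rows, cols].sum() / y_true.size

# A pure relabeling of the clusters gives perfect accuracy.
acc_perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
# One point placed in the wrong cluster lowers it.
acc_one_error = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 0, 0, 0])
```

Because cluster indices are arbitrary, the matching step is what makes the score invariant to how clusters happen to be numbered.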

We use hyperparameter settings following Xie et al. [Xie, Girshick, and Farhadi2016]. For the baseline algorithms, we use the settings reported in their respective papers. All results are the average of 10 trials.

4.4 Experimental Results

Table 2 displays the quantitative results for the different methods on the four datasets. MultiDEC outperforms the other methods by a noticeable margin on almost every dataset, with the exception of the Pascal dataset. All the DNN models perform poorly on the Pascal dataset. Our interpretation is that there is not sufficient data (1,000 data points, with 50 images per cluster) for the DNNs to converge. However, MultiDEC still surpasses the other DNN models on this dataset.

5 Discussion

In this section, we discuss additional experiments to expand on MultiDEC’s effectiveness.

5.1 Qualitative Comparison

Figure 2: Qualitative visualization of the latent representations of MultiDEC, DEC, and CCA. (Color encoding is based on ground-truth labels.) We can observe that MultiDEC successfully separates data points that overlap in the original latent space and generates semantically meaningful clusters. DEC has trouble separating kite from airplane and giraffe from pizza. CCA is able to gather semantically similar images, but its latent space is still difficult for clustering analysis, with unclear boundaries between clusters.

The cluster metrics are difficult to interpret, so we are interested in exploring a qualitative comparison between MultiDEC and the best single-view and multi-view competitors, DEC and CCA, respectively. Figure 2 is a visualization of the latent space of MultiDEC to illustrate its effectiveness in producing coherent clusters. We use t-SNE to visualize the embeddings from the latent space with perplexity = 30. The positions and shapes of the clusters are not meaningful due to the operation of t-SNE. Both DEC and MultiDEC are able to generate distinct clusters, but DEC appears to have many more false assignments. For example, DEC can struggle to differentiate giraffe from pizza and kite from airplane. CCA is able to gather semantically similar images together, but the cluster boundaries are much less distinct.

Figure 3: The 5 highest-confidence images in each cluster from MultiDEC (left), DEC (middle), and CCA (right). MultiDEC clusters appear qualitatively better. For example, airplanes and kites, two visually and semantically similar concepts, are clearly distinguished, while DEC appears to struggle to distinguish giraffes and pizza.

We further compare the three algorithms by inspecting examples from the clusters. Figure 3 shows the five images with the highest confidence from each cluster on the Coco-cross dataset. The ten categories in Coco-cross are stop sign, airplane, suitcase, pizza, cell phone, person, giraffe, kite, toilet, and clock. The figure shows that DEC clusters are not always coherent. For example, cluster #1 and cluster #7 both seem to include mostly airplane images, and cluster #0 and cluster #4 are both clock clusters. Cluster #9 from DEC is a fusion of giraffe and pizza, which are semantically very different. Our guess is that giraffe and pizza share similar colors (yellow) and patterns (spots on the giraffe body and toppings on the pizza). MultiDEC, on the other hand, is easily able to distinguish these objects, because the text descriptions expose their semantic differences. Surprisingly, MultiDEC is also able to distinguish airplane and kite, which are not only visually similar but also semantically related. However, we still observe some errors from MultiDEC: examples of suitcase and cell phone, which are visually similar, are assigned to the same cluster (cluster #1), and clock examples are separated into two clusters, clocks on towers (cluster #2) and indoor clocks (cluster #9). As we saw in Figure 2, the cluster boundaries are indistinct in the CCA latent space, and the qualitative results shown in Figure 3 corroborate this result. Several different clusters include similar objects; for example, both cluster #1 and cluster #9 include airplanes, and both cluster #0 and cluster #7 include giraffes.

5.2 Model Robustness

Model Sensitivity to Text Features

| Input Text Features | Coco-cross | Coco-all | Pascal | RGB-D |
|---|---|---|---|---|
| Doc2Vec | 0.822 | 0.668 | 0.499 | 0.521 |
| TF-IDF | 0.801 | 0.685 | 0.556 | 0.519 |
| FastText | 0.868 | 0.674 | 0.481 | 0.449 |

Table 3: Experimental results on model sensitivity to text feature vectors. MultiDEC maintains similar performance across different text embedding algorithms.

We use learned embeddings for text as input to the model. To examine the model’s sensitivity to the quality of the input text representations, we experiment with two other baseline text representations, TF-IDF and FastText [Bojanowski et al.2017]. TF-IDF ignores all co-occurrence relationships, and therefore has significantly less semantic content, so we expect performance to be worse. We produce a 2000-d vector for text features for each data point. FastText is a word embedding model known for fast training and competitive performance. We embed words with FastText and average all the word vectors [Joulin et al.2017] to represent paragraphs. Table 3 shows the results of MultiDEC trained using these different text features. MultiDEC produces similar results despite different text features, which demonstrates that the performance is the result of the MultiDEC algorithm itself rather than the quality of the input text features.

Figure 4: Experimental results on model robustness to missing text. The clustering accuracy holds even with very little text data, which verifies our hypothesis in the Method section.

Robustness to Missing Text

Incomplete views are a very common problem in multi-view clustering [Xu, Tao, and Xu2015]. In realistic settings, not all images will be equipped with text descriptions, and we want MultiDEC to degrade gracefully in these situations. To analyze the robustness of MultiDEC when text descriptions are missing, we remove text from a random set of images at varying rates. We expect that performance will degrade as we remove text labels; if it did not, then MultiDEC would not be making use of the information in the text labels. The results are shown in Figure 4, and we see that performance does indeed degrade, but not by a significant amount. This result shows that the joint target distribution can work with either or both sources of information (Equation 9). Images with missing text have a smaller value of $p_{ij}$ because the second term in Equation (9) is ignored, while images with captions have a larger value of $p_{ij}$ and contribute a larger gradient to the model.

Figure 5: MultiDEC performance when swapping the text labels for a random portion of the image-text pairs. The model performance remains high until over 60% of the input text is scrambled.

We also ran an experiment to test the robustness to noisy text by scrambling the image-text pairs, such that a given percentage of the images would be associated with the text of a different image. This change is more adversarial than missing text, as the incorrect labels could train the model to learn incorrect signals. The results are shown in Figure 5. The performance of MultiDEC remains steady until almost 60% of the text is perturbed, indicating that MultiDEC is robust to incorrect labels.
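The scrambling procedure can be sketched as follows. This is one possible way to guarantee that every selected image receives the text of a different image (a cyclic shift among the selected rows); the text does not say how pairs were swapped, so the function name `scramble_text`, the seed handling, and the shift strategy are all assumptions.

```python
import numpy as np

def scramble_text(text_features, fraction, seed=0):
    """Give a random fraction of images the text features of another image."""
    rng = np.random.default_rng(seed)
    n = text_features.shape[0]
    n_swap = int(round(fraction * n))
    # Indices of the image-text pairs to corrupt.
    chosen = rng.choice(n, size=n_swap, replace=False)
    scrambled = text_features.copy()
    if n_swap > 1:
        # Cyclic shift among the chosen rows: each chosen image receives
        # the text of a different chosen image.
        scrambled[chosen] = text_features[np.roll(chosen, 1)]
    return scrambled, chosen

# Toy text features where each row stores its own index, so swaps are visible.
text = np.arange(10, dtype=float).reshape(-1, 1)
scrambled, chosen = scramble_text(text, fraction=0.4)
```

Sweeping `fraction` from 0 to 1 and re-running clustering at each setting would produce a robustness curve of the kind described for Figure 5.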

6 Conclusion

We present MultiDEC, a method that learns representations from multi-modal image-text pairs for clustering analysis. MultiDEC consists of a pair of DEC models that take data from image and text, and works by iteratively computing a proposed joint target distribution and minimizing the KL divergence between the embedded data distribution and the computed joint target distribution. We also address the issue of empty clusters by adding a regularization term to our KL divergence loss. MultiDEC demonstrates superior performance on various datasets and outperforms single-view algorithms and state-of-the-art multi-view models. We further examine the robustness of MultiDEC to the choice of input text features and to missing and noisy text. Our experimental results indicate that MultiDEC is a promising model for image-text pair clustering.


  • [Andrew et al.2013] Andrew, G.; Arora, R.; Bilmes, J.; and Livescu, K. 2013. Deep canonical correlation analysis. In ICML.
  • [Antol et al.2015] Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Lawrence Zitnick, C.; and Parikh, D. 2015. VQA: Visual question answering. In ICCV.
  • [Bellman1961] Bellman, R. 1961. Adaptive control processes: a guided tour.
  • [Bojanowski et al.2017] Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.
  • [Caron et al.2018] Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. Deep clustering for unsupervised learning of visual features. In ECCV.
  • [Carvalho et al.2018] Carvalho, M.; Cadène, R.; Picard, D.; Soulier, L.; Thome, N.; and Cord, M. 2018. Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. In SIGIR.
  • [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
  • [Dizaji et al.2017] Dizaji, K. G.; Herandi, A.; Deng, C.; Cai, W.; and Huang, H. 2017. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In ICCV.
  • [Dorfer et al.2018] Dorfer, M.; Schlüter, J.; Vall, A.; Korzeniowski, F.; and Widmer, G. 2018. End-to-end cross-modality retrieval with cca projections and pairwise ranking loss. International Journal of Multimedia Information Retrieval 7(2):117–128.
  • [Frome et al.2013] Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.; Mikolov, T.; et al. 2013. Devise: A deep visual-semantic embedding model. In NIPS.
  • [Grechkin, Poon, and Howe2018] Grechkin, M.; Poon, H.; and Howe, B. 2018. Ezlearn: Exploiting organic supervision in large-scale data annotation. In IJCAI.
  • [Guérin and Boots2018] Guérin, J., and Boots, B. 2018. Improving image clustering with multiple pretrained cnn feature extractors. In BMVC.
  • [Guérin et al.2017] Guérin, J.; Gibaru, O.; Thiery, S.; and Nyiri, E. 2017. Cnn features are also great at unsupervised classification. arXiv preprint arXiv:1707.01700.
  • [Guo et al.2017] Guo, X.; Liu, X.; Zhu, E.; and Yin, J. 2017. Deep clustering with convolutional autoencoders. In ICONIP.
  • [Hardoon, Szedmak, and Shawe-Taylor2004] Hardoon, D. R.; Szedmak, S.; and Shawe-Taylor, J. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural computation 16(12):2639–2664.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
  • [Jin et al.2015] Jin, C.; Mao, W.; Zhang, R.; Zhang, Y.; and Xue, X. 2015. Cross-modal image clustering via canonical correlation analysis. In AAAI.
  • [Jolliffe2011] Jolliffe, I. 2011. Principal component analysis. In International encyclopedia of statistical science. 1094–1096.
  • [Joulin et al.2017] Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2017. Bag of tricks for efficient text classification. In EACL.
  • [Karpathy and Fei-Fei2015] Karpathy, A., and Fei-Fei, L. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.
  • [Kiros, Salakhutdinov, and Zemel2014] Kiros, R.; Salakhutdinov, R.; and Zemel, R. S. 2014. Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics.
  • [Kong et al.2014] Kong, C.; Lin, D.; Bansal, M.; Urtasun, R.; and Fidler, S. 2014. What are you talking about? text-to-image coreference. In CVPR.
  • [Lau and Baldwin2016] Lau, J. H., and Baldwin, T. 2016. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368.
  • [Le and Mikolov2014] Le, Q., and Mikolov, T. 2014. Distributed representations of sentences and documents. In ICML.
  • [Le2013] Le, Q. V. 2013. Building high-level features using large scale unsupervised learning. In ICASSP.
  • [Lee et al.2017] Lee, P.-s.; Yang, S. T.; West, J. D.; and Howe, B. 2017. Phyloparser: A hybrid algorithm for extracting phylogenies from dendrograms. In ICDAR.
  • [Lin et al.2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In ECCV.
  • [Lloyd1982] Lloyd, S. 1982. Least squares quantization in pcm. IEEE transactions on information theory 28(2):129–137.
  • [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-sne. Journal of machine learning research.
  • [Mao et al.2015] Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Huang, Z.; and Yuille, A. 2015. Deep captioning with multimodal recurrent neural networks (m-rnn). In ICLR.
  • [Nathan Silberman and Fergus2012] Silberman, N.; Hoiem, D.; Kohli, P.; and Fergus, R. 2012. Indoor segmentation and support inference from rgbd images. In ECCV.
  • [Rashtchian et al.2010] Rashtchian, C.; Young, P.; Hodosh, M.; and Hockenmaier, J. 2010. Collecting image annotations using amazon’s mechanical turk. In NAACL.
  • [Sethi et al.2018] Sethi, A.; Sankaran, A.; Panwar, N.; Khare, S.; and Mani, S. 2018. Dlpaper2code: Auto-generation of code from deep learning research papers. In AAAI.
  • [Steinbach, Ertöz, and Kumar2004] Steinbach, M.; Ertöz, L.; and Kumar, V. 2004. The challenges of clustering high dimensional data. In New directions in statistical physics. 273–309.
  • [Tsai, Huang, and Salakhutdinov2017] Tsai, Y.-H. H.; Huang, L.-K.; and Salakhutdinov, R. 2017. Learning robust visual-semantic embeddings. In ICCV.
  • [Vincent et al.2010] Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11(Dec):3371–3408.
  • [Wang et al.2015] Wang, W.; Arora, R.; Livescu, K.; and Bilmes, J. 2015. On deep multi-view representation learning. In ICML.
  • [Wang, Li, and Lazebnik2016] Wang, L.; Li, Y.; and Lazebnik, S. 2016. Learning deep structure-preserving image-text embeddings. In CVPR.
  • [Xie, Girshick, and Farhadi2016] Xie, J.; Girshick, R.; and Farhadi, A. 2016. Unsupervised deep embedding for clustering analysis. In ICML.
  • [Xu et al.2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2048–2057.
  • [Xu, Tao, and Xu2015] Xu, C.; Tao, D.; and Xu, C. 2015. Multi-view learning with incomplete views. IEEE Transactions on Image Processing 24(12):5812–5825.
  • [Yan and Mikolajczyk2015] Yan, F., and Mikolajczyk, K. 2015. Deep correlation for matching images and text. In CVPR.
  • [Yang et al.2017] Yang, B.; Fu, X.; Sidiropoulos, N. D.; and Hong, M. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering.
  • [Yang, Parikh, and Batra2016] Yang, J.; Parikh, D.; and Batra, D. 2016. Joint unsupervised learning of deep representations and image clusters. In CVPR.