The rapid development of image search engines, photo-sharing web sites, and desktop photo management tools has made it easy for people to access large numbers of images. However, image collections are usually unorganized, which makes it difficult to find desired photos or to get a quick overview of a collection. This unstructured nature of image collections has motivated considerable effort on computing visual summaries. On the other hand, most image collections come with rich text information; we call such collections tagged image collections in this paper. For example, images on Flickr are titled, tagged, and commented on by users, and images from the Web are often associated with surrounding text. The text information usually reflects the semantic content of images and is helpful for summarization.
In this paper, we address the image management task through a hybrid summarization scheme. The key is to find the summary in a hybrid way that exploits both visual and textual information. An example is shown in Fig. 1. Given the rich tag information associated with images, there are three useful relations over images and tags: two homogeneous relations, namely image similarity and tag similarity, and one heterogeneous relation between images and tags, i.e., their association relations. We propose a hybrid summarization approach that finds image exemplars by investigating all three relations together, so that the information from the associated tags (the association relations between images and tags and the relations within tags) makes the summary both visually and semantically satisfactory.
1.1 Related work
Most existing image sharing web sites present an overview of an image collection either by showing the top images (e.g., Flickr groups), which does not give a good summary, or by allowing consumers to manually select images (e.g., Picasa web albums), which is inconvenient, particularly when the number of images is large.
Rother et al.  summarize a set of images with a “digital tapestry”. They synthesize a large output image from a set of input images, stitching together salient and spatially compatible blocks from the input images. Wang et al.  create a “picture collage”, a 2D spatial arrangement of the images in the input set chosen to maximize visible salient regions. These works do not address the problem of selecting the set of images to appear in the summary.
Recently, a few works have addressed the selection problem. Simon et al. select a set of canonical views using a greedy k-means algorithm that examines the distribution of images based only on visual features, without exploiting the associated tags. Raguram and Lazebnik select iconic images to summarize general visual categories using a simple joint clustering technique that considers both appearance and semantics. Their method first obtains two independent clusterings from the visual and textual features, respectively, and then takes their intersection to get the final clustering, so the joint process is sequential rather than simultaneous. Surrounding texts have been exploited to a limited extent for image grouping [9, 14] by considering the association relations between words and images with a co-clustering technique, but without investigating the interior relations among tags.
Image summarization is also studied in the information retrieval community. Clough et al.  construct a hierarchy of images using only textual caption data, and the concept of subsumption. Schmitz  uses a similar approach but relies on Flickr tags. Jaffe et al.  summarize a set of images using only tags and geotags. By detecting correlations between tags and geotags, they are able to produce tag maps, where tags and related images are overlaid on a geographic map at a scale corresponding to the range over which the tag commonly appears. All these approaches could be used to further organize the images. However, none of them exploits the visual information.
1.2 Our approach
In this paper, we present a hybrid summarization approach that finds a few image exemplars to represent the image collection in a way that is both visually and semantically satisfactory. Toward this end, we propose an effective scalar hybrid message propagation scheme over images and tags, homogeneous and heterogeneous message propagation, which simultaneously exploits the homogeneous relations within images and within tags and the heterogeneous relations between images and tags. It goes beyond the affinity propagation (AP) algorithm, which can only handle homogeneous data, and can effectively exploit the heterogeneous relations between images and tags as well as the interior relations within tags. Moreover, the reduction from vector messages to scalar messages is more challenging than in AP because the proposed scheme contains additional heterogeneous relations. Besides, our approach is superior to existing co-clustering algorithms [6, 7], which only utilize the heterogeneous relations, because 1) it directly obtains the exemplars instead of requiring a clustering procedure and a separate postprocessing step to find the centers, and 2) it takes advantage of the homogeneous relations within images and tags as well as the heterogeneous relations between them.
Given a set of images, , a set of corresponding texts, , , and the union set of tags , we aim to find a summary, a set of image exemplars, . There are three types of relations over images and tags as depicted in Fig. 2. The heterogeneous relations between all pairs of associated images and tags are represented by the edges, . The homogeneous relations within images are represented by the edges, , and the similarity between a pair of images and is denoted by . The homogeneous relations within tags are represented by the edges, , and the similarity between a pair of tags and is represented by .
Let the set of image exemplars to be identified be denoted as , where is the index of the exemplar of image , and is a label vector over images. If such a label vector satisfies the validity constraint that an image must serve as its own exemplar whenever it is the exemplar of any other image, it uniquely corresponds to a set of image exemplars. In other words, identifying the exemplars can be viewed as searching over valid label vectors.
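The validity constraint on a label vector can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names `is_valid_labeling` and `exemplars_of`, the 0-based indexing, and the toy label vectors are hypothetical and not part of the original formulation.

```python
from typing import List, Set


def is_valid_labeling(labels: List[int]) -> bool:
    """True if every index used as an exemplar also selects itself."""
    return all(labels[k] == k for k in set(labels))


def exemplars_of(labels: List[int]) -> Set[int]:
    """Return the exemplar set encoded by a valid label vector."""
    if not is_valid_labeling(labels):
        raise ValueError("label vector violates the validity constraint")
    return set(labels)


# Hypothetical toy collections (0-based image indices).
valid = [0, 0, 3, 3, 0]    # images 0 and 3 are exemplars and select themselves
invalid = [1, 2, 2, 2]     # image 1 is selected by image 0 but selects image 2
```

Under this encoding, the exemplar set is simply the set of distinct entries of a valid label vector, which is why searching over valid labels is equivalent to searching over exemplar sets.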
2 Affinity propagation
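To make our approach easier to follow, we first review affinity propagation (AP) . AP finds a good subset of exemplars for a set of homogeneous data points (here, images) by considering all data points as candidate exemplars. Writing $\mathbf{c} = (c_1, \dots, c_n)$ for the label vector, where $c_i$ is the index of the exemplar of image $i$, and $s(i, k)$ for the similarity between images $i$ and $k$, AP is formulated as maximizing the objective function
$$S(\mathbf{c}) = E(\mathbf{c}) + V(\mathbf{c}), \tag{1}$$
where $E$ is a fitting function that evaluates how well the image exemplars represent the other images, written as
$$E(\mathbf{c}) = \sum_{i=1}^{n} s(i, c_i),$$
and $V$ is a valid configuration function that forces an image to select itself as its exemplar if it is selected as the exemplar of another data point, formulated as
$$V(\mathbf{c}) = \sum_{k=1}^{n} \delta_k(\mathbf{c}), \qquad \delta_k(\mathbf{c}) = \begin{cases} -\infty, & \text{if } c_k \neq k \text{ but } \exists\, i: c_i = k, \\ 0, & \text{otherwise.} \end{cases}$$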
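As a hedged illustration of this objective, the sketch below evaluates the fitting term plus the validity term for a candidate label vector. It assumes the standard AP convention that the diagonal of the similarity matrix holds the preference of each point being an exemplar; the names `ap_objective`, `sim`, and `labels`, and the toy similarities, are hypothetical.

```python
import math
from typing import List, Sequence


def ap_objective(sim: Sequence[Sequence[float]], labels: List[int]) -> float:
    """Fitting term plus validity term for a candidate label vector."""
    # Validity term: minus infinity if a chosen exemplar does not choose itself.
    if any(labels[k] != k for k in set(labels)):
        return -math.inf
    # Fitting term: similarity of each point to its exemplar. The diagonal
    # entry sim[k][k] acts as the preference of point k being an exemplar.
    return sum(sim[i][labels[i]] for i in range(len(labels)))


# Hypothetical similarities for three points; points 0 and 1 are close.
sim = [[-1.0, -0.2, -5.0],
       [-0.2, -1.0, -4.0],
       [-5.0, -4.0, -1.0]]
one_cluster = ap_objective(sim, [0, 0, 0])    # -1.0 - 0.2 - 5.0
two_clusters = ap_objective(sim, [0, 0, 2])   # -1.0 - 0.2 - 1.0
invalid = ap_objective(sim, [1, 0, 2])        # 0 and 1 pick each other
```

In this toy case the two-exemplar configuration scores higher than the single-exemplar one, and the configuration in which points 0 and 1 select each other is ruled out by the validity term.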
This objective function can be depicted by a factor graph over the variables , and the function terms and , which corresponds to the subgraph in the dashed green box in Fig. 3. AP maximizes Eqn. (1) by a scalar message propagation algorithm derived from the max-sum algorithm, which transmits two vector messages between and . The message, , sent from to , consists of real numbers, one for each possible value of . The message, , sent from to , also consists of real numbers. For simplicity, the subscript may be dropped in the following presentation. The two messages are depicted in Figs. 4(a) and 4(b), neglecting the messages corresponding to the red edges, and are formulated as follows,
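As shown in , this vector-message propagation scheme can be reduced to a scalar-message propagation scheme between data points. Two kinds of messages are exchanged between image points. The “responsibility” $r(i, k)$, sent from data point $i$ to data point $k$, reflects how well $k$ serves as the exemplar of $i$ considering other potential exemplars for $i$. The “availability” $a(i, k)$, sent from data point $k$ to data point $i$, reflects how appropriate it is for $i$ to choose $k$ as its exemplar considering other potential points that may choose $k$ as their exemplar. With $s(i, k)$ denoting the similarity between points $i$ and $k$, the messages are updated iteratively as
$$r(i, k) \leftarrow s(i, k) - \max_{k' \neq k} \left[ a(i, k') + s(i, k') \right],$$
$$a(i, k) \leftarrow \min\Big\{0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0, r(i', k)\}\Big\}, \quad i \neq k,$$
$$a(k, k) \leftarrow \sum_{i' \neq k} \max\{0, r(i', k)\}.$$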
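The scalar updates of standard affinity propagation can be sketched in a few lines of pure Python. This is a minimal sketch of the original AP algorithm only, not of the hybrid scheme proposed in this paper. The damping factor, the fixed iteration count, the function name `affinity_propagation`, and the one-dimensional toy data are assumptions made for illustration.

```python
from typing import List


def affinity_propagation(sim: List[List[float]], iters: int = 200,
                         damping: float = 0.5) -> List[int]:
    """Standard AP on a similarity matrix whose diagonal holds preferences."""
    n = len(sim)
    r = [[0.0] * n for _ in range(n)]   # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]   # availabilities a(i, k)
    for _ in range(iters):
        # Responsibility: s(i,k) minus the best competing a(i,k') + s(i,k').
        for i in range(n):
            scores = [a[i][k] + sim[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                new = sim[i][k] - best_other
                r[i][k] = damping * r[i][k] + (1 - damping) * new
        # Availability: self-responsibility of k plus positive support from others.
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            support = sum(pos) - pos[k]          # sum over i' != k
            for i in range(n):
                if i == k:
                    new = support
                else:
                    new = min(0.0, r[k][k] + support - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    # Each point picks the candidate with the largest a(i,k) + r(i,k).
    return [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]


# Hypothetical toy data: two well-separated groups on a line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
sim = [[-(p - q) ** 2 for q in points] for p in points]
off_diag = sorted(sim[i][k] for i in range(6) for k in range(6) if i != k)
for i in range(6):
    sim[i][i] = off_diag[len(off_diag) // 2]   # median-like preference
labels = affinity_propagation(sim)
```

Damping is a common practical addition that mixes each new message with its previous value to reduce oscillation; it does not change the fixed points of the updates.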
3 Hybrid summarization
Affinity propagation has been shown to be very effective at finding exemplars , but it can only handle homogeneous data points. In the following, we present a new hybrid message propagation approach, homogeneous and heterogeneous message propagation, which generalizes AP to heterogeneous data points. We apply it to find image exemplars for a tagged image collection, with the advantage of exploiting, in a hybrid way, not only the visual information of images but also the heterogeneous relations between images and their associated tags and the interior similarities among tags.
We exploit the associated tag information for image identification by augmenting image exemplar identification with tag exemplar identification and bridging image exemplars and tag exemplars according to association relations between images and tags. Thus, we define a label vector over tags to represent the tag exemplars, i.e., selects as its exemplar. Basically, the proposed approach is modeled according to the following two properties: 1) these image and tag exemplars are good representatives of images and tags, respectively, and 2) these image and tag exemplars reflect their association relations. The second property investigates the association relations between images and tags and also serves as a bridge to make use of the relations within tags.
The first property concerns how well image and word exemplars represent the other images and tags if only the homogeneous relations are taken into consideration. For images, this is formulated as in Eqn. (1), and for tags we can get a similar formulation,
The second property essentially investigates the effect of the heterogeneous relations between images and tags on exemplar identification, and serves as a bridge that lets tag information help image exemplar identification. We would like to assign different preferences to a connected image-tag pair according to whether the image or the tag selects itself as its exemplar. This effect is formulated as a function over a pair of image and tag and their exemplars , . The whole affect function is written as follows,
where aims to set different weights according to whether is equal to and whether is equal to ,
In this paper, we instantiate this affect function similarly to the Ising model, based on the following consideration: if an image is selected as an exemplar, the tags linked to this image should have a higher probability of being selected as exemplars, and vice versa. In other words, we assign a higher penalty to a pair of related image and tag when they do not agree, i.e., when it is not the case that both select themselves as their exemplars or that both select others as their exemplars. Specifically, we set and to negative values and set and to zero.
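The Ising-like behavior described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact instantiation used in the paper: the function names, the single shared `mismatch_penalty` value, and the toy labels and edges are hypothetical.

```python
from typing import List, Tuple


def pair_affect(image_is_exemplar: bool, tag_is_exemplar: bool,
                mismatch_penalty: float = -1.0) -> float:
    """Ising-like score for one linked image/tag pair."""
    if mismatch_penalty > 0:
        raise ValueError("the mismatch penalty should not be positive")
    # Agreement (both exemplars, or neither) costs nothing;
    # disagreement receives the negative penalty.
    return 0.0 if image_is_exemplar == tag_is_exemplar else mismatch_penalty


def total_affect(image_labels: List[int], tag_labels: List[int],
                 edges: List[Tuple[int, int]],
                 mismatch_penalty: float = -1.0) -> float:
    """Sum the pairwise scores over all image-tag association edges."""
    return sum(
        pair_affect(image_labels[i] == i, tag_labels[j] == j, mismatch_penalty)
        for i, j in edges
    )


# Hypothetical example: images 0 and 2 and tag 0 are exemplars.
image_labels = [0, 0, 2]
tag_labels = [0, 0]
edges = [(0, 0), (1, 1), (2, 1)]   # (image index, tag index)
score = total_affect(image_labels, tag_labels, edges)
```

Only the edge between image 2 (an exemplar) and tag 1 (not an exemplar) disagrees in this toy example, so it is the only edge that contributes a penalty.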
For , a similar valid constraint is defined as
and is defined similarly to .
In summary, the overall objective function is as follows
where and are balance weights, and . Maximizing Eqn. (23) also yields tag exemplars as a byproduct; our approach mainly uses them as a bridge to exploit tag information to help find image exemplars. We depict Eqn. (23) using a factor graph in Fig. 3. Each term in Eqn. (23) is represented by a function node and each label (or ) is represented by a variable node. Edges exist only between function and variable nodes, and a function node is connected to a variable node iff its corresponding term depends on the variable. The heterogeneous relations serve as a bridge connecting the two factor graphs over images and tags.
3.1 message propagation
This section presents our proposed scalar message propagation algorithm, homogeneous and heterogeneous message propagation, which transmits hybrid messages over image and tag nodes to maximize the objective function in Eqn. (23). The algorithm starts from the max-sum scheme and transforms the vector message propagation into scalar message propagation, which makes the algorithm very fast.
Naive vector message propagation
We first present the naive vector message propagation algorithm. For simplicity, we only give the messages on the image side, and the messages on the tag side are similar. There are two vector messages between and , and additionally another message from the heterogeneous relation node . The two vector messages are depicted in Figs. 4(a) and 4(b), and formulated as follows,
Different from affinity propagation, two additional vector-valued messages are exchanged between and . The message, , sent from variable to , consists of n real numbers, one for each possible value of . The message, , sent from variable to , also consists of real numbers. The two messages are depicted in Figs. 4(c) and 4(d), and formulated as follows,
One core contribution of this paper is to reduce the above vector-valued messages to scalar-valued messages. The derivation is generalized from , but it is nontrivial and more challenging because our problem involves heterogeneous relations that cannot be simplified directly using the derivation in . Due to space limitations, we omit the detailed derivation from vector messages to scalar messages; it is given in the supplementary material. As a result, the algorithm views each image or tag as a node in a network, and recursively transmits scalar-valued messages along the edges of the network until a good set of image and tag exemplars emerges. It differs from the original affinity propagation algorithm  in that it transmits not only the homogeneous messages within images and tags, namely responsibility and availability, depicted in Figs. 5(a) and 5(b), but also the heterogeneous messages between images and tags, namely discardability and contributability, depicted in Figs. 5(c) and 5(d). In the following, we present the four kinds of messages. For convenience, we only present the homogeneous messages over images, as the messages over tags are similar, and we present the heterogeneous messages from the image side, as the messages on the tag side can be obtained similarly. For simplicity of presentation, we drop the subscripts .
Homogeneous message propagation
The “responsibility” and “availability” messages in the proposed scheme are updated as follows,
The key difference between these two messages and those of the original affinity propagation lies in the responsibility , which involves the heterogeneous message, i.e., the “contributability” message from tag to image . This message serves as a bridge through which the effect of tags is transmitted to images. During the iterations, becomes relatively larger when the probability of tag being an exemplar becomes larger, and becomes smaller otherwise. Looking at Eqn. (37), we can observe that the contributability message takes effect when , which means that it affects the preference of image being an exemplar. Hence, the probability that image selects itself as its exemplar increases monotonically with the probability that the tags linked to image serve as exemplars.
Table 1: Relations exploited by the compared approaches (AP/k-means, BGP, TGP, and joint clustering).
Heterogeneous message propagation
Two kinds of messages are exchanged between images and tags. The “discardability” , sent from image to tag , reflects how much the decision of image to select itself as its exemplar is affected when the contribution of tag is discarded, and helps tag make a better decision on whether to select itself as its exemplar. The “contributability” , sent from tag to image , reflects how well image serves as an exemplar considering whether tag is an exemplar. The two messages are updated as
Here, in Eqn. (39), is the belief that image selects itself as its exemplar, and evaluates the degree of the effect when the contribution from tag to image is discarded, which helps tag make a better decision on whether to select itself as its exemplar. In evaluating the contributability message from tag to image in Eqn. (40), is the belief that tag selects itself as its exemplar without considering the contribution from image , and evaluates the contribution from tag to the probability that image serves as an exemplar. essentially measures the degree to which image serves as an exemplar, whether or not tag serves as an exemplar. Similarly, measures the degree to which image does not serve as an exemplar, whether or not tag serves as an exemplar. Their difference, called contributability, hence evaluates how well image serves as an exemplar considering the contribution from tag . means a positive contribution from tag , and a negative contribution otherwise.
The belief that image selects image as its exemplar is derived as the sum of the incoming messages,
Then the exemplar of image is taken as
It should be noted that the heterogeneous relations are implicitly involved in assigning the exemplars, because the responsibility already accounts for the contribution from tags, as indicated in Eqn. (36) and Eqn. (37).
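The assignment step can be sketched as an argmax over beliefs. The sketch assumes, as in standard affinity propagation, that the belief for image i choosing candidate k is the sum of the availability and the responsibility for that pair, with any tag contribution already folded into the responsibility. The function name `assign_exemplars` and the toy message values are hypothetical.

```python
from typing import List


def assign_exemplars(avail: List[List[float]],
                     resp: List[List[float]]) -> List[int]:
    """Pick, for each image, the candidate with the largest belief."""
    n = len(avail)
    labels = []
    for i in range(n):
        # Belief of image i choosing candidate k: sum of incoming messages.
        belief = [avail[i][k] + resp[i][k] for k in range(n)]
        labels.append(max(range(n), key=belief.__getitem__))
    return labels


# Hypothetical message values for two images.
avail = [[0.0, -1.0], [-2.0, 3.0]]
resp = [[-5.0, 4.0], [1.0, 0.5]]
labels = assign_exemplars(avail, resp)
```

Here both images end up choosing image 1, since its belief (3.0 for image 0 and 3.5 for image 1) exceeds that of image 0 in both rows.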
To summarize, the proposed algorithm is iterative: at the beginning, all eight kinds of messages are initialized to 0, and the eight messages are then repeatedly updated until the iteration number reaches or the identified exemplars no longer change. The algorithm is described in the following,
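The stopping rule can be sketched independently of the message formulas. The driver below is a generic illustration: `update` stands for one round of message updates and `decode` for the exemplar assignment, both supplied by the caller. The names, the `patience` parameter (how many consecutive unchanged assignments count as convergence), and the toy usage are assumptions, not part of the original algorithm description.

```python
from typing import Callable, List, TypeVar

State = TypeVar("State")


def iterate_until_stable(state: State,
                         update: Callable[[State], State],
                         decode: Callable[[State], List[int]],
                         max_iters: int = 100,
                         patience: int = 1) -> List[int]:
    """Update messages until exemplars stop changing or max_iters is hit."""
    labels = decode(state)
    unchanged = 0
    for _ in range(max_iters):
        state = update(state)
        new_labels = decode(state)
        # Count consecutive iterations with an identical assignment.
        unchanged = unchanged + 1 if new_labels == labels else 0
        labels = new_labels
        if unchanged >= patience:
            break
    return labels


# Toy usage: the "state" is a counter whose decoded label saturates at 3.
final = iterate_until_stable(0, lambda s: s + 1, lambda s: [min(s, 3)],
                             max_iters=50)
```

In practice a larger `patience` may be preferable, since with all messages initialized to zero the assignment can stay unchanged for an iteration before the messages have settled.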
3.2 Analysis and discussion
This subsection presents the time complexity analysis and discusses the relations of our approach with several existing approaches.
A naive implementation of the algorithm would take with the iteration number. By reusing some computations (e.g., in Eqn. (36), in Eqn. (37), and in Eqn. (38) are computed only once for each in one iteration), the time complexity of our implementation is reduced to with being the edge number.
Our solution to hybrid image summarization differs from the two previous representative techniques by Simon et al.  and Raguram and Lazebnik  in the following aspects. Simon et al. compute the visual summary by greedy k-means using only the visual information, without exploring the useful associated textual information. Raguram and Lazebnik use a joint clustering method, which first obtains two independent clusterings from the visual and textual features, respectively, and then takes their intersection to get the final clustering; the joint process is thus sequential rather than simultaneous as in our approach.
The proposed approach is capable of exploiting both the relations within images and tags and the relations between images and tags. Most related approaches are only able to capture part of these relations. For example, AP (affinity propagation ) can only exploit homogeneous relations over images and words separately, BGP (bipartite graph partitioning ) can only exploit heterogeneous relations between images and tags, and TGP (tripartite graph partitioning ) uses the visual features in addition to the heterogeneous relations between images and tags to help find the grouping, without exploring the interior relations within tags. The comparison is summarized in Tab. 1.
Similar to AP , the self-similarity of an image , i.e., the preference of an image being an exemplar, is set as with being the median image similarity. is useful to control the exemplar number. For tags, we adopt the WordNet similarity , a variety of semantic similarity and relatedness measures based on a large lexical database of English, WordNet . The self-similarities of words are similarly set.
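Setting the preferences from the median similarity can be sketched as below. The multiplicative `scale` factor is an assumption introduced here to show how the number of exemplars might be controlled; the function name `set_preferences` and the toy matrix are likewise hypothetical.

```python
from statistics import median
from typing import List


def set_preferences(sim: List[List[float]], scale: float = 1.0) -> List[List[float]]:
    """Return a copy of sim whose diagonal is scale * median(off-diagonal)."""
    n = len(sim)
    off_diag = [sim[i][k] for i in range(n) for k in range(n) if i != k]
    pref = scale * median(off_diag)
    return [[pref if i == k else sim[i][k] for k in range(n)] for i in range(n)]


# Hypothetical negative-distance similarities for three images.
sim = [[0.0, -1.0, -2.0],
       [-1.0, 0.0, -3.0],
       [-2.0, -3.0, 0.0]]
with_pref = set_preferences(sim, scale=2.0)   # median is -2.0, so diagonal is -4.0
```

With negative similarities, a larger `scale` makes the preference more negative, which generally yields fewer exemplars; a smaller one yields more.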
Let’s turn to the setting of and in Eqn. (23). Looking at Eqn. (37), we observed that the heterogeneous relations essentially adjust the preference of image , , through the contributability from tag to , and hence it is expected that is comparable with the preference . Furthermore, we observed that is computed from in Eqn. (40) and sent from tag to image is computed from the belief of word being an exemplar that is related to tag similarities in Eqn. (39). Thus, to make in the heterogeneous relations easily tuned, which may benefit from the comparable preferences of tag and image, in our experiment we fix and .
For in the heterogeneous relations , we set , and , where is a constant negative value, fixed as 15 in this paper, that controls the degree of mutual influence between image and tag exemplar identification, and the division by the number of tags connected to image , , aims to distribute its effect evenly over the connected tags.
In our experiments, we compare the performance of our approach with several related approaches. The image collection consists of about 11k images with associated tags, crawled from the popular photo sharing web site Flickr using queries including flower, city, building, dog, cat, plants, mountain, river, sunset, and so on. We filter out noisy tags that are associated with only a few images and finally obtain 816 tags. On average, each image has 6.1 tags and each tag is assigned to 15.9 images. As the image feature, we extract a GIST scene descriptor , which has been shown to work well for scene categorization, with 3 by 3 spatial resolution, where each bin contains that image region’s average response to steerable filters at 6 orientations and 3 scales, and we use the negative Euclidean distance as the image similarity.
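The image similarity can be sketched as a negative Euclidean distance between feature vectors. The function name and the two-dimensional toy features are hypothetical; real GIST descriptors would be much longer vectors, and the diagonal would subsequently be replaced by the preferences.

```python
import math
from typing import List, Sequence


def negative_euclidean_similarity(
        features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise similarity defined as minus the Euclidean distance."""
    n = len(features)
    return [[-math.dist(features[i], features[k]) for k in range(n)]
            for i in range(n)]


# Hypothetical 2-D features standing in for image descriptors.
features = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
sim = negative_euclidean_similarity(features)
```

Because distances are non-negative, all similarities are at most zero, and more similar images have values closer to zero.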
We investigate the performance from both the visual and semantic perspectives. In the literature on image summarization and clustering, most evaluation criteria use the class labels of the images to test the performance. However, they are not applicable to our hybrid summarization because it has multiple objectives, visual and semantic, and no simple labels can be applied. Instead, we present two straightforward measures, visual exemplarness and semantic exemplarness. Visual exemplarness is defined as the average value of visual similarities between each image and its corresponding exemplar, and semantic exemplarness is defined as the average value of textual similarities between the associated tags of each image and its corresponding exemplar.
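The two measures can be sketched as follows. Visual exemplarness follows the definition directly. For semantic exemplarness, the text does not specify how the tag set of an image is compared with the tag set of its exemplar, so the sketch assumes the mean pairwise tag similarity; this choice, the function names, and the toy data (including the exact-match `tag_sim`) are assumptions.

```python
from typing import Callable, List, Sequence


def visual_exemplarness(sim: Sequence[Sequence[float]],
                        labels: List[int]) -> float:
    """Average visual similarity between each image and its exemplar."""
    return sum(sim[i][k] for i, k in enumerate(labels)) / len(labels)


def semantic_exemplarness(tags: Sequence[Sequence[str]], labels: List[int],
                          tag_sim: Callable[[str, str], float]) -> float:
    """Average tag similarity between each image and its exemplar.

    Assumption: tag sets are compared by their mean pairwise similarity.
    """
    total = 0.0
    for i, k in enumerate(labels):
        pairs = [tag_sim(t, u) for t in tags[i] for u in tags[k]]
        total += sum(pairs) / len(pairs) if pairs else 0.0
    return total / len(labels)


# Hypothetical toy data: two images, both represented by image 0.
sim = [[0.0, -2.0], [-2.0, 0.0]]
tags = [["sun"], ["sun", "sea"]]
labels = [0, 0]
visual = visual_exemplarness(sim, labels)
semantic = semantic_exemplarness(tags, labels,
                                 lambda t, u: 1.0 if t == u else 0.0)
```

In the toy example, the visual score averages 0.0 and -2.0, and the semantic score averages 1.0 (identical tag sets) and 0.5 (one of two tag pairs matches).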
4.1 Quantitative comparison
We present a quantitative comparison of our approach with several representative approaches, AP (affinity propagation ), BGP (bipartite graph partitioning ), TGP (tripartite graph partitioning ), and two recently developed methods: greedy k-means  and joint clustering . Fig. 6(a) and Fig. 6(b) illustrate the performance of the different approaches in terms of semantic and visual exemplarness with different numbers of exemplars.
For semantic exemplarness, our approach consistently outperforms the other approaches except the joint clustering approach , and its performance is only a little worse than that of joint clustering when the number of exemplars exceeds 50. This is understandable because our approach balances the visual and semantic performance, while joint clustering generates its result by intersecting the clusterings obtained from the visual and text features, and hence may be superior in semantic performance when the cluster number is very large. However, the performance for a modest number of exemplars is more meaningful, because too many exemplars are not preferred in summarization. In this sense, our approach is more satisfactory in semantic performance.
For visual exemplarness, both AP and our approach show significant advantages over the other approaches. The visual performance of our approach is only a little worse than that of AP, which purely uses the visual feature; this is reasonable since our approach also takes the semantic information into consideration. In summary, our approach achieves satisfactory semantic and visual performance compared with the other approaches.
Table 2 (row for joint clustering):
| Joint | -2.205 | -0.917 | -3.596 | -1.062 | - | -0.499 |
4.2 Visual comparison
We present visual results on three representative groups of images from Flickr: “Anchorage, Alaska”, “Roma-Rome”, and “The Great Wall of China”. We crawled the top 970, 928, and 133 images, respectively.
The visual results on “Anchorage, Alaska” and “Roma-Rome” from the six approaches are depicted in Fig. 8, and their quantitative comparison is given in Tab. 2. Our results look visually appealing, and the superiority in semantic performance shows that the obtained visual summary captures the semantic meaning, which benefits from the associated tags. The other methods do not achieve competitive performance because they have only partial or little ability to exploit the homogeneous and heterogeneous relations.
4.3 User study
In addition, we present a user study to compare the visual summaries of the six approaches. We collect feedback from 20 persons on 10 tagged image collections. We show each person a tagged image collection together with a visual summary randomly selected from the results of the six approaches, and ask them to give a score from 1 (the worst) to 5 (the best) indicating how well the visual summary represents the image collection from the visual and semantic perspectives. In the user study, our approach obtains the best score, 4.3, while the scores of the other approaches, AP, BGP, TGP, joint clustering , and greedy k-means , are 3.6, 3, 3.2, 3.9, and 3.8, respectively. This user study demonstrates that our approach produces better visual summaries than the other approaches.
We demonstrate hybrid summarization for Flickr group overview by presenting both image and text summarization. “Flickr groups are a fabulous way to share content and conversation, either privately or with the world. Believe us when we say there’s probably a group for everyone, but if you can’t find one you like, feel free to start your own.” The group images are displayed page by page, and each page shows a dozen images. To get an overview of a group of images, users have to check the images page by page, and there is no textual description of the group of images. Hence it is desirable to deliver visual and textual summaries of the group. Fig. 7 shows an example for “The Great Wall of China”. Surprisingly, the hybrid summarization suggests two tags: simatai and mutianyu. After checking this group manually, we found that it only contains photos from these two sites.
Figure 7: (a) Sample images in the group, (b) visual summary, (c) textual summary.
In this paper, we present a hybrid image summarization scheme to manage image collections. Toward this end, we propose a novel approach, homogeneous and heterogeneous message propagation, which is a non-trivial generalization of the affinity propagation algorithm from homogeneous data to heterogeneous data. Compared with conventional message propagation algorithms that transmit vector-valued messages, our algorithm reduces vector-valued messages to scalar-valued messages, and hence is more efficient. Moreover, this reduction is more complicated in our case than in affinity propagation because it involves additional heterogeneous relations. Applied to hybrid image summarization, our approach effectively exploits image similarities as well as the useful information from the associated tags, including the association relations between images and tags and the relations within tags. The experimental results demonstrate its effectiveness and efficiency.
We rewrite the objective function, which corresponds to Eqn. (11) in the submitted paper.
This objective function is depicted as a factor graph in Fig. 3. The max-sum algorithm , a general algorithm for factor graphs, can obtain the solution by transmitting vector messages between function nodes and variable nodes. We derive a scalar message propagation scheme, homogeneous and heterogeneous message propagation, by reducing the vector messages over function and variable nodes to scalar messages over variable nodes.
The max-sum algorithm is an iterative algorithm that exchanges two kinds of messages: one from function nodes to variable nodes, and the other from variable nodes to function nodes. For the factor graph in Fig. 3 corresponding to Eqn. (23), the message propagation over variables and is almost the same. For convenience, we give the derivation over variable , and drop the subscript in in the following presentation.
There are two messages exchanged between and . The message, , sent from to , consists of real numbers, one for each possible value of . The message, , sent from to , also consists of real numbers. The two messages are depicted in Figs. 4(a) and 4(b), and formulated as follows:
There are two messages exchanged between and . The message, , sent from variable to , consists of n real numbers, one for each possible value of . The message, , sent from variable to , also consists of real numbers. Let represent the edge set connecting image . For simplicity, we drop the subscript in without influencing understanding. The two messages are depicted in Figs. 4(c) and 4(d), and formulated as follows:
In the following, we show that these vector-valued messages can be reduced to scalar-valued messages, making the propagation much more efficient. The derivation is generalized from , but it is nontrivial and more challenging since messages are additionally propagated between heterogeneous data, i.e., images and words. We directly present the results for the and messages, omitting the detailed derivation, which can be obtained using a technique similar to that in . We present the detailed derivation for the and messages. The idea behind the derivation is to analyze the propagated messages in two cases, according to whether is valued as or not.
Let , with .
Let , with . It can be derived that is independent of the specific value .
Let , and
For and , we can obtain
For and , we have the following derivations
It can be observed that only the variables and for and and for are involved in the message passing. Therefore, we can define scalar-valued variables , , , and . These scalar messages are summarized as follows.
Homogeneous message propagation
Two kinds of messages are exchanged between image points. The “responsibility” , sent from data point to data point , reflects how well serves as the exemplar of considering other potential exemplars for , and the “availability” , sent from data point to data point , reflects how appropriate it is for to choose as its exemplar considering other potential points that may choose as their exemplar. The messages are updated iteratively as
Heterogeneous message propagation
Two kinds of messages are exchanged between images and words. The “discardability” , sent from image to word , reflects how much the decision of image to select itself as its exemplar is affected when the contribution of word is discarded, and helps word make a better decision on whether to select itself as its exemplar. The “contributability” , sent from word to image , reflects how well image serves as an exemplar considering whether word is an exemplar. The two messages are updated as