Recent years have shown a massive outburst of multi-media content over the internet, and thus accessing/extracting useful information has become increasingly difficult. Multi-media summarization can alleviate this problem by extracting the crux of data, and discarding the redundant or useless information. A multi-modal form of knowledge representation has several advantages over uni-modal representation of content, as it gives a more complete overview of summarized content, and provides diverse perspectives on the same topic. Having multiple forms of content representation helps reinforce ideas more concretely. Multi-modal summarization can help target a larger set of diverse reader groups, ranging from skilled readers looking to skim the information to users who are less proficient in reading and comprehending complex texts (UzZaman et al., 2011).
Experiments conducted by Zhu et al. (2018) illustrate that a multi-modal form of representation in the summary improves user satisfaction by 12.4% compared to a text-only summary. Most past research has, however, focused on uni-modal summarization (Gambhir and Gupta, 2017; Yao et al., 2017), be it text or images. Multi-modal summarization poses different challenges, as one also needs to take into account the relevance between different modalities. In this paper, we focus on the task of text-image-video summary generation (TIVS) proposed by Jangra et al. (2020). Unlike most text-image summarization research, we use asynchronous data for which there is no alignment among different modalities. We propose a novel differential evolution based multi-modal summarization model using multi-objective optimization (DE-MMS-MOO). The framework of our model is shown in Fig. 1, and the main contributions of this work are as follows:
This is the first attempt to solve the TIVS task using a multi-objective optimization (MOO) framework. MOO helps in the simultaneous optimization of different objective functions, such as cohesion within individual modalities and consistency between multiple modalities.
The proposed framework is generic. Any MOO technique can be used as the underlying optimization strategy. Here we selected a differential evolution (DE) based optimization technique, since it has recently been established that DE performs much better than other meta-heuristic optimization techniques (Das et al., 2007).
The proposed model considers multimodal input (text, images, videos) and produces multimodal output (text, images and videos) with variable size of output summary.
2. Related Work
Many methods have been proposed in the field of text summarization, both extractive and abstractive (Gambhir and Gupta, 2017; Yao et al., 2017). Researchers have tried different approaches to tackle this problem, ranging from integer linear programming (Alguliev et al., 2010) and deep learning models (Zhang et al., 2016) to graph-based techniques (Erkan and Radev, 2004). Research on the joint representation of various modalities (Wang et al., 2016) has made the field of multi-modal information retrieval feasible. Multi-modal summarization techniques vary from abstractive text summary generation using asynchronous multi-modal data (Li et al., 2017) to abstractive text-image summarization using deep neural networks (Zhu et al., 2018). Some research works have used genetic algorithms for text summarization (Saini et al., 2019), yet, to the best of our knowledge, no one has used multi-objective optimization based techniques for solving the TIVS problem.
3. Proposed Model
We propose a multi-objective optimization based differential evolution technique to tackle the TIVS problem. The proposed approach takes as input a topic with multiple documents, images and videos, and outputs an extractive textual summary, along with selected salient images and videos (for simplicity, in the current setting we output a single video only, assuming that one video is often enough).
Given a topic, we have multiple related text documents, images and videos as input. In order to extract important features from the raw data, we fetch key-frames from the videos and combine them with the existing images to form the image-set. The audio is transcribed using the IBM Watson Speech-to-Text Service (http://www.ibm.com/watson/developercloud/speech-to-text.html), and the resulting transcriptions together with the text documents form the text-set. The text in the text-set is encoded using the Hybrid Gaussian-Laplacian mixture model (HGLMM) proposed in (Klein et al., 2014), while the images are encoded using the VGG-19 model (Simonyan and Zisserman, 2014). These model-specific encodings are then fed to a two-branch neural network (Wang et al., 2016) to obtain 512-dimensional image and sentence vectors.
3.2. Main model
3.2.1. Population Initialization
We can see from Table 1 that the Double K-medoid algorithm performs at least as well as the multi-modal K-medoid algorithm in all the modalities (see Section 4.2). Thus, we initialize the population using the Double K-medoid algorithm. Each solution is represented as a pair of sequences, <text cluster centers> : <image cluster centers>, where the text part holds the text cluster centers, up to the maximum number of clusters for text, and the image part is represented similarly. The number of clusters can vary from 2 to the respective maximum for each modality; if the number of clusters is less than the maximum value, we pad the solution.
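The padded two-part representation can be illustrated with a minimal sketch. The names `PAD`, `encode_solution` and `decode_solution` are hypothetical, and using `None` as the padding marker is an assumption made for illustration; the padding value actually used is not specified here.

```python
PAD = None  # hypothetical padding marker for unused cluster slots (assumption)


def encode_solution(text_centers, image_centers, max_text, max_image):
    """Build a fixed-length solution: padded text part followed by padded image part."""
    if not 2 <= len(text_centers) <= max_text:
        raise ValueError("number of text clusters must lie in [2, max_text]")
    if not 2 <= len(image_centers) <= max_image:
        raise ValueError("number of image clusters must lie in [2, max_image]")
    text_part = list(text_centers) + [PAD] * (max_text - len(text_centers))
    image_part = list(image_centers) + [PAD] * (max_image - len(image_centers))
    return text_part + image_part


def decode_solution(solution, max_text):
    """Split a solution back into its text and image cluster centers, dropping padding."""
    text_part = [c for c in solution[:max_text] if c is not PAD]
    image_part = [c for c in solution[max_text:] if c is not PAD]
    return text_part, image_part
```

Because every solution has the same length, solutions with different numbers of clusters can be combined position by position during offspring generation.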
3.2.2. Generation of off-springs
Cross-over: For each solution in the population, we generate a mating pool by randomly selecting solutions to create new solutions. We use Eq. 1 to generate a new offspring, which is then repaired using Eq. 2. Eq. 1 involves the solution for which the new offspring is being generated, two elements from the mating pool of that solution, and the cross-over probability. Eq. 2 uses the lower and upper bounds of the population.
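As a rough illustration of the cross-over and repair steps, the following sketch uses a commonly used differential evolution update and a simple clipping repair. It is not a reproduction of Eq. 1 and Eq. 2: the scale factor `f`, the per-coordinate application of the cross-over probability `cr`, and clipping as the repair rule are all assumptions, and the function names are hypothetical.

```python
import random


def crossover(current, mate_a, mate_b, cr, f=0.5, rng=random):
    """Perturb each coordinate of `current` with probability `cr`.

    The perturbation is the scaled difference of two mating-pool members
    (a standard DE-style update, assumed here for illustration).
    """
    return [
        x + f * (a - b) if rng.random() < cr else x
        for x, a, b in zip(current, mate_a, mate_b)
    ]


def repair(offspring, lower, upper):
    """Clip every coordinate back into [lower, upper] (assumed repair rule)."""
    return [min(max(v, lower), upper) for v in offspring]
```

For example, `repair(crossover(x, a, b, cr=0.8), lower, upper)` yields an offspring that stays within the bounds of the population.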
Mutation: For each solution, we also use three different types of mutation operations: polynomial mutation (see Eq. 3), insertion mutation and deletion mutation. Polynomial mutation helps in exploring the search space and increases the convergence rate (Deb and Tiwari, 2008). Insertion and deletion mutations also enhance exploration capabilities by increasing and decreasing the size of the solution, respectively. The clusters are re-evaluated if an insertion or deletion mutation occurs.
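The three mutation operators can be sketched as follows. The polynomial mutation below is the widely used simplified form with distribution index `eta`; whether it matches Eq. 3 exactly is an assumption. The insertion and deletion operators, their names, and the way a new center is drawn from a list of `candidates` are hypothetical illustrations.

```python
import random


def polynomial_mutation(x, lower, upper, eta=20.0, rng=random):
    """Perturb one real-valued coordinate; larger `eta` gives smaller steps."""
    u = rng.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
    # Keep the mutated value inside the allowed range.
    return min(max(x + delta * (upper - lower), lower), upper)


def insertion_mutation(centers, candidates, max_clusters, rng=random):
    """Add one unused candidate as a new cluster center, if there is room."""
    unused = [c for c in candidates if c not in centers]
    if len(centers) >= max_clusters or not unused:
        return list(centers)
    return list(centers) + [rng.choice(unused)]


def deletion_mutation(centers, min_clusters=2, rng=random):
    """Remove one randomly chosen center, keeping at least `min_clusters`."""
    if len(centers) <= min_clusters:
        return list(centers)
    drop = rng.randrange(len(centers))
    return [c for i, c in enumerate(centers) if i != drop]
```

After an insertion or deletion, cluster memberships would need to be recomputed, since the set of centers has changed.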
3.2.3. Selection of top solutions
We use the concepts of non-dominated sorting and crowding distance to select the best solutions (Deb et al., 2002).
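A compact sketch of these two concepts, in the spirit of NSGA-II, is given below. It assumes that all objectives are maximized and uses a simple quadratic-time front extraction rather than the faster bookkeeping of the original algorithm; the function names are hypothetical.

```python
def dominates(a, b):
    """True if score vector `a` is no worse than `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fronts(scores):
    """Split solution indices into successive Pareto fronts (best front first)."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(scores, front):
    """Per-solution crowding distance within one front; boundary points get infinity."""
    distance = {i: 0.0 for i in front}
    for m in range(len(scores[front[0]])):
        ordered = sorted(front, key=lambda i: scores[i][m])
        lo, hi = scores[ordered[0]][m], scores[ordered[-1]][m]
        distance[ordered[0]] = distance[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            distance[cur] += (scores[nxt][m] - scores[prev][m]) / (hi - lo)
    return distance
```

Selection then fills the next population front by front, and breaks ties within the last admitted front by preferring solutions with a larger crowding distance.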
3.2.4. Stopping criteria
The process is repeated until the maximum number of generations is reached.
3.2.5. Objective functions
We propose two different sets of objective functions, based on which we design two different models.
Summarization-based: Three objectives <Sal(txt) / Red(txt), Sal(img) / Red(img), CrossCorr> are simultaneously maximized. Salience, redundancy and cross-modal correspondence are calculated using Eq. 4, Eq. 5 and Eq. 6, respectively. Here, each cluster belongs to one modality (text or image), and the equations operate on the elements of each cluster.
Clustering-based: We use the PBM index (Pakhira et al., 2004), a popular cluster validity index (a function of cluster compactness and separation), to evaluate the uni-modal clustering for text and images. Thus, we maximize three objectives <PBM(txt), PBM(img), CrossCorr>, where cross-modal correspondence is evaluated by Eq. 6.
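For reference, the PBM index for a crisp clustering with K clusters is commonly written as PBM = ((1/K) * (E_1/E_K) * D_K)^2, where E_1 is the total distance of all points to the overall data center, E_K is the total distance of points to their own cluster centers, and D_K is the largest distance between two cluster centers. The sketch below computes this quantity; using Euclidean distance on plain coordinate tuples is an assumption made for illustration, and `pbm_index` is a hypothetical name.

```python
import math


def pbm_index(points, labels, centers):
    """PBM validity index for a crisp clustering; larger values are better.

    points  : list of coordinate tuples
    labels  : labels[i] is the index of the cluster that points[i] belongs to
    centers : list of cluster-center coordinate tuples (at least two)
    """
    k = len(centers)
    dim = len(points[0])
    grand_center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Spread of the data when treated as a single cluster.
    e_1 = sum(math.dist(p, grand_center) for p in points)
    # Within-cluster spread (compactness); must be non-zero.
    e_k = sum(math.dist(p, centers[label]) for p, label in zip(points, labels))
    # Largest separation between any two cluster centers.
    d_k = max(
        math.dist(a, b)
        for i, a in enumerate(centers)
        for b in centers[i + 1:]
    )
    return ((1.0 / k) * (e_1 / e_k) * d_k) ** 2
```

A compact, well-separated clustering scores higher than a poorly chosen one on the same data, which is why the index is maximized.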
After the termination criterion is met, the model outputs a set of solutions containing text-image pairs of variable length. We select the Pareto optimal solutions from the population, and for each solution, we generate the text summary from the text part of the solution. In order to generate the image summary, we select those image vectors that are not key-frames, and also select those images from the initial images whose cosine similarity lies between a minimum and a maximum threshold (Jangra et al., 2020). For comparison, the threshold values are kept the same as in (Jangra et al., 2020). For each video, a weighted sum of visual and verbal scores is computed as described in (Jangra et al., 2020).
|Model|ROUGE R-1|ROUGE R-2|ROUGE R-L|Image Average Precision|Image Average Recall|Video Accuracy|
|---|---|---|---|---|---|---|
|Random video selection (10 attempts)|-|-|-|-|-|16%|
|Uni-modal optimization based DE-MMS-MOO|0.352|0.183|0.318|0.7382|0.9435|44%|
4. Experimental Setting
We use the multi-modal summarization dataset prepared by (Jangra et al., 2020). The dataset consists of 25 topics describing different events in the news domain. Each topic contains 20 text documents, 3-9 images and 3-8 videos. For each topic there are also three text references, and at least one image as well as one video are provided as the multi-modal summary.
We evaluate our proposed model against several strong baselines, ranging from existing state-of-the-art techniques to novel approaches that we propose.
We evaluate the quality of our text summary against the graph-based TextRank algorithm (Mihalcea and Tarau, 2004), by feeding it the entire text. We use the implementation in Python's open-source Gensim library: https://radimrehurek.com/gensim/summarization/summariser.html.
Baseline-2 (Image match MMS): A greedy technique to generate a textual summary using a multi-modal guidance strategy is proposed in (Li et al., 2017); that paper does not report ROUGE R-L scores. Out of the multiple variations proposed in that work, the image match variant seems to be the most promising, and thus we use it to compare with our model.
Baseline-3 (JILP-MMS): Jangra et al. (2020) proposed a joint integer linear programming based method to generate a multi-modal summary. The JILP-MMS model uses intra-modal salience, intra-modal diversity and inter-modal correspondence as objective functions.
Baseline-4 (Double K-medoid): After the preprocessing step, we perform two separate K-medoid clustering algorithms, one for sentences and the other for images. Since the text and images share the representation space, the other modality participates in the clustering process in the sense that it cannot become the cluster center, but it can still participate in the membership calculation of each cluster. The rest of the process is the same as the standard K-medoid algorithm. For all the K-medoid steps performed in our research, we applied kmeans++ seeding (Arthur and Vassilvitskii, 2006) over randomly initialized cluster centers.
Baseline-5 (multi-modal K-medoid): In this method, we assume that there is one single modality, and we run the K-medoid algorithm until convergence. The top-K sentences and top-K images are selected as the data points which are nearest to the cluster centers for each of the clusters, respectively, for each modality.
Baseline-6 (Uni-modal optimization based DE-MMS-MOO): We use the framework proposed in Section 3, but instead of tri-objective optimization, we optimize two objectives <Sal(txt) / Red(txt), Sal(img) / Red(img)>, where salience and redundancy are calculated using Eq. 4 and Eq. 5, respectively. For a fair comparison, all the hyperparameters and model settings for this baseline are kept the same as those for the proposed models.
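The idea behind the Double K-medoid baseline described above, where the other modality takes part in membership calculation but can never become a cluster center, can be sketched for a single modality as follows. This is an illustrative sketch under stated assumptions: Euclidean distance is used for simplicity, the initial centers are passed in explicitly instead of being seeded, and the names `nearest_center` and `k_medoid_with_guests` are hypothetical.

```python
import math


def nearest_center(point, own, centers):
    """Index (into `centers`) of the medoid closest to `point`."""
    return min(range(len(centers)), key=lambda c: math.dist(point, own[centers[c]]))


def k_medoid_with_guests(own, guests, centers, max_iter=100):
    """K-medoid over `own` points; `guests` join clusters but never become medoids.

    own, guests : lists of coordinate tuples in a shared representation space
    centers     : initial medoids, given as indices into `own`
    Returns the final medoids as indices into `own`.
    """
    centers = list(centers)
    for _ in range(max_iter):
        own_assign = [nearest_center(p, own, centers) for p in own]
        guest_assign = [nearest_center(g, own, centers) for g in guests]
        new_centers = []
        for c in range(len(centers)):
            # Members come from both modalities ...
            members = [p for p, a in zip(own, own_assign) if a == c]
            members += [g for g, a in zip(guests, guest_assign) if a == c]
            # ... but only own-modality points are medoid candidates.
            candidates = [i for i, a in enumerate(own_assign) if a == c] or [centers[c]]
            new_centers.append(
                min(candidates, key=lambda i: sum(math.dist(own[i], m) for m in members))
            )
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Running this once with sentences as `own` and images as `guests`, and once with the roles swapped, gives the two clusterings that the baseline calls for.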
Table 1 shows that our model performs better than the rest of the techniques. To evaluate the scores for our population-based techniques, the maximum of all the evaluation scores is taken per topic, in order to compare the best of our model's ability with the other baselines. The two proposed models, namely Summarization-based DE-MMS-MOO and Clustering-based DE-MMS-MOO, perform better than the other models in different modalities. As the Double K-medoid baseline performs better than the multi-modal K-medoid baseline, we use this technique for solution initialization in all the proposed models. The Uni-modal optimization based DE-MMS-MOO model works better than the clustering-based baselines (Double K-medoid, multi-modal K-medoid). This confirms that differential evolution brings about a positive change. Both the proposed models perform at least as well as the Uni-modal optimization based DE-MMS-MOO baseline when trained under the same settings, and thus we can state that, in order to generate a supplementary multi-modal summary, cross-modal correspondence is an important objective. This shows that multiple modalities assist each other in bringing out more useful information from the data. Since our model produces multiple summaries, it is important to ensure that all of the produced summaries are of good quality. To demonstrate this, we draw a box-and-whisker plot of the ROUGE R-L scores of all the solutions on the final Pareto front, for four randomly selected topics. Since all the modalities are equally significant, we cannot, however, directly comment on the superiority of one model over the other.
In this paper, we propose a novel multi-modal summary generation technique that surpasses the existing state-of-the-art multi-modal summarization models. We use the proposed framework in two different objective settings, both of which have comparable performance in all the modality evaluations. Although we only explore the framework's potential using differential evolution based approaches, the proposed framework is generic and is adaptable to different settings.
Acknowledgement: Dr. Sriparna Saha gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia) for carrying out this research.
- Alguliev et al. (2010). Multi-document summarization model based on integer linear programming. Intelligent Control and Automation 1 (02), pp. 105.
- Arthur and Vassilvitskii (2006). K-means++: the advantages of careful seeding. Technical report, Stanford.
- Das et al. (2007). Automatic clustering using an improved differential evolution algorithm. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 38 (1), pp. 218–237.
- Deb et al. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6 (2), pp. 182–197.
- Deb and Tiwari (2008). Omni-optimizer: a generic evolutionary algorithm for single and multi-objective optimization. European Journal of Operational Research 185 (3), pp. 1062–1087.
- Erkan and Radev (2004). LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, pp. 457–479.
- Gambhir and Gupta (2017). Recent automatic text summarization techniques: a survey. Artificial Intelligence Review 47 (1), pp. 1–66.
- Jangra et al. (2020). Text-image-video summary generation using joint integer linear programming. In European Conference on Information Retrieval.
- Klein et al. (2014). Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399.
- Li et al. (2017). Multi-modal summarization for asynchronous collection of text, image, audio and video.
- Mihalcea and Tarau (2004). TextRank: bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411.
- Pakhira et al. (2004). Validity index for crisp and fuzzy clusters. Pattern Recognition 37 (3), pp. 487–501.
- Saini et al. (2019). Extractive single document summarization using multi-objective optimization: exploring self-organized differential evolution, grey wolf optimizer and water cycle algorithm. Knowledge-Based Systems 164, pp. 45–67.
- Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- UzZaman et al. (2011). Multimodal summarization of complex sentences. In Proceedings of the 16th International Conference on Intelligent User Interfaces, pp. 43–52.
- Wang et al. (2016). Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013.
- Yao et al. (2017). Recent advances in document summarization. Knowledge and Information Systems 53 (2), pp. 297–336.
- Zhang et al. (2016). Multiview convolutional neural networks for multidocument extractive summarization. IEEE Transactions on Cybernetics 47 (10), pp. 3230–3242.
- Zhu et al. (2018). MSMO: multimodal summarization with multimodal output. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4154–4164.