, aiming to generate a short sentence or phrase conditioned on certain visual information. With the development of deep learning and reinforcement learning models, recent years have witnessed promising improvements on these tasks for single-image-to-single-sentence generation.
Visual storytelling goes one step further, extending the input and output to a sequence of images and a sequence of sentences. It requires the model to understand the main idea of the image stream and generate coherent sentences. Most existing methods [10, 16, 34, 25] for visual storytelling extend image captioning approaches without considering the topic information of the image sequence, which leads to semantically incoherent content.
An example of visual storytelling is shown in Figure 1. An image stream of five images about a car accident is presented together with two stories: one written by a human annotator and the other produced by an automatic storytelling approach. There are two problems with the machine-generated story. First, the sentiment expressed in the text is inappropriate. In the face of a terrible car accident, the model uses words with positive emotion, e.g., “happy” and “excited”. Second, some sentences are uninformative. The sentence “we got to see a lot of different things at the event” provides little information about the car accident. This example shows that topic information about the image sequence is important for the story generator to produce an informative and semantically coherent story.
In this paper, we introduce a novel task of topic description generation, which detects the global semantic context of an image sequence, and we generate a story with the guidance of this topic information. Specifically, we propose a framework named Topic-Aware Visual Story Telling (TAVST) to tackle two generation tasks: (1) topic description generation for the given image stream; (2) story generation with the guidance of the topic information. To combine these two tasks effectively, we propose a multi-agent communication framework that regards the topic description generator and the story generator as two agents. To enable the two agents to interact and share information, an iterative updating (IU) module is incorporated into the framework. Extensive experiments on the VIST dataset show that our framework achieves better performance than state-of-the-art methods. Human evaluation also demonstrates that the stories generated by our model are better in terms of relevance, expressiveness and topic consistency.
Given an image stream, we aim to output a topic description and one sub-story per image, which together form a complete story. The proposed framework consists of three stages: visual encoding, the initial stage of generation, and iterative updating (IU). The visual encoder extracts image features as visual context vectors. The initial stage contains the initial versions of the two generation agents. The initial topic description generator takes the visual context vectors as input and generates a topic vector. The initial story generator combines the topic vector and the visual context vectors via a co-attention mechanism and constructs the initial version of the story. Since the generated story can in turn benefit the topic description generator, the two agents communicate with each other in the IU module via a message passing mechanism, which serves as fine-tuning. The overall architecture of our proposed model is shown in Figure 2. Each module is described in detail in the following sections.
Given an image stream, we first extract the high-level visual features of each image with a pre-trained CNN-based model, ResNet. Then, following previous work [34, 27], we employ a bidirectional gated recurrent unit (biGRU) as the visual encoder over the whole image stream. For each visual feature, the encoder produces a forward hidden state and a backward hidden state. The visual features are fed into the encoder sequentially, so each hidden state integrates the visual information from all the images observed so far in its direction.
Finally, the hidden states are fused with the visual representation through a projection matrix followed by a ReLU layer, which yields the final visual context vector of each image.
Initial Topic Description Generator
Given the visual context vectors extracted from the image sequence, we first learn to generate the topic description. In practice, all visual context vectors are concatenated and then fed into the initial topic description generator, which employs a gated recurrent unit (GRU) decoder.
The output of this decoder is a sequence of probability distributions over the whole topic vocabulary. The training loss of the initial topic description generator is the cross-entropy between the generated description and the ground-truth topic description.
Note that the decoder produces a hidden state at each time step. Once the last topic hidden state is obtained, we concatenate all topic hidden states into the topic memory, which is fed into the story generation module.
Initial Story Generator with Co-attention Network
The initial story generator is responsible for generating the story with the guidance of the topic description constructed by the initial topic description generator.
To combine visual information and topic information for story generation, we adopt a co-attention mechanism [17, 11] for context information encoding. The structure of the co-attention encoding module is shown in Figure 3. Specifically, given the visual context vectors and the topic vectors, we first calculate an affinity matrix between them using a learned weight parameter. From this matrix, we compute attention weights over the visual context vectors and over the topic vectors, using four further weight parameters. Based on these attention weights, the visual and semantic attentions are calculated as the weighted sums of the visual context vectors and the topic vectors, respectively. Finally, we concatenate the visual and semantic attentions and pass them through a fully connected layer to obtain the joint context vector.
In the story decoding stage, each joint context vector is fed into a GRU decoder to generate a sub-story sentence for the corresponding image. At each step, we concatenate the previous word token and the context vector as the input to the decoder, which updates its hidden state and outputs a probability distribution over the whole story vocabulary.
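The co-attention encoding described above can be sketched as follows. This is only an illustrative reading that follows the parallel co-attention form of [17]; the parameter names (`W_b`, `W_v`, `W_t`, `w_hv`, `w_ht`, `W_fc`), the tensor shapes, the `tanh` non-linearities and the toy dimensions are assumptions of this sketch, not details confirmed by the text. In particular, the sketch returns a single joint context vector and does not model how one joint vector per image is obtained.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def co_attention(V, T, W_b, W_v, W_t, w_hv, w_ht, W_fc):
    """Parallel co-attention over visual vectors V (n, d) and topic vectors T (m, d).

    All parameter names and shapes are assumptions made for this sketch.
    """
    C = np.tanh(T @ W_b @ V.T)                # (m, n) affinity matrix
    H_v = np.tanh(V @ W_v + C.T @ (T @ W_t))  # (n, k) visual attention features
    H_t = np.tanh(T @ W_t + C @ (V @ W_v))    # (m, k) topic attention features
    a_v = softmax(H_v @ w_hv)                 # (n,) weights over visual vectors
    a_t = softmax(H_t @ w_ht)                 # (m,) weights over topic vectors
    v_att = a_v @ V                           # (d,) visual attention (weighted sum)
    t_att = a_t @ T                           # (d,) semantic attention (weighted sum)
    joint = np.concatenate([v_att, t_att]) @ W_fc  # fully connected layer
    return joint, a_v, a_t


rng = np.random.default_rng(0)
n, m, d, k = 5, 4, 8, 6  # 5 images, 4 topic hidden states, toy dimensions
V, T = rng.normal(size=(n, d)), rng.normal(size=(m, d))
shapes = [(d, d), (d, k), (d, k), (k,), (k,), (2 * d, d)]
params = [rng.normal(size=s) * 0.1 for s in shapes]
joint, a_v, a_t = co_attention(V, T, *params)
print(joint.shape, a_v.round(3), a_t.round(3))
```

Both attention weight vectors are normalized with a softmax, so each attended vector is a convex combination of its inputs.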
Loss Function for Training
We define two loss functions: a cross-entropy (MLE) loss and a reinforcement learning (RL) loss. The MLE loss (Equation 8) is computed over the words of the ground-truth story and optimizes the parameters of the story generator.
Recently, reinforcement learning has proven effective for training text generation models by using automatic metrics (e.g., METEOR) to guide the training process. We also explore an RL-based approach to train our generator. The RL loss is computed from a sentence-level metric that scores the sampled sentence against the ground truth, offset by a baseline. The baseline can be an arbitrary function; for simplicity, we use a linear layer in our experiments. To stabilize RL training, a simple approach is to linearly combine the MLE and RL objectives, with a hyper-parameter controlling the trade-off between them.
In the initial stage, the loss of the topic description generator and the loss of the story generator are combined into a single loss, with a hyper-parameter balancing the two.
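As a rough illustration of how these losses fit together, the sketch below uses a common REINFORCE-with-baseline surrogate for the RL term and convex combinations for the two mixtures. The function names, the symbols `gamma` and `lam`, and the exact form of each combination are assumptions of this sketch; the value 0.8 for the RL weight is the one reported for RL fine-tuning in the experiments.

```python
import math


def mle_loss(gold_log_probs):
    """Negative log-likelihood of the ground-truth words."""
    return -sum(gold_log_probs)


def rl_loss(sampled_log_probs, reward, baseline):
    """REINFORCE-style surrogate: the advantage (reward - baseline) scales the
    log-probability of the sampled sentence. Assumed form, not the paper's equation."""
    return -(reward - baseline) * sum(sampled_log_probs)


def mix_mle_rl(l_mle, l_rl, gamma):
    """Linear combination of the two objectives; gamma is the RL weight (assumed form)."""
    return gamma * l_rl + (1.0 - gamma) * l_mle


def initial_stage_loss(l_topic, l_story, lam):
    """Weighted sum of topic and story losses; which term lam multiplies is an assumption."""
    return lam * l_topic + (1.0 - lam) * l_story


# Toy numbers: per-word log-probabilities and a METEOR-like sentence reward.
gold = [math.log(0.5), math.log(0.25)]
sampled = [math.log(0.5), math.log(0.5)]
l_mle = mle_loss(gold)
l_rl = rl_loss(sampled, reward=0.4, baseline=0.3)
l_story = mix_mle_rl(l_mle, l_rl, gamma=0.8)
print(round(l_mle, 4), round(l_rl, 4), round(l_story, 4))
```

With `gamma = 0` the mixture reduces to pure MLE training and with `gamma = 1` to pure RL training, which matches the two endpoints explored in the sensitivity analysis.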
Iterative Updating Module
Since the generated story can also help the generation of the topic description, we design an iterative updating module in which the two agents interact with each other and are updated iteratively. In the IU module, we generate the topic description from the previously generated story, and then use this topic information to further guide story generation. To distinguish the two agents from their initial versions, we call them the IU versions.
IU Topic Description Generator
We expect the generated story to provide more accurate information for topic description generation than visual information. Therefore, instead of using visual information as input, the IU version of the topic description generator takes the generated story as input. Specifically, the last hidden states of the IU story generator are used as input. Note that the IU topic description generator is initialized from its initial version and continues training with the same objective.
IU Story Generator
The IU story generator shares its structure and parameters with the initial story generator. It takes both the topic vector and the visual context vectors as input. During decoding, the story is the concatenation of the sub-stories generated by the IU story generator, and the last hidden states of the IU story generator are passed to the IU topic description generator for iterative updating.
Loss Function for Training
At each iteration stage, the IU module loss is the weighted sum of the IU topic description generation loss and the IU story generation loss, with a hyper-parameter balancing the two.
The IU topic description generator and the IU story generator communicate with each other iteratively in the IU module until the given number of iterations is reached. The loss for the IU module aggregates the losses of all iteration stages.
Therefore, to train the whole multi-agent learning framework, we introduce a combined loss consisting of the initial loss and the IU module loss, with a hyper-parameter balancing the two. During training, our goal is to minimize this combined loss using stochastic gradient descent.
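The control flow of the IU module can be sketched as below. The two agents are replaced by toy stand-in functions, and the names `run_iu`, `total_loss`, `lam` and `mu`, as well as summing the per-iteration losses and the convex form of the final combination, are assumptions made for illustration rather than the exact formulation.

```python
def run_iu(topic_agent, story_agent, visual_ctx, story_state, n_iters, lam):
    """Alternate the two agents for n_iters rounds and collect per-iteration losses.

    topic_agent(story_state) -> (topic_vec, topic_loss)
    story_agent(topic_vec, visual_ctx) -> (new_story_state, story_loss)
    Both signatures are hypothetical.
    """
    iu_losses = []
    for _ in range(n_iters):
        # The topic agent reads the last hidden states of the story agent.
        topic_vec, l_topic = topic_agent(story_state)
        # The story agent is guided by the refreshed topic and the visual context.
        story_state, l_story = story_agent(topic_vec, visual_ctx)
        iu_losses.append(lam * l_topic + (1.0 - lam) * l_story)
    return story_state, iu_losses


def total_loss(l_init, iu_losses, mu):
    """Combine the initial loss with the summed IU losses (assumed form)."""
    return mu * l_init + (1.0 - mu) * sum(iu_losses)


# Toy stand-ins so the sketch runs end to end; real agents would be GRU decoders.
def toy_topic_agent(story_state):
    return story_state + 1, 1.0


def toy_story_agent(topic_vec, visual_ctx):
    return topic_vec * 2, 2.0


final_state, iu_losses = run_iu(
    toy_topic_agent, toy_story_agent, visual_ctx=None, story_state=0, n_iters=2, lam=0.5
)
print(final_state, iu_losses, total_loss(4.0, iu_losses, mu=0.5))
```

The iteration count of 2 used here is the value reported in the implementation details.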
The VIST dataset is the benchmark for the evaluation of visual storytelling. It includes 10,117 Flickr albums with 210,819 images. We evaluate our method on the VIST dataset and use the same split settings as previous works [10, 34, 27]. The samples are split into three parts: 40,098 for training, 4,988 for validation and 5,050 for testing. Each sample (album) contains five images and a story consisting of five sentences. We use the title of each album as the ground-truth topic description.
| TAVST w/o IU (MLE) | 63.1 | 38.6 | 22.9 | 14.0 | 29.7 | 8.5 | 35.1 |
| TAVST w/o IU (RL) | 63.5 | 39.2 | 23.2 | 14.3 | 30.0 | 8.7 | 35.3 |

| TAVST w/o IU | 5.1 | 2.2 | 12.6 | 10.9 | 4.8 |
We use the pre-trained ResNet-152 model to extract image features. The vocabularies for the story and the topic include the words that appear at least three times in the corresponding parts (i.e., story and title) of the training set. All remaining words are represented as UNK. We adopt GRU models for both the visual encoder and the decoders, with a hidden size of 512. The encoder is bidirectional, while the decoders are unidirectional.
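The vocabulary rule above (keep words seen at least three times, map the rest to UNK) can be sketched as follows. The helper names `build_vocab` and `encode` and the toy corpus are hypothetical; tokenization is assumed to have been done beforehand.

```python
from collections import Counter


def build_vocab(tokenized_texts, min_count=3, unk="UNK"):
    """Map every word seen at least min_count times to an index; index 0 is UNK."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    kept = sorted(w for w, c in counts.items() if c >= min_count and w != unk)
    return {w: i for i, w in enumerate([unk] + kept)}


def encode(tokens, vocab, unk="UNK"):
    """Replace out-of-vocabulary words by the UNK index."""
    return [vocab.get(t, vocab[unk]) for t in tokens]


# Toy training "stories"; a real vocabulary would be built from the training split.
train = [["the", "car"], ["the", "car", "crash"], ["the", "car"]]
vocab = build_vocab(train)
print(vocab, encode(["the", "crash"], vocab))
```

The same routine would be applied separately to the story text and to the album titles, since the two vocabularies are built from their own parts of the training set.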
The batch size is set to 64 during training. We use Adam to optimize our models with an initial learning rate of 0.0002. We first pre-train the initial topic description generator using MLE. Then we pre-train the topic description generator and the story generator jointly using MLE. The number of iterations is set to 2. The RL weight and the loss-balancing hyper-parameters are selected on the validation set (details of the hyper-parameter sensitivity analysis are given in the Supplementary Material). After warm-up pre-training, the RL weight and the learning rate are set to 0.8 and 0.00002, respectively, for fine-tuning with RL, using METEOR scores as the reward. We select the model that achieves the highest METEOR score on the validation set, because METEOR has been shown to correlate better with human judgment than CIDEr-D when few references are available, and to be consistently superior to BLEU@N [22, 25]. At test time, we generate stories by beam search with a beam size of 3.
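For readers unfamiliar with the decoding step, a generic beam search with a beam size of 3 can be sketched as below. This is a textbook variant without length normalization, not necessarily the exact procedure used in the experiments; `step_fn`, the toy transition table and the token names are hypothetical.

```python
import math


def beam_search(step_fn, bos, eos, beam_size=3, max_len=20):
    """Return the highest-scoring token sequence under summed log-probabilities.

    step_fn(prefix) must return a dict {next_token: log_prob}; in a real system
    it would wrap the GRU decoder (hypothetical interface).
    """
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Hypotheses that emit the end token leave the beam.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])[0]


# Toy model in which the greedy first choice ("a") leads to a worse full sequence.
TABLE = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.55), "y": math.log(0.45)},
    "b": {"z": math.log(0.9), "w": math.log(0.1)},
}


def toy_step(prefix):
    return TABLE.get(prefix[-1], {"</s>": 0.0})


print(beam_search(toy_step, "<s>", "</s>", beam_size=3))
print(beam_search(toy_step, "<s>", "</s>", beam_size=1))
```

In the toy model, the wider beam recovers a sequence that greedy decoding (beam size 1) misses, which is the usual motivation for using a beam larger than 1.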
Models for Comparison
We compare our proposed methods with several baselines for visual storytelling, which are detailed as follows.
seq2seq: It generates a caption for each image with a classic sequence-to-sequence model and concatenates all captions to form the final story.
h-attn-rank: On top of the classic sequence-to-sequence model, it adds an additional RNN to select photos for story generation.
HPSR: It introduces an additional RNN stacked on the RNN-based photo encoder to detect scene changes. Information from both RNNs is fed into an RNN for story generation.
AREL: It is based on the framework of reinforcement learning, and the generation of a single word is treated as the policy. The reward model learns the reward function from human demonstrations.
HSRL: It is based on the framework of hierarchical reinforcement learning. The higher-level agent is responsible for generating a local concept for each image as guidance for the lower-level agent, which generates the sentence.
VST: This is the baseline version of our model without using topic information as guidance.
TAVST w/o IU: This is our proposed TAVST method without the IU module, which is equipped only with the initial topic description generator.
TAVST: This is our full model. TAVST (MLE) is trained using MLE while TAVST (RL) is trained via RL loss.
Automatic Evaluation Results
The overall experimental results are shown in Table 1. TAVST (MLE) outperforms all of the baseline models trained with MLE. This confirms the effectiveness of topic information for generating better stories. Notably, TAVST (MLE) is already competitive with the RL-based models and outperforms them (i.e., AREL and HSRL) in terms of METEOR and BLEU@[2-4]. When trained with RL, our TAVST (RL) model further improves the performance, outperforming the two RL models on all metrics except CIDEr-D. Our full model TAVST (both MLE and RL versions) outperforms TAVST w/o IU, which directly demonstrates the effectiveness of the IU module. TAVST w/o IU achieves better performance than VST, which indicates that the topic description generator can provide guidance for story generation.
Topic Description Generation
The overall results for topic description generation are shown in Table 2. TAVST achieves higher performance than TAVST w/o IU, indicating that the generated story helps to generate a better topic description. In general, the description generator obtains low scores on automatic metrics. Observations on the dataset reveal that album titles are relatively short, mostly ranging from 2 to 6 words. Given such short references, it is difficult for models to obtain high scores on automatic metrics. We further examine the generated descriptions and find that some of them are semantically correct. For example, for the reference “happy birthday party at my home”, the generated topic description is “the birthday gathering”. In another example, the reference is “family feast” and the generated topic is “dinner party”. We believe that topic descriptions with such similar meanings can still provide positive guidance for the story generator.
We perform two kinds of human evaluation through Amazon Mechanical Turk (AMT), namely the Turing test and the pairwise comparison. Since we found only one previous work that published sampled results of its model, we chose it for comparison. Specifically, we re-collect human labels for their sample results and for the stories generated by our models on the same subset of albums. A total of 150 stories (750 images) are used, and each of them is evaluated by 3 human evaluators.
For the Turing test, we design a survey (shown in the Appendices) that contains an image stream, a story generated by our TAVST model and a story written by a human. Evaluators are required to choose the story that is more likely to have been written by a human. The experimental result (Table 3) shows that 47.8% of evaluators think the stories generated by our model are written by a human (vs. a 38.4% win rate for AREL).
| TAVST vs. VST | TAVST vs. AREL | TAVST vs. GT |
A good story for an image stream should satisfy three criteria: (1) Relevance: the story should be relevant to the image stream. (2) Expressiveness: the story should be concrete and coherent, and have a human-like language style. (3) Topic Consistency: the story should be consistent with the topic. We compare our model with three alternatives in terms of these three criteria: TAVST vs VST, TAVST vs AREL and TAVST vs ground-truth. In this annotation task, AMT evaluators compare two given stories and choose which one is better in terms of each criterion. Results are shown in Table 4. Our model performs better than the other two models in terms of relevance and topic consistency, and its advantage in topic consistency is more pronounced. This suggests that the topic description generator can help the story generation agent construct a more consistent story.
Further Analysis on Topic Consistency
We further evaluate the quality of the generated story in terms of topic consistency from the perspective of sentiment. Specifically, we employ a lexicon-based approach using a subjectivity lexicon. We count the number of sentiment words in each sentence to evaluate its polarity. The score is 1, 0, or -1 if a sentence is positive, neutral, or negative, respectively. Based on the score of each sentence, two qualitative experiments are designed to measure the in-story sentiment consistency and the topic-story sentiment consistency.
In-story Sentiment Consistency
We argue that the sentiment of sentences in a story should be consistent given the album is related to a certain topic. For each story, we obtain a vector with 5 sentiment scores in correspondence to 5 sentences. We then calculate the standard deviation for the vector to represent the divergence score of a story. For each model, we average the divergence scores of all stories generated as its final score. Figure 4 presents the results from different models. Results illustrate that our method can generate stories with higher in-story sentiment consistency.
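The divergence score can be illustrated with a small sketch. The toy word lists stand in for the subjectivity lexicon, and two details are assumptions of this sketch: a sentence's polarity is decided by comparing positive and negative word counts, and the population standard deviation is used.

```python
from statistics import pstdev

# Toy stand-ins for a subjectivity lexicon (hypothetical word lists).
POSITIVE = {"happy", "great", "fun", "excited"}
NEGATIVE = {"sad", "terrible", "damaged"}


def sentence_polarity(sentence):
    """Return 1, 0 or -1 depending on which kind of sentiment word dominates."""
    tokens = [t.strip(".,!?") for t in sentence.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos > neg) - (pos < neg)


def story_divergence(sentences):
    """Standard deviation of the per-sentence sentiment scores of one story."""
    return pstdev([sentence_polarity(s) for s in sentences])


mixed = [
    "we were happy",
    "the car was damaged",
    "it was terrible",
    "we went home",
    "everyone was sad",
]
consistent = ["the car was damaged"] * 5
print(story_divergence(mixed), story_divergence(consistent))
```

A lower divergence means the five sentences agree more in sentiment; a model's final score is the average of this value over all of its generated stories.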
Topic-Story Sentiment Consistency
Albums related to certain events tend toward a particular polarity. For example, stories about New Year’s Eve are more likely to be positive, while stories about breaking up are more likely to be negative. We enumerate albums of different event types to see whether the model is able to generate stories whose sentiment is consistent with the type of event. For each story, we sum the sentiment scores of its sentences to obtain its final score. The higher the score, the more positive the story is. Four types of events are considered. Results are shown in Table 5. In general, all automatic models tend to generate stories with higher sentiment scores than human-written stories. This is because a large portion of the albums in the dataset are related to positive events. Both VST and AREL generate stories with similar sentiment scores for both types of events, which indicates that they are not able to distinguish positive events from negative ones. With the guidance of the topic description, our model TAVST is able to distinguish events with different sentiment tendencies.
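The story-level score and its aggregation per event type amount to simple arithmetic, sketched below. The helper name `mean_score_by_event`, the per-event averaging and the toy scores are illustrative assumptions; the per-sentence scores are the 1/0/-1 polarities described earlier.

```python
from collections import defaultdict


def mean_score_by_event(stories):
    """Average the summed sentence scores of the stories of each event type.

    stories: iterable of (event_type, [per-sentence scores in {1, 0, -1}]).
    """
    per_event = defaultdict(list)
    for event, sentence_scores in stories:
        per_event[event].append(sum(sentence_scores))  # story-level score
    return {event: sum(v) / len(v) for event, v in per_event.items()}


# Made-up scores for illustration only, not results from the paper.
toy_stories = [
    ("new year's eve", [1, 1, 0, 1, 1]),
    ("new year's eve", [1, 0, 0, 1, 0]),
    ("breaking up", [-1, 0, -1, 0, 0]),
]
print(mean_score_by_event(toy_stories))
```

A model that distinguishes event types should show clearly different averages for positive and negative events, whereas similar averages across events indicate that it does not.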
Figure 5 shows an example of the ground-truth story and stories generated automatically by different models. The words in red, blue and yellow represent the topic, subject, and emotion, respectively. Our model shows promising results in terms of topic consistency, which further supports that it can extract an appropriate topic to guide the generation of a topic-consistent story.
4 Related work
Our research is related to image captioning, visual storytelling and multi-task learning. In early works [33, 14, 4], image captioning is treated as a ranking problem, in which retrieval models identify similar captions from a database. Later, end-to-end frameworks based on CNNs and RNNs were proposed [32, 12, 23, 3, 6]. Such work focuses on the literal description of image content. Although encouraging results have been achieved in connecting vision and language, the generated text is limited to a single sentence.
Visual storytelling is the task of generating a narrative paragraph given an image stream. The pioneering work retrieves a sequence of sentences for an image stream. Huang et al. [10] introduce the first dataset (VIST) for visual storytelling and establish some baseline approaches. An attention-based RNN with a skip gated recurrent unit is designed to maintain longer-range information. Yu et al. [34] design a hierarchically-attentive RNN structure. Recently, a reinforcement learning framework with two discriminators has been proposed for this task. Because hand-coded evaluation metrics can introduce bias, Wang et al. [27] propose an adversarial reward learning framework to uncover a robust reward function from human demonstrations.
The work most similar to ours is that of Huang et al. [9]. They propose to generate a local semantic concept for each image in the sequence and to generate a sentence for each image using a semantic compositional network, in a hierarchical reinforcement learning fashion. Although both works use topic information to facilitate story generation, our model differs in three aspects. First, the concepts of topic are different. We treat the topic as the global semantic context of the album, while in their case the topic represents local semantic information attached to each single image. Second, our modeling of the topic is more interpretable. We generate the topic description directly instead of producing a latent representation, which provides more insight for further improving the performance. Third, our communication framework is compatible with any RL-based training method. Experimental results also show that with RL, our framework can outperform their model.
Collobert and Weston (2008) first proposed a method for processing NLP tasks in a deep learning framework using multi-task learning. Jing et al. [11] build a multi-task learning framework that jointly performs the prediction of tags and the generation of paragraphs. These multi-task learning methods share part of the network structure and design a task-specific structure at the output layer, improving the performance of the different tasks. Unlike these multi-task learning methods, we adopt a multi-agent approach [21, 26]. In this work, we define two kinds of agents for the two generation tasks, which can interact and share useful information. We also note that some works in other areas [31, 28] consider incorporating topic information.
5 Conclusions and Future Work
In this paper, we introduce a topic-aware visual storytelling task, which identifies the global semantic context of a given image sequence and then generates the story with the help of this topic information. We propose a multi-agent communication framework that effectively combines two generation tasks, namely topic description generation and story generation. In the future, we will explore modeling topic generation as a keyword extraction task. We will also study how to identify the interactions among images for better visual information encoding.
- [1] (2015) VQA: visual question answering. In ICCV.
- [2] (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, pp. 65–72.
- [3] (2017) Towards diverse and natural image descriptions via a conditional GAN. In CVPR, pp. 2970–2979.
- [4] (2013) Image description using visual dependency representations. In EMNLP, pp. 1292–1302.
- [5] (2018) A question type driven framework to diversify visual question generation. In IJCAI, pp. 4048–4054.
- [6] (2019) Bridging by word: image grounded vocabulary construction for visual captioning. In ACL, pp. 6514–6524.
- [7] (2018) A reinforcement learning framework for natural question generation using bi-discriminators. In COLING, pp. 1763–1774.
- [8] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- [9] (2019) Hierarchically structured reinforcement learning for topically coherent visual story generation. In AAAI, pp. 8465–8472.
- [10] (2016) Visual storytelling. In NAACL, pp. 1233–1239.
- [11] (2018) On the automatic generation of medical imaging reports. In ACL, pp. 2577–2586.
- [12] (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR, pp. 3128–3137.
- [13] (2015) Adam: a method for stochastic optimization. In ICLR.
- [14] (2011) Baby talk: understanding and generating simple image descriptions. In CVPR, pp. 1601–1608.
- [15] (2004) Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL, pp. 605.
- [16] Let your photos talk: generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In AAAI, pp. 1445–1452.
- [17] (2016) Hierarchical question-image co-attention for visual question answering. In NIPS, pp. 289–297.
- [18] (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318.
- [19] (2015) Expressing an image stream with a sequence of natural sentences. In NIPS, pp. 73–81.
- [20] (2017) Deep reinforcement learning-based image captioning with embedding reward. In CVPR, pp. 290–298.
- [21] Learning multiagent communication with backpropagation. In NIPS, pp. 2244–2252.
- [22] (2015) CIDEr: consensus-based image description evaluation. In CVPR, pp. 4566–4575.
- [23] (2017) Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 39 (4), pp. 652–663.
- [24] (2019) Hierarchical photo-scene encoder for album storytelling. In AAAI.
- [25] (2018) Show, reward and tell: automatic generation of narrative paragraph from photo stream by adversarial training. In AAAI, pp. 7396–7403.
- [26] (2019) A multi-agent communication framework for question-worthy phrase extraction and question generation. In AAAI.
- [27] (2018) No metrics are perfect: adversarial reward learning for visual storytelling. In ACL, pp. 899–909.
- [28] (2019) Topic-aware neural keyphrase generation for social media language. In ACL, pp. 2516–2526.
- [29] Recognizing contextual polarity in phrase-level sentiment analysis. In EMNLP, pp. 347–354.
- [30] A study of reinforcement learning for neural machine translation. In EMNLP, pp. 3612–3621.
- [31] (2017) Topic aware neural response generation. In AAAI.
- [32] (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, pp. 2048–2057.
- [33] (2011) Corpus-guided sentence generation of natural images. In EMNLP, pp. 444–454.
- [34] (2017) Hierarchically-attentive RNN for album summarization and storytelling. In EMNLP, pp. 966–971.
- [35] (2017) Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In ICCV.
Appendix A Hyper-Parameter Sensitivity Analysis
Impact of and
In our experiments, we try many different values of and on the validation set. We found that and = 0.7 best balances the topic extractor and the story generator. As we know, the overall performance of our model depends more on the generated story rather than the generated topic. The topic guides story generation, which plays an auxiliary role. So the weight for the topic extractor should be lower than the weight of story generator.
Impact of and
The choice of the iteration number affects the performance of the model. In our experiments, we find that 1 or 2 iterations perform better than 0 iterations, and that 2 iterations perform best on the validation set. With 3 iterations, however, the performance declines. In addition, we observe that a value of 0.3 for the accompanying loss-balancing hyper-parameter has a good regulating effect.
The RL weight controls the trade-off between the MLE and RL objectives. For comparison, we set it to 0, 0.5, 0.8 and 1 in our experiments. The results show that the values 0.5, 0.8 and 1 all achieve better performance than 0, and that 0.8 best balances the RL loss and the MLE loss.