Using Inter-Sentence Diverse Beam Search to Reduce Redundancy in Visual Storytelling

by   Chao-Chun Hsu, et al.
Academia Sinica

Visual storytelling involves two important aspects: the coherence between the story and the images, and the structure of the story itself. For image-to-text neural network models, similar images in a sequence provide nearly identical information to the story generator, which then produces nearly identical sentences. However, repeatedly narrating the same objects or events undermines the story structure. In this paper, we propose an inter-sentence diverse beam search to generate more expressive stories. Compared with recent visual storytelling models, which generate each sentence without considering the sentence generated for the previous picture, our method avoids producing identical sentences even when given a sequence of similar pictures.



1 Introduction

Visual storytelling has gained popularity in recent years. The task requires machines both to comprehend the content of a stream of images and to produce a narrative story. The rapid progress of neural networks has enabled models to achieve promising performance on image captioning Xu et al. (2015); Vinyals et al. (2017), which is also an image-to-text problem. Nonetheless, visual storytelling is considerably more difficult and complicated: beyond generating individual sentences in isolation, forming a complete story requires taking into account coherence, the main focus of the story, and even creativity.

Park and Kim (2015) viewed this problem as a retrieval task and incorporated information about discourse entities to model coherence. Liu et al. (2016) designed a variant of the GRU that can "skip" some inputs, which strengthens its ability to handle longer dependencies. However, retrieval-based methods are less general, being limited to previously seen sentences. In a task where the output varies dramatically, a more flexible way of generating stories seems more appropriate. Huang et al. (2016) published a large dataset of photo streams, each paired with a story. They also proposed a neural baseline model for this task based on the sequence-to-sequence framework: all images are first encoded, and the encoder hidden state is then used as the initial hidden state of a decoder GRU that produces the whole story. In addition to visual grounding, we expect generated stories to be as similar as possible to those written by humans. This can be broken down into several characteristics, such as style, repeated use of words, or the level of detail in the paragraph. Since it is hard to enumerate all possible features, Wang et al. (2018a) adopted the SeqGAN framework, designing two discriminators: one responsible for the degree of matching between an image and a sentence, the other focused on the text alone, trying to mimic human language style. More recent work analyzes the relation between scores from automatic metrics such as BLEU Papineni et al. (2002) and those from human evaluation; it uses an inverse reinforcement learning framework, attempting to "learn", via an adversarial network, a reward similar to the criterion of human judgment Wang et al. (2018b).

However, these neural models aim to learn the human distribution, have little to do with creativity, and consume large amounts of computing resources and time. In this paper, we propose an inter-sentence diverse beam search that generates interesting sentences and avoids redundancy given a sequence of similar photos. Compared with diverse beam search, which generates different groups of sentences given a single condition, our inter-sentence version focuses on producing varied sub-stories given a sequence of images. The proposed method improves the max METEOR score on the visual storytelling dataset from 29.3% to 31.7% over the baseline model.

the friends were excited to go out for a night .
we had a lot of fun .
we had a great time .
we had a great time .
we all had a great time .
Table 1: Repetitive story generated by the baseline model

2 Method

2.1 Basic Architecture

The baseline model is an encoder-decoder framework, as shown in Figure 1. We then apply the proposed method on top of the baseline model.


We utilize a pretrained ResNet-152 and extract the output of the second-to-last fully connected layer to form a 2048-dimensional image representation vector. Since a story is composed of five images, the model should take their order into consideration as well as memorize the content of previous pictures. Thus, a bidirectional GRU takes the five image vectors as input and produces five context-aware image embeddings.
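The encoding pipeline can be sketched in plain NumPy (a toy reimplementation with random placeholder weights, not the authors' code; the 2048-to-256 projection follows the reported hyperparameters, while the GRU hidden size of 256 per direction is our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Bare-bones GRU cell: h' = (1 - z) * h + z * tanh(Wh [x; r*h])."""
    def __init__(self, in_dim, hid_dim):
        s = 1.0 / np.sqrt(hid_dim)
        self.Wz = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))
        self.Wr = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))
        self.Wh = rng.uniform(-s, s, (hid_dim, in_dim + hid_dim))
        self.hid_dim = hid_dim

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)          # update gate
        r = sigmoid(self.Wr @ xh)          # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def encode_sequence(features, fwd, bwd, proj):
    """Project ResNet features, run a GRU both ways, concatenate states."""
    xs = [proj @ f for f in features]      # 2048-d -> 256-d per image
    h = np.zeros(fwd.hid_dim)
    fwd_states = []
    for x in xs:                           # left-to-right pass
        h = fwd.step(x, h)
        fwd_states.append(h)
    h = np.zeros(bwd.hid_dim)
    bwd_states = []
    for x in reversed(xs):                 # right-to-left pass
        h = bwd.step(x, h)
        bwd_states.append(h)
    bwd_states.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd_states, bwd_states)]

features = [rng.standard_normal(2048) for _ in range(5)]  # stand-ins for ResNet-152 outputs
proj = rng.uniform(-0.02, 0.02, (256, 2048))
fwd, bwd = GRUCell(256, 256), GRUCell(256, 256)
embeddings = encode_sequence(features, fwd, bwd, proj)
# five context-aware embeddings, each concatenating the two directions
```

Each image embedding thus sees both earlier and later images in the sequence, which is what lets the decoder condition on photo order.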


Given an image embedding, another GRU produces one sentence at a time. To form a complete story, we first generate the five sub-stories separately and then concatenate them. We tried both using the image embedding as the initial hidden state and concatenating the image embedding with the word embedding as the decoder input at each time step. The latter yields a smoother validation curve, so we adopted this setting in our final submission.
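The two decoder-input schemes can be contrasted with a small sketch (plain lists for clarity; the 512-d image embedding and 256-d word embedding are assumptions consistent with the hyperparameters reported later, and the variable names are ours):

```python
# Scheme A vs. Scheme B for feeding the image embedding to the decoder GRU.
img_emb = [0.1] * 512    # context-aware image embedding from the encoder
word_emb = [0.2] * 256   # embedding of the previously generated word

# Scheme A: the image embedding initializes the decoder hidden state;
# only the word embedding is fed at each time step.
hidden_a, step_input_a = img_emb, word_emb

# Scheme B (adopted; smoother validation curve): the decoder input at
# every time step is the concatenation [image embedding ; word embedding].
hidden_b = [0.0] * 512
step_input_b = img_emb + word_emb    # image context re-injected each step
```

Scheme B keeps the visual context available at every decoding step rather than only at initialization, which plausibly explains the smoother validation behavior.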

2.2 Decoding techniques

We noticed that a major issue with the baseline model is that the generated sentences may repeat themselves, as in Table 1, because of similar images or a generic story, e.g., "i have a great time." To overcome this, we propose a variation of beam search that takes the previously generated sentences into account.

Inspired by the diverse beam search method of Vijayakumar et al. (2016), our model is aware of the words used in the previous, incomplete story when decoding a sub-story: it calculates a score for each candidate next word from the word's probability and a diversity penalty, discussed below, then reranks the candidates by score and selects a word. In our setting, we represent the previous sentences as a bag of words and adopt the relatively simple Hamming diversity, which punishes a selected token in proportion to its previous occurrences, but any kind of penalty can be plugged into this framework. Unlike Vijayakumar et al. (2016), whose work deals with intra-sentence diversity, ours focuses on inter-sentence diversity, which is more suitable for this task.
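The Hamming diversity penalty described above can be sketched as follows (an illustrative reimplementation; the function and variable names are ours, not the paper's):

```python
from collections import Counter
from math import log

def diversity_scores(token_logprobs, previous_tokens, lam=2.0):
    """Score each candidate token as log P(token) minus lam times the
    number of times it already appeared in earlier sub-stories.

    token_logprobs: dict token -> log P(token | context)
    previous_tokens: bag of words from previously generated sentences
    """
    counts = Counter(previous_tokens)
    return {tok: lp - lam * counts[tok] for tok, lp in token_logprobs.items()}

previous = "we had a great time we had a lot of fun".split()
logprobs = {"great": log(0.5), "peaceful": log(0.2)}
scores = diversity_scores(logprobs, previous)
# "peaceful", unseen so far, now outscores the more probable but
# already-used "great"
```

The penalty only reranks candidates; the underlying language model probabilities are untouched, which is why any other penalty function can be swapped in.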

Inter-sentence Diverse Beam Search

For the decoding of the first sentence, we perform a regular beam search with no diversity penalty. For each subsequent sentence, we take the diversity with respect to all previously generated sentences into account during generation.

From a finite vocabulary $\mathcal{V}$ and an input $x$, the model outputs a sequence $y = (y_1, \dots, y_T)$, $y_t \in \mathcal{V}$, based on the probability $P(y \mid x)$. Let $\theta_t(y_t) = \log P(y_t \mid y_{t-1}, \dots, y_1, x)$; we denote a partial sequence up to time $t$ by $y_{[t]}$ and its log-probability by $\Theta(y_{[t]}) = \sum_{\tau \le t} \theta_\tau(y_\tau)$. At each time step $t$, the set $Y^i_{[t]}$ of $B$ beams for the $i$-th image is updated by extending the set $Y^i_{[t-1]}$. Diverse beam search adds a diversity penalty, computed by a diversity function $\Delta(\cdot)$ over the previously generated sentences and multiplied by the diversity strength $\lambda$. Using the notation of Vijayakumar et al. (2016), each time step of this process for image $i$ can be presented as

$$ Y^i_{[t]} = \operatorname*{argmax}_{y^1_{[t]}, \dots, y^B_{[t]}} \; \sum_{b=1}^{B} \Theta\!\left(y^b_{[t]}\right) - \lambda\, \Delta\!\left(y^b_{[t]};\, Y^1, \dots, Y^{i-1}\right), $$

where $Y^1, \dots, Y^{i-1}$ are the completed sub-stories of the previous images and $\Delta$ counts how often each selected token already occurred in them.


baseline: the trees were beautiful. we took a picture of the mountains. we took a picture of the mountains. the mountains were beautiful and the. the river was the lake was beautiful.

ours: i went on a hike last week. we had to take a picture of the mountains. they took pictures of the lake and scenery. this is my favorite part of the trip with my wife , i 've never seen such a beautiful view ! it was very peaceful and serene.

Table 2: Generated stories from the two models.
Figure 1: The architecture of the baseline model

In our experiments, we applied Hamming diversity as the penalty function and found that a diversity strength $\lambda = 2$ achieved the best results. Our proposed inter-sentence diverse beam search is illustrated in Algorithm 1.
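The full decoding loop can be sketched as a runnable toy (our reading of the method, not the released code; `toy_model` is a stand-in for the decoder GRU, sentences are fixed-length to keep the example short, and a real decoder would also handle an end-of-sentence token):

```python
from collections import Counter
from math import log

def beam_search(step_logprobs, vocab, prev_bag, lam, beam=2, sent_len=3):
    """One sub-story via beam search with a Hamming diversity penalty."""
    beams = [((), 0.0)]                    # (tokens, penalized score)
    for _ in range(sent_len):
        candidates = []
        for tokens, score in beams:
            logprobs = step_logprobs(tokens)
            for tok in vocab:
                # log-probability minus lam * (occurrences in earlier sentences)
                candidates.append(
                    (tokens + (tok,), score + logprobs[tok] - lam * prev_bag[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return beams[0][0]

def tell_story(step_logprobs, vocab, n_images=5, lam=2.0):
    story, bag = [], Counter()
    for i in range(n_images):
        # first sub-story: plain beam search; later ones are penalized
        # against the bag of words of all previously generated sentences
        sent = beam_search(step_logprobs, vocab, bag, lam=0.0 if i == 0 else lam)
        story.append(sent)
        bag.update(sent)
    return story

# Toy "decoder" that always prefers the same words, mimicking the
# near-identical images that cause the baseline's repetition.
vocab = ["great", "fun", "nice"]
def toy_model(tokens):
    return {"great": log(0.55), "fun": log(0.3), "nice": log(0.15)}

story = tell_story(toy_model, vocab, n_images=3)
# each sub-story now uses different words instead of repeating "great"
```

Even though the toy decoder's distribution never changes, the accumulated bag of words pushes each new sub-story toward unused vocabulary, which is exactly the repetition-avoidance behavior targeted here.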

3 Visual Storytelling Dataset

The Visual Storytelling dataset (VIST) is the first dataset of image sequences paired with text, with two tiers of annotation: (1) descriptions of images-in-isolation (DII); and (2) stories for images-in-sequence (SIS). The dataset contains more than 20,000 unique photos in 50,000 sequences. We eliminate stories that contain broken images.

4 Experiments and results

4.1 Model Setup

We first scale the images to 224x224 and apply horizontal flips during training. The images are then normalized to fit the pretrained ResNet-152 CNN model. For the hyperparameters, the image feature size is 256 and the hidden size of the decoder GRU cell is 512. Words appearing more than three times in the corpus are included in the vocabulary. We use Adam as the optimizer with a learning rate of 2e-4. Scheduled sampling and batch normalization are used during training.
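For reference, the reported setup collected as a single config (values taken from the text above; the key names are ours):

```python
# Hyperparameters reported in Section 4.1, gathered in one place.
config = {
    "image_size": (224, 224),       # inputs scaled before ResNet-152
    "horizontal_flip": True,        # training-time augmentation
    "image_feature_size": 256,
    "decoder_hidden_size": 512,     # decoder GRU cell
    "vocab_min_count": 4,           # keep words appearing more than 3 times
    "optimizer": "adam",
    "learning_rate": 2e-4,
    "scheduled_sampling": True,
    "batch_norm": True,
}
```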

4.2 Result

Every photo sequence in the test set has 2 to 5 reference stories. We evaluate our models by the max METEOR score over the references of each photo sequence. As shown in Table 3, inter-sentence diverse beam search improves the max METEOR score from 29.3 to 31.7. Besides the improvement on the metric, the proposed method generates more interesting sentences (Table 2).
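The max-METEOR protocol amounts to taking the best reference per sequence, then averaging over sequences. A sketch (the helper names are ours; `metric` stands in for a real METEOR implementation, replaced here by a crude token-overlap stub):

```python
def max_metric_score(stories, references_per_story, metric):
    """For each sequence, score against every reference and keep the max;
    report the corpus average as a percentage."""
    per_sequence = [
        max(metric(story, ref) for ref in refs)   # best of the 2-5 references
        for story, refs in zip(stories, references_per_story)
    ]
    return 100.0 * sum(per_sequence) / len(per_sequence)

def overlap(hyp, ref):
    """Toy stand-in for METEOR: fraction of reference tokens in hypothesis."""
    hyp_set = set(hyp.split())
    ref_tokens = ref.split()
    return sum(t in hyp_set for t in ref_tokens) / len(ref_tokens)

score = max_metric_score(
    ["we had a great time"],
    [["we had a great time", "the party was fun"]],
    overlap)
```

Taking the max per sequence rewards a story for matching any one human telling, which suits a task where several different stories can fit the same photos.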

         Baseline   Ours
METEOR   29.3       31.7
Table 3: Max METEOR score (%)

5 Conclusion and Future Work

We proposed a new decoding approach for the visual storytelling task that avoids generating repeated information across the photo sequence; instead, our model attempts to produce a diverse expression for each image.

Nevertheless, the value of the diversity strength requires human heuristics to determine the trade-off between diversity and output sequence probability. Future efforts should be dedicated to introducing a data-driven method that lets the machine learn whether to put its attention on the sequence probability or on the distinct details in the image.