Recent deep learning based approaches have shown promising results for vision-to-language problems [24, 11, 3, 29, 17, 4] that require the generation of text descriptions from given images or videos. Most existing methods have focused on giving direct and factual descriptions of visual content. While this is a promising first step, it is still challenging for artificial intelligence to connect vision with more naturalistic and human-like language. One emerging task that takes a step closer to human-level description is visual storytelling. Given a stream (set) of photos, this task aims to create a narrative, evaluative, and imaginative story based on semantic visual understanding. While conventional visual descriptions are visually grounded, visual storytelling tries to describe the contextual flow and overall situation across the photo stream, so its output sentences can contain words for objects that do not even appear in the given images. Therefore, filling in the visual gap between the given photos with a subjective and imaginative story is the main challenge of visual storytelling.
Generated Story: (a) The fans were excited for the game. (b) There were many people there. (c) The lead singer performed a great performance. (d) The game was very intense. (e) This was a great game of the game.
In this paper, we propose to explicitly learn to imagine the storyline that bridges the visual gap. To this end, we present an auxiliary hide-and-tell training task to learn this ability. As shown in Fig. 1, one or more photos in the input stack are randomly masked during training. We train our model to produce a full, plausible story even when one or more photos are missing. This image dropout in training encourages our model to describe what is happening in the given stream of photos, as well as between the photos. Since this story imagination task is an ill-posed problem, we follow curriculum learning: we start with the original setting in the early steps and gradually increase the number of dropped images during training.
Furthermore, we propose an imagination network that is designed to learn non-local relations across photo streams to refine and improve, in a coarse-to-fine manner, the recurrent neural network (RNN) based baseline. We build upon a strong baseline model (XE-ss) that has a CNN-RNN architecture and is trained with cross-entropy loss. Since we focus on learning contextual relations among all given photo slots, even those with missing photos, we propose to add a non-local (NL) layer after the RNN block to refine long-range correlations across the photo stream. Our imagination network consists of a first CNN block followed by a stack of two RNN-NL blocks with a residual connection between them; a subsequent gated recurrent unit (GRU) then outputs the final storyline.
In the experimental section, we evaluate our results with the automatic metrics BLEU, METEOR, ROUGE, and CIDEr. We conduct a quantitative ablation study to verify the contribution of each of the proposed design components. We also compare our imagination network with existing state-of-the-art models for visual storytelling. By conducting a user study, we show that our results are qualitatively better than the baselines. Another user study demonstrates that our hide-and-tell network is able to predict a plausible overall storyline even with missing photos. Finally, we introduce a new task of story interpolation, which involves predicting language descriptions not only for the given images, but also for the gaps between the images.
Our contributions are summarized as follows.
- We propose a novel hide-and-tell training scheme that is effective for learning imaginative ability for the task of visual storytelling.
- We also propose an imagination network design that improves over the conventional RNN-based baseline.
- Our proposed model achieves state-of-the-art visual storytelling performance in terms of automatic metrics.
- We qualitatively show that our network faithfully completes the storyline even with a corrupted input photo stream, and is able to predict inter-photo stories.
Visual storytelling is the problem of generating human-like descriptions for images selected from a photo album. Unlike conventional captioning tasks, visual storytelling aims to create a subjective and imaginative story with a semantic understanding of the scenes. Early work exploits user annotations from blog posts. The release of the VIST dataset, which provides narrative stories, has led to several follow-up studies. Approaches based on hierarchical concepts [30, 25] have been proposed, and Wang et al. [wang2018show, wang2018no] formulate the visual storytelling task using adversarial reinforcement learning methods.
Overfitting is a long-standing problem of deep neural networks that degrades performance on test cases. To alleviate this problem, dropout is widely adopted. During training, it randomly drops units in the neural network to avoid severe co-adaptation. For language models, a similar approach named blackout has been proposed to increase stability and efficiency. While dropout is often used at the hidden layers of networks, blackout only targets output layers. Recently, for captioning models, Burns et al. [burns2018women] try to overcome bias in gender-specific words by occluding gender evidence in training images.
The hiding methods described above motivate our input-blinding learning scheme, which randomly obscures one or two images from the input in the training stage. Since the VIST dataset has a fixed number of five input images, a model can overfit when learning relations among the images. From this point of view, our hide-and-tell concept improves performance through regularization. Moreover, unlike conventional captioning, visual storytelling aims to generate subjective and imaginative descriptions. In that regard, our approach has the advantage that the network learns to imagine the skipped input.
Inspired by the human learning process, Bengio et al. [bengio2009curriculum] proposed curriculum learning, which starts from a relatively easy task and gradually increases the difficulty of training. It improves both performance and speed of convergence in various deep learning tasks such as optical flow, visual question answering, and image captioning. We also exploit curriculum learning by scheduling the difficulty of the task. In the early steps of training, no input is obscured. Then, one of the five input images is omitted in a later step. Lastly, two of the five input images are hidden. Training moves on to the next step once the validation loss saturates.
Recently, the non-local neural network was proposed to capture long-range dependencies with self-attention. In other words, it computes relations across spatio-temporal positions. The non-local layer is also a flexible module that suits both convolutional layers and recurrent networks. It is widely used in vision tasks such as scene graph generation and image generation, and in language-related tasks such as image and video captioning, text classification, and sequence labeling. We also apply the self-attention mechanism of the non-local layer in our network, which tries to imagine a story for the hidden images by learning relations between images.
An overview of the proposed imagination network is shown in Fig. 2. Given five input images, the model outputs five corresponding sentences, one per image. Each sentence is a sequence of several words.
Our model operates in three steps: Hide, Imagine, and Tell. First, after the CNN layer, which extracts visual features from each input photo, the hiding step randomly blinds one or two image features. It is implemented by setting the selected feature values to 0. During training, we employ a curriculum learning scheme, which starts with a normal setting (without hiding) and gradually increases the number of hidden image features from 0 to 2. In our preliminary experiment, we found that blinding three or more image features does not provide further performance improvement.
Second, the imagining step consists of the aforementioned RNN-NL block. The goal of this step is to make a coarse initial prediction for the omitted features. Together with a residual connection from the CNN feature stack, this step captures contextual relations between the known image features, focusing on recovering the missing features. Finally, the telling step takes the feature stack from the previous imagining step and refines the relational embedding to capture more concrete semantics throughout the photo stream. The RNN-NL block in this step has the same architecture as that of the imagining step, but the parameters are not shared. The refined feature stack is fed into the decoder to generate the final language output.
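The hiding step from the overview above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `hide_features`, the list-of-lists feature representation, and the optional `rng` argument are all assumptions made for this example. The only behaviour taken from the text is that a chosen number of image features is selected at random and set to 0.

```python
import random


def hide_features(features, num_hidden, rng=None):
    """Zero out `num_hidden` randomly chosen feature vectors.

    `features` is a list of equal-length feature vectors (one per photo).
    Returns the masked copy and the sorted indices that were hidden.
    All names here are hypothetical and chosen for illustration only.
    """
    rng = rng or random.Random()
    if not 0 <= num_hidden <= len(features):
        raise ValueError("num_hidden must be between 0 and len(features)")
    # Pick which photo slots to blind, without replacement.
    hidden = set(rng.sample(range(len(features)), num_hidden))
    masked = [
        [0.0] * len(f) if i in hidden else list(f)
        for i, f in enumerate(features)
    ]
    return masked, sorted(hidden)
```

Calling `hide_features(stack, 2)` on a five-element stack leaves three feature vectors untouched and replaces the other two with zeros. The input is copied rather than modified in place, so the visible features stay available for comparison.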
The input photo stream is fed into the pre-trained CNN layer, which extracts high-level image features. As shown in Fig. 2-(a), one or two of the features are randomly dropped in the hiding step. Although the missing information makes the reconstruction task an ill-posed problem, hiding not only has a regularization effect but also helps our model learn the contextual relations that lead to a performance gain in testing.
In the hiding step, each of the input image features is multiplied by a masking weight that is randomly set during training, which yields a feature set that includes zero-masked features.
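The masking can also be written compactly. The expression below is a sketch under assumed notation, not a recovered equation: the symbols $N$ (number of input images), $f_i$ (CNN feature of the $i$-th image), $m_i$ (masking weight), $\hat{F}$ (feature set including zero-masked features), and $n_{\text{hide}}$ (number of hidden features) are chosen here for illustration and may differ from the original.

```latex
\hat{F} = \{\, m_1 f_1,\; m_2 f_2,\; \dots,\; m_N f_N \,\},
\qquad m_i \in \{0, 1\},
\qquad \sum_{i=1}^{N} (1 - m_i) = n_{\text{hide}} \in \{0, 1, 2\}
```

Under this reading, a weight $m_i = 0$ corresponds to a hidden photo and $m_i = 1$ to a visible one.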
It is very challenging, even for human intelligence, to recover the missing features by using the neighboring photos in the same input stack. To ease the training difficulty in the early steps, we adopt a curriculum learning scheme. In early training, our imagination network is given a fully visible photo stack (i.e., no feature is hidden). When the training loss becomes saturated, we start to hide one image feature from the input stack. Similarly, we proceed to hide two image features in the later steps.
The points at which the number of hidden features is increased are hyperparameters, which we determine empirically as the saturation points. The effect of curriculum learning is shown in the experiment section (Table 2).
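The schedule described above amounts to a piecewise-constant function of training progress. The sketch below assumes that progress is measured in epochs and that the two switch points are passed in as arguments. The function name, the parameter names, and the unit of progress are hypothetical, and no concrete switch values are implied.

```python
def num_hidden_for_epoch(epoch, first_switch, second_switch):
    """Return how many image features to hide at a given epoch.

    0 hidden features before `first_switch`, 1 from `first_switch` up to
    (but excluding) `second_switch`, and 2 afterwards. The switch points
    are placeholders for the empirically chosen saturation points.
    """
    if not 0 <= first_switch <= second_switch:
        raise ValueError("expected 0 <= first_switch <= second_switch")
    if epoch < first_switch:
        return 0
    if epoch < second_switch:
        return 1
    return 2
```

In practice, the switch points could also be triggered by monitoring the loss directly instead of fixing them in advance. The fixed-threshold form is used here only because it is the simplest to state.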
Our imagination network (INet) is designed to learn contextual relations between images in the input stack, and to generate human-like stories even when one or more photos are omitted. Following a coarse-to-fine pipeline, our network includes a coarse imagining step and a fine telling step that correspond to Fig. 2-(b) and (d), respectively. We use an RNN-NL block in both steps.
In the imagining step, the masked feature stack is fed into a bidirectional gated recurrent unit (Bi-GRU). In the forward direction, the Bi-GRU takes the features in order and embeds them into forward hidden states. Then, in the backward direction, reversed hidden states are generated. The hidden states of the two directions are concatenated.
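For readers who prefer a formula, the bidirectional encoding can be written as follows. This is the standard Bi-GRU formulation under assumed notation: $\hat{f}_i$ for the $i$-th (possibly masked) feature, arrows for the two directions, and $[\cdot\,;\cdot]$ for concatenation. The symbols are illustrative and are not taken from the original equations.

```latex
\overrightarrow{h}_i = \mathrm{GRU}_{\mathrm{fwd}}\big(\hat{f}_i,\ \overrightarrow{h}_{i-1}\big), \qquad
\overleftarrow{h}_i = \mathrm{GRU}_{\mathrm{bwd}}\big(\hat{f}_i,\ \overleftarrow{h}_{i+1}\big), \qquad
h_i = \big[\, \overrightarrow{h}_i \,;\, \overleftarrow{h}_i \,\big], \quad i = 1, \dots, N
```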
To model non-local relations between the images, we employ an embedded Gaussian version of a non-local neural network [22, 26]. As illustrated in Fig. 3, our relational embedding differs from most existing non-local approaches in that it treats each input image feature as one element and focuses on the relations among the photos in the stream. The input to the relational embedding is the stack of hidden states from the GRU, and each embedding function is a 1D convolution layer, because our approach does not consider the spatial dimension of each input image but treats each image feature as one element.
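To make the relational embedding concrete, here is a small pure-Python sketch of an embedded Gaussian non-local operation over a sequence of per-photo vectors. Several points are assumptions rather than details from the text: the 1D convolutions are taken to have kernel size 1 (so each acts as the same linear map on every element), the residual term inside the block follows the original non-local block design, and the names `w_theta`, `w_phi`, `w_g`, and `w_out` are hypothetical.

```python
import math


def linear(weight, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local(xs, w_theta, w_phi, w_g, w_out):
    """Embedded Gaussian non-local operation over a sequence of vectors.

    For each element i, attention weights are softmax_j(theta(x_i) . phi(x_j)),
    the response is the attention-weighted sum of g(x_j), and the output is
    w_out applied to that response plus the input x_i (residual term).
    """
    if not xs:
        return []
    theta = [linear(w_theta, x) for x in xs]
    phi = [linear(w_phi, x) for x in xs]
    g = [linear(w_g, x) for x in xs]
    outputs = []
    for i, x in enumerate(xs):
        # Pairwise similarity between element i and every element j.
        scores = [sum(a * b for a, b in zip(theta[i], p)) for p in phi]
        attn = softmax(scores)
        # Attention-weighted sum of the value embeddings.
        y = [sum(attn[j] * g[j][k] for j in range(len(xs)))
             for k in range(len(g[0]))]
        z = linear(w_out, y)
        outputs.append([zk + xk for zk, xk in zip(z, x)])
    return outputs
```

Because every element attends to every other element, a zero-masked slot can receive information from all visible photos in a single step, which is the property the text relies on for imagining hidden inputs.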
Inspired by residual shortcuts, a reminding connection is added to connect the initial CNN features to the end of the first RNN-NL block. By adding the CNN features to the output of the relational embedding layer, the first RNN-NL block is encouraged to focus on recovering the missing features.
In the telling step in Fig. 2-(d), the features from the previous imagining step are fed into the second RNN-NL block, which has the same architecture as the first block but does not share its weight parameters. The features that were hidden during the hiding step are now roughly reconstructed in the feature stack, and the second RNN-NL block refines these features to allow a more concrete and associative understanding of all the photos in the input stream. Thus, to make better language predictions, the second block focuses on refining the features of all photo elements.
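The coarse-to-fine data flow of the imagining and telling steps can be summarized in code. The sketch below only shows the wiring. The two blocks are passed in as callables (`imagine_block` and `tell_block`, both hypothetical names) standing in for the two RNN-NL blocks, and it is an assumption of this example that the block output and the CNN features have the same shape so that they can be added.

```python
def add_stacks(a, b):
    """Element-wise sum of two feature stacks of identical shape."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        raise ValueError("feature stacks must have the same shape")
    return [[u + v for u, v in zip(x, y)] for x, y in zip(a, b)]


def imagine_and_tell(cnn_features, imagine_block, tell_block):
    """Wire up the two-stage pipeline described in the text.

    1. The first block makes a coarse prediction from the (masked) features.
    2. The reminding connection adds the CNN features back to that output.
    3. The second block, with separate parameters, refines the result.
    """
    coarse = add_stacks(imagine_block(cnn_features), cnn_features)
    refined = tell_block(coarse)
    return refined
```

Keeping the blocks as arguments makes it explicit that the two stages share an interface but not parameters. Any pair of functions mapping a feature stack to a feature stack of the same shape can be plugged in.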
The decoder (Fig. 2-(e)) consists of a GRU and generates a sentence for each input photo. To generate each sentence, words are recurrently predicted as one-hot vectors. The prediction uses a fully connected (FC) layer with a non-linearity (e.g., the hyperbolic tangent function).
Our experiments are conducted on the VIST dataset, which provides 210,819 unique photos from 10,117 Flickr albums for visual storytelling tasks. Given five input images selected from an album, five corresponding sentences annotated by users are provided as ground truth. For a fair comparison, we follow the conventional experimental settings used in existing methods [30, 27]. Specifically, three broken images are excluded from our experiments. Also, the same numbers of training, validation, and test samples are used: 40,098, 4,988, and 5,050.
We reproduced XE-ss and set it as our baseline network. Apart from this shared baseline, our approach is entirely different from their adversarial reinforcement learning method. ResNet-152 is used for the pre-trained CNN layer in Fig. 2. The hyperparameters for curriculum learning are chosen empirically. The learning rate is halved from its initial value each time the training difficulty changes (i.e., each time the number of hidden image features increases). The Adam optimizer is used. For non-linearity in the network, ReLU is used for the pre-trained CNN layers and SELU is employed for the imagining step and the telling step. In the decoding stage, beam search is used. For fair experiments, we removed randomness across experiments by fixing the random seed. In other words, our experimental results do not rely on multiple trials.
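The learning-rate rule tied to the curriculum can be expressed as a one-line function. This is a sketch under the assumption that the rate is halved once for each increase in the number of hidden features. The function name is hypothetical and `initial_lr` is a free parameter, since no concrete starting value is assumed here.

```python
def learning_rate(initial_lr, num_hidden):
    """Learning rate after `num_hidden` difficulty increases.

    The rate is halved each time the curriculum moves to the next stage,
    so stage 0 uses `initial_lr`, stage 1 half of it, and stage 2 a quarter.
    """
    if num_hidden < 0:
        raise ValueError("num_hidden must be non-negative")
    return initial_lr * 0.5 ** num_hidden
```

For example, with a placeholder initial rate of 1.0, the three curriculum stages would train at 1.0, 0.5, and 0.25, respectively.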
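Table 1: Ablation study.

| Method | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|
| INet-B | 63.7 | 39.1 | 23.0 | 13.9 | 35.1 | 29.2 | 9.9 |
| INet-N | 64.4 | 39.8 | 23.6 | 14.3 | 35.4 | 29.6 | 9.4 |
| INet-R | 63.5 | 39.0 | 22.9 | 13.9 | 35.0 | 29.4 | 9.2 |

Table 2: Curriculum learning.

| Number of hidden images | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|
| (0, 1, 2) | 64.4 | 40.1 | 23.9 | 14.7 | 35.6 | 29.7 | 10.0 |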
We conduct an ablation study to demonstrate the effects of the different components of our method in Table 1.
Our model has three distinctive components: the hiding step, the non-local attention layer, and the imagination network. We investigate the importance of each component. If we provide non-blinded (i.e., fully visible) input features to the model, the model loses the regularization effect. We call this model INet-B. If we omit the non-local attention layers, the network has to rely only on the recurrent neural network (RNN) to capture the inter-frame relations, missing the complementary effect of the non-local relations. We call this model INet-N. If we do not use the telling step, the model has only one imagining step, which is insufficient for generating more concrete sentences for the photo stream. We call this model INet-R.
In all ablation setups, we observe performance drops. The model INet-B shows that simply using all the image features is not enough to get good results as it is prone to overfitting. This shows the effectiveness of the proposed hide-and-tell learning scheme. The model INet-N suffers from its structural limitation as it purely depends on the recurrent neural networks for modeling the inter-frame relationship, and has difficulty handling complex relations between the frames. The result of model INet-R implies that the refinement stage after the first imagination step is crucial.
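Table 3: Comparison to existing methods.

| Method | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|
| Huang et al. | - | - | - | - | 31.4 | - | - |
| Yu et al. | - | - | 21.0 | - | 34.1 | 29.5 | 7.5 |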
Comparison to Existing Methods. Our approach achieves the best results in the BLEU and METEOR metrics. Compared with previous approaches, our approach handles complex sentences better. However, evaluation metrics are not perfect, as there are many reasonable solutions for narrative story generation. Therefore, we perform a user study and compare our approach with the strongest state-of-the-art baseline. For each user study (Table 4, Table 5), thirty participants answered twenty-five queries. As shown in Table 4, our approach significantly outperforms the baseline, implying that our method produces much more human-like narrations.
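Table 4: User study comparing the baseline with our approach.

| Baseline | Ours | Tie |
|---|---|---|
| 24.7 % | 55.2 % | 20.1 % |

Table 5: User study comparing INet with full input and with hidden input.

| Full input | Hidden input | Tie |
|---|---|---|
| 30.9 % | 40.5 % | 28.6 % |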
We qualitatively compare our model with the baseline in Fig. 5. We can observe that our model produces more diverse and comprehensive expressions. For example, in Fig. 5-(a), the baseline generates repeated sentences (e.g., "The flowers were so beautiful"), whereas our results show a wide variety of sentences (e.g., "Some of the flowers were very colorful." and "The flowers were blooming."). Moreover, there is an apparent gap in how well the pictures are depicted. For the second photo in Fig. 5-(a), our sentence "There were many different kinds of shops there" is a better representation than the baseline's "There were a lot of people there". We observe a similar phenomenon in example (b) as well. While the baseline repeats the same expressions such as "There was a lot of food", our network generates a wide variety of descriptions such as "food", "ingredients", and "meat". These qualitative results again demonstrate that our method greatly improves over the strongest baseline.
In this experiment, we explore the strength of INet by hiding input images at test time. As shown in Fig. 5, one of the five input images is omitted. Specifically, the third and fifth images are masked in Fig. 5-(a) and Fig. 5-(b), respectively. We then show the stories generated by our model and by the baseline. We can clearly see that our method produces a much more natural story and captures the associative relations between the images well. For example, some of the baseline results do not even form a sentence (e.g., "Diplomas and family members were there to support the." or "Diplomas all day."). In contrast, our results not only maintain global coherence over the sentences but are also more locally consistent with neighboring sentences (e.g., "The graduation ceremony was a lot of fun." and "After the ceremony, the students posed for a picture.").
In Table 5, we show that INet with one hidden image can generate a more human-like story than INet without any hidden images. Thanks to the proposed hide-and-tell learning scheme, our INet is equipped with a strong imagination ability regardless of the input image masking.
Story interpolation is a task newly proposed in this paper. It aims to interpolate the story by predicting sentences for the gaps between the photos in the given stream. Since the photo stream consists of temporally sparse images, the current task of visual storytelling has limited expressiveness. The proposed story interpolation task can make the whole story more specific and concrete.
As illustrated in Fig. 6, a story is generated for the five given input images. Additionally, an inter-story of four sentences is created for the inserted black images. The interpolation results maintain both the global context of the whole situation and local smoothness with adjacent sentences. For instance, the generated sentence "The Halloween party was over." maintains both the global context (i.e., a Halloween party) and local smoothness (i.e., the party was over), as it is preceded by "[male] had a great time.".
Motivated by the importance of imagination in the visual storytelling task, we extend our blinding test (Fig. 1) to the story interpolation task. While the blinding test recovers a story for the hidden input, story interpolation additionally generates an inter-story (i.e., five plus four, nine sentences in total). Since creating a story by looking only at the surrounding images, without a corresponding input, clearly requires imagination, our hide-and-tell approach performs well on this task thanks to the new learning scheme and network design.
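The construction of the interpolation input can be illustrated with a short sketch. It interleaves a blank slot between every pair of neighboring photo features, turning five inputs into nine. Using zero vectors for the blank slots is an assumption that mirrors the zero-masking of the hiding step. The text itself speaks of inserted black images, and the function name `interleave_blanks` is hypothetical.

```python
def interleave_blanks(features):
    """Insert a zero vector between every pair of neighboring features.

    A stack of n feature vectors becomes a stack of 2n - 1 vectors in which
    the odd positions are blank slots for the inter-story sentences.
    """
    if not features:
        return []
    dim = len(features[0])
    out = []
    for i, f in enumerate(features):
        if i > 0:
            out.append([0.0] * dim)  # blank slot between photo i-1 and photo i
        out.append(list(f))
    return out
```

With five photo features this yields nine slots, matching the five plus four sentences described above.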
In this paper, we propose the hide-and-tell learning scheme with an imagination network for the visual storytelling task, which calls for subjective and imaginative descriptions. First, the input hiding block omits an image from the input photo stream. Then, in the imagining block, features of the hidden image are predicted by associating inter-photo relations with an RNN and a 1D convolution-based non-local layer. Finally, concrete relations between images are refined to generate sentences in the decoder. In experiments, our approach achieves state-of-the-art performance in both automatic metrics and user studies. We also propose a novel story interpolation task and show that our model imagines the inter-story between the given photos well.
- [1] (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proc. of Association for Computational Linguistics Workshop, pp. 65–72.
- [2] (2009) Curriculum learning. In Proc. of International Conference on Machine Learning, pp. 41–48.
- [3] (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proc. of Computer Vision and Pattern Recognition, pp. 2625–2634.
- [4] (2017) Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia 19 (9), pp. 2045–2055.
- [5] (2019) Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Trans. Pattern Anal. Mach. Intell.
- [6] (2016) Deep residual learning for image recognition. In Proc. of Computer Vision and Pattern Recognition, pp. 770–778.
- [7] (2019) Hierarchically structured reinforcement learning for topically coherent visual story generation. In Proc. of Association for the Advancement of Artificial Intelligence, Vol. 33, pp. 8465–8472.
- [8] (2016) Visual storytelling. In Proc. of North American Chapter of the Association for Computational Linguistics, pp. 1233–1239.
- [9] (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In Proc. of Computer Vision and Pattern Recognition, pp. 2462–2470.
- [10] (2016) BlackOut: speeding up recurrent neural network language models with very large vocabularies. In Proc. of Int’l Conf. on Learning Representations.
- [11] (2015) Deep visual-semantic alignments for generating image descriptions. In Proc. of Computer Vision and Pattern Recognition, pp. 3128–3137.
- [12] (2017) Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pp. 971–980.
- [13] (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out.
- [14] (2019) Contextualized non-local neural networks for sequence learning. In Proc. of Association for the Advancement of Artificial Intelligence.
- [15] (2018) Learning by asking questions. In Proc. of Computer Vision and Pattern Recognition, pp. 11–20.
- [16] (2010) Rectified linear units improve restricted Boltzmann machines. In Proc. of International Conference on Machine Learning, pp. 807–814.
- [17] (2016) Hierarchical recurrent neural encoder for video representation with application to captioning. In Proc. of Computer Vision and Pattern Recognition, pp. 1029–1038.
- [18] (2002) BLEU: a method for automatic evaluation of machine translation. In Proc. of Association for Computational Linguistics, pp. 311–318.
- [19] (2015) Expressing an image stream with a sequence of natural sentences. In Proc. of Neural Information Processing Systems, pp. 73–81.
- [20] (2017) Deep reinforcement learning-based image captioning with embedding reward. In Proc. of Computer Vision and Pattern Recognition, pp. 290–298.
- [21] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- [22] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- [23] (2015) CIDEr: consensus-based image description evaluation. In Proc. of Computer Vision and Pattern Recognition, pp. 4566–4575.
- [24] (2015) Show and tell: a neural image caption generator. In Proc. of Computer Vision and Pattern Recognition, pp. 3156–3164.
- [25] (2019) Hierarchical photo-scene encoder for album storytelling. In Proc. of Association for the Advancement of Artificial Intelligence.
- [26] (2018) Non-local neural networks. In Proc. of Computer Vision and Pattern Recognition, pp. 7794–7803.
- [27] (2018) No metrics are perfect: adversarial reward learning for visual storytelling. In Proc. of Association for Computational Linguistics.
- [28] (2018) LinkNet: relational embedding for scene graph. In Advances in Neural Information Processing Systems, pp. 560–570.
- [29] (2016) Video paragraph captioning using hierarchical recurrent neural networks. In Proc. of Computer Vision and Pattern Recognition, pp. 4584–4593.
- [30] (2017) Hierarchically-attentive RNN for album summarization and storytelling. In Proc. of Empirical Methods in Natural Language Processing.
- [31] (2019) Self-attention generative adversarial networks. In Proc. of International Conference on Machine Learning.