ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer

by   Kohei Uehara, et al.
The University of Tokyo

Image narrative generation describes the creation of stories regarding the content of image data from a subjective viewpoint. Given the importance of the subjective feelings of writers, characters, and readers in storytelling, image narrative generation methods must consider human emotion, which is their major difference from descriptive caption generation tasks. The development of automated methods to generate story-like text associated with images may be considered to be of considerable social significance, because stories serve essential functions both as entertainment and also for many practical purposes such as education and advertising. In this study, we propose a model called ViNTER (Visual Narrative Transformer with Emotion arc Representation) to generate image narratives that focus on time series representing varying emotions as "emotion arcs," to take advantage of recent advances in multimodal Transformer-based pre-trained models. We present experimental results of both manual and automatic evaluations, which demonstrate the effectiveness of the proposed emotion-aware approach to image narrative generation.



1. Introduction

Figure 1. ViNTER controls the emotional changes in the generated narratives via an input “emotion arc.” For example, if the input emotion arc is “neutral sadness neutral”, the model produces a narrative in which the middle part represents a sad feeling. In contrast, if we input “neutral joy joy” as the emotion arc, the model produces a narrative filled with happy emotions from the midpoint onward.

Image captioning is a representative and essential task in research on vision and language. Automatic image captioning facilitates the retrieval of desired images and is also used to communicate the content of images to visually impaired people. Several studies have attempted to generate more accurate and detailed information. Moreover, some procedures generate creative texts, such as visual storytelling  (Huang et al., 2016a; Liu et al., 2017), which uses image sequences as inputs to generate narrative text.

Shin et al. (2018) studied the generation of narrative text from a single image, and Liu et al. (2018) considered poetry generation in the same context. These methods, which are designed to generate creative text from an image or sequence of images, can be considered an important step toward creative artificial intelligence (AI).

In this study, we use the term image narrative to refer to the definition in (Shin et al., 2018). That is, an image narrative is distinguished from a traditional image caption, which aims to describe the factual aspects of the elements contained in the image, in that it includes a broader range of subjective and inferential elements.

To investigate image narratives, we focus on “emotion” as an essential element of narrative text. The relationship between stories (narratives) and emotions has been an essential area of research in the humanities, represented by scientific investigations into the cognitive and affective impacts of literature (Hogan, 2006; Pandit and Hogan, 2006; Johnson-Laird and Oatley, 2008; Hogan, 2010, 2019). In this study, we do not strictly distinguish between the terms “story” and “narrative”.

These findings have been introduced in natural language processing (NLP), and various studies have been conducted on understanding and generating narratives. Efforts to develop a computational approach to emotions in narrative texts can be traced back to the 1980s (Anderson and McMaster, 1982). Story generation with sentiment control has also been explored (Luo et al., 2019; Dathathri et al., 2020; Xu et al., 2020).

In recent years, Transformer-based methods have become standard in NLP research. Since Vaswani et al. (2017) proposed the original Transformer, many Transformer-based methods have been proposed, not only in research on language processing, but also in bimodal works on vision and language (Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2020).

In this study, we propose an approach designed to generate creative sentences from images using Transformer-based pre-trained models, and discuss the applicability of Transformer-based models to creative AI in the fields of machine vision and language processing. Moreover, we draw on the idea of an “emotion arc,” a representation of the fluctuation of emotions as time-series data, which has been confirmed to have a close relationship with narrative structure (Reagan et al., 2016). The proposed approach is called ViNTER (Visual Narrative Transformer with Emotion arc Representation), and is a story-generation model that takes an image and a target emotion arc as input.

Our main contributions are summarized as follows.

  • We propose ViNTER as a novel image-narrative generation method. Benefitting from recent advances in multimodal Transformer-based pre-trained models and emotion-awareness, ViNTER exhibited notable performance on image-narrative generation tasks, as measured in both automatic and manual evaluations.

  • We discuss the applicability of Transformer-based models to creative AI in the fields of machine vision and language processing.

2. Related Work

Figure 2. Region-based features extracted from the image and emotion arc are fed into the bi-directional Transformer encoder. Then, the output of the encoder is fed into the decoder, which generates each word of the narrative sequentially.

2.1. Emotion in Narratives

Emotion is an essential element in storytelling. Referring to Ackerman and Puglisi (2012), Kim and Klinger (2019a) observed that emotion is a key component of every character, and analyzed how emotions were expressed non-verbally in a corpus of short fanfiction stories. Lugmayr et al. (2017) also noted emotions as a fundamental aspect of storytelling as part of the cognitive aspects evoked by a story in its audience.

As described by Anderson and McMaster (1982), numerous efforts have been made to elucidate the relationship between emotions and stories. Based on the 1,000 most frequently used English words for which Heise (1965) obtained semantic differential scores, Anderson and McMaster (1982) reported the development of a computer program designed to assist in the analysis and modeling of emotional tone in text by identifying those words in passages of discourse. Emotional analysis of text is closely related to lexicons annotated with emotions, and psychological findings have been referred to in these annotations and analyses.

Kim and Klinger (2019b) pointed out three theories of emotion commonly used in computational analyses, including Ekman’s theory of basic emotions (Ekman, 1993), Plutchik’s wheel of emotion (Plutchik, 1980), and Russell’s circumplex model (Russell, 1980).

Several studies have attempted to control story generation by considering emotions (Chandu et al., 2019; Luo et al., 2019; Brahman and Chaturvedi, 2020; Dathathri et al., 2020; Xu et al., 2020). Most of these works have focused on sentiment control, that is, positive or negative control of story generation. Story generation with a greater variety of emotions had not been addressed until recently. To the best of our knowledge, Brahman and Chaturvedi (2020) performed the first work on emotion-aware storytelling, which considered the “emotion arc” of a protagonist.

Referring to a talk by Kurt Vonnegut (Vonnegut, 1995), a famous American writer, attempts have been made to classify stories by drawing an “emotion arc” (Reagan et al., 2016; Chu and Roy, 2017; Somasundaran et al., 2020; Vecchio et al., 2020). Reagan et al. (2016) showed that stories collected from Project Gutenberg could be classified into six styles by considering their emotion arcs (i.e., the trajectory of average emotional tone expressed over the course of a story).

Based on this idea, Brahman and Chaturvedi (2020) considered variations in the emotions expressed by story protagonists as time series to generate emotion-aware stories. They split a five-sentence story into three segments, including a beginning (the first sentence), a body (the second to fourth sentences), and an ending (the last sentence). They applied labels for emotions in each segment to represent the emotional arc, and utilized them in story generation.

2.2. Transformer-based Language Models for Vision and Language

Transformer was proposed as an encoder-decoder model. Both the encoder and decoder parts include a self-attention mechanism. Language models designed to be pre-trained using large unsupervised datasets have been proposed and have been shown to perform well in various natural language processing tasks. Typical examples include Generative Pre-trained Transformer (GPT) (Radford et al., 2018) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019). Whereas these methods use only decoder and encoder parts, respectively, the Bidirectional and Auto-Regressive Transformer (BART) (Lewis et al., 2020) has been proposed as a typical encoder-decoder model; such methods have been shown to be more versatile.

In recent years, the Transformer architecture has made a significant impact, not only on NLP, but also on other modalities. Moreover, multimodal applications such as those combining vision and language, have attracted the attention of researchers. There are two main approaches that combine vision and language processing in Transformer models, including two-stream and single-stream models. Two-stream models (Lu et al., 2019; Tan and Bansal, 2019) utilize separate Transformers for each modality, and a cross-modality module is adopted. Single-stream models (Su et al., 2020) directly input the text and visual embeddings into a single Transformer. Su et al. (2020) argued for the usefulness of using single-stream methods, as opposed to ViLBERT (Lu et al., 2019) and LXMERT  (Tan and Bansal, 2019), each of which adopts a two-stream approach. They explained that the network architecture of the attention pattern in the cross-modal Transformer is restricted in ViLBERT and LXMERT, and proposed VL-BERT as a unified architecture based on Transformer models, without any restriction on the attention patterns, in which the visual and linguistic contents interact early and freely. Thus, the single-stream method has shown promising results thus far.

The proposed method is based on UNITER (Universal Image-Text Representation) (Chen et al., 2020) as a single-stream model. UNITER has shown excellent performance in many vision and language tasks, such as visual question answering (VQA) and image-text retrieval, with a relatively simple configuration of a single stream.

The method proposed in Yu et al. (2021) is the closest approach to our work in the literature. The authors considered the Transitional Adaptation of a Pretrained Model (TAPM), which showed state-of-the-art performance on a visual storytelling task. The main difference between the proposed approach and that of Yu et al. (2021) is that we focus on image-narrative generation, whereas Yu et al. (2021) applied their method to visual storytelling. The task of visual storytelling was proposed by Huang et al. (2016b), along with an associated dataset called VIST. They explored visual storytelling as a sequential vision-to-language task, whereas image narrative generation is a single vision-to-language task. VIST has a one-to-one alignment between images and text. In other words, visual storytelling generates a single sentence per image, while considering the preceding and subsequent images as well. In the image narrative generation task, by contrast, an entire narrative is generated from a single image. We believe that image narrative generation is a more creative task, in which a narrative sequence must be imagined from the relatively limited data contained in a single image. Moreover, to the best of our knowledge, the present work is the first to incorporate emotion arcs in image narrative generation to consider the importance of emotions in a story.

2.2.1. Attention Mechanism in Transformers

The Transformer’s attention mechanism builds on earlier work that introduced attention into sequence-to-sequence models for machine translation (Bahdanau et al., 2015; Luong et al., 2015). Although the attention mechanism was introduced into the field of NLP as an auxiliary mechanism to compensate for information that cannot be captured by the recurrent neural network, Vaswani et al. (2017) proposed the Transformer as a model with attention as the main mechanism.

The Transformer’s attention mechanism maps a query and key-value pairs to an output. Their particular “Scaled Dot-Product Attention” can be described as follows.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

$Q$, $K$, and $V$ stand for query, key, and value, respectively. $d_k$ is the dimension of $K$, and the scaling factor $1/\sqrt{d_k}$ was introduced to prevent the dot products from growing large in magnitude, which would cause extremely small gradients in the softmax function.
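As a minimal sketch of the scaled dot-product attention of Vaswani et al. (2017), the following pure-Python function computes softmax(QK^T / sqrt(d_k)) V for matrices given as nested lists. The function names are chosen here for illustration; a practical implementation would use a tensor library and batched, multi-headed inputs.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (nested lists)."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)  # scaling factor 1/sqrt(d_k)
    d_v = len(V[0])
    output = []
    for q in Q:
        # Scaled dot product between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        )
    return output


# A zero query attends uniformly, so the output is the mean of the values.
example = scaled_dot_product_attention(
    [[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
)
```

In self-attention, Q, K, and V are all derived from the same input sequence; in the decoder's cross-attention, Q comes from the decoder and K and V from the encoder output.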

The encoder of the vanilla Transformer contains self-attention layers. In a self-attention layer, the queries and key-value pairs come from the same place. The relationship, or attention, between each element of the sequence given as input and all other elements is calculated. In the decoder, each layer uses the output of the previous layer as queries, and the key-value pairs come from the output of the encoder.

UNITER is a model that uses only the encoder part of the Transformer; it embeds the input image and text, and then processes them with the self-attention mechanism.

Our method, ViNTER, is an encoder-decoder model like BART, and its encoder is a modified version based on UNITER.

3. Proposed Method

We propose ViNTER (Visual Narrative Transformer with Emotion arc Representation) as a method to integrate time-series data quantifying changes in emotion to perform image narrative generation. The overview of the model is shown in Figure 2. To encode an image and the desired emotions simultaneously, we developed ViNTER’s encoder based on UNITER (Chen et al., 2020), a well-known multimodal pre-trained Transformer model. UNITER is a vision-and-language pre-trained model that can encode both visual and textual features with a single Transformer encoder. After the image and emotion features are encoded, the Transformer decoder is trained to generate the corresponding narratives. We describe the encoder and decoder in detail in the following sections.

Figure 3. Visualization of the frequency of each emotion that appears in each segment of the emotion arc in the Image Narrative dataset. From the center to the outside of the graph, the number of emotions contained in each segment (beginning, body, and end) is shown.

3.1. Emotion Arc Labeling

The proposed approach adopts emotion arcs as an emotional representation of the input of the model. Emotion arc represents the sequence of emotions expressed over the course of a story. In this study, we consider a story that consists of five sentences. We then split the story into three segments, including a beginning, a body, and an ending, as in (Brahman and Chaturvedi, 2020). We consider the sequences of emotions that match each segment as an emotion arc. For example, if the story begins with a joyful sentence, the body segment expresses fear, and the end segment expresses surprise, then the emotion arc is represented as “joy fear surprise.”

To label sentences with emotion, we use an emotion classifier fine-tuned with GoEmotions. GoEmotions is a corpus created to fine-tune emotion prediction models. Referring to (Brahman and Chaturvedi, 2020), we rely on Ekman’s theory of basic emotions and use 6 + 1 categories (anger, disgust, fear, joy, sadness, surprise, and neutral). We split the narratives into beginning, body, and end segments, and applied the emotion classifier to obtain the emotions corresponding to each segment. A visualization of the number of each emotion that appeared for each segment of the emotion arc is shown in Figure 3.
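The labeling procedure can be sketched as follows, using the segmentation of Brahman and Chaturvedi (2020) described in Section 2.1 (first sentence, second to fourth sentences, last sentence). The `classify` argument stands in for the fine-tuned emotion classifier; `toy_classifier` is a hypothetical keyword-based placeholder used only to make the example runnable, and applying the classifier to the joined text of each segment is an assumption rather than a detail stated in the text.

```python
EMOTIONS = ("anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral")


def split_into_segments(sentences):
    """Split a five-sentence story into beginning, body, and ending."""
    if len(sentences) != 5:
        raise ValueError("expected a five-sentence narrative")
    return [sentences[0]], sentences[1:4], [sentences[4]]


def emotion_arc(sentences, classify):
    """Return the (begin, body, end) emotion labels for a narrative."""
    arc = []
    for segment in split_into_segments(sentences):
        label = classify(" ".join(segment))  # assumption: classify joined text
        if label not in EMOTIONS:
            raise ValueError(f"unexpected label: {label}")
        arc.append(label)
    return tuple(arc)


def toy_classifier(text):
    """Hypothetical keyword stand-in for the real classifier."""
    keywords = {"happy": "joy", "scared": "fear", "wow": "surprise"}
    lowered = text.lower()
    for word, label in keywords.items():
        if word in lowered:
            return label
    return "neutral"


story = [
    "The dog is happy.",
    "It hears thunder.",
    "It is scared.",
    "It hides under the bed.",
    "Wow, a rainbow appears.",
]
arc = emotion_arc(story, toy_classifier)
```

For this toy story the resulting arc corresponds to the “joy fear surprise” example given above.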

3.2. Encoder

3.2.1. Visual Embeddings

One of the inputs for the ViNTER encoder is the image feature, a matrix with one row per object region proposed by the object detector and one column per image-feature dimension. We use Faster R-CNN (Ren et al., 2017) as the object detection model, which is pre-trained with the Visual Genome dataset (Krishna et al., 2017; Anderson et al., 2018). To explicitly provide positional information on the image region, we add positional embeddings based on the region coordinates, as used in UNITER. Specifically, we use the region coordinates represented as a 7-dimensional vector consisting of the normalized top/left/bottom/right coordinates, width, height, and area (width multiplied by height).
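As an illustration, the 7-dimensional position vector could be computed from a bounding box as in the sketch below. The function name, the pixel-coordinate box format `(x1, y1, x2, y2)`, and normalization by image width and height are assumptions made for this example.

```python
def region_position_vector(box, image_width, image_height):
    """Return [top, left, bottom, right, width, height, area], all normalized.

    box is (x1, y1, x2, y2) in pixels, with (x1, y1) the top-left corner.
    """
    x1, y1, x2, y2 = box
    left = x1 / image_width
    top = y1 / image_height
    right = x2 / image_width
    bottom = y2 / image_height
    width = right - left
    height = bottom - top
    area = width * height  # normalized width multiplied by normalized height
    return [top, left, bottom, right, width, height, area]


# A 50 x 100 pixel box inside a 100 x 200 pixel image.
vector = region_position_vector((10, 20, 60, 120), 100, 200)
```

Because every component is normalized, the vector is independent of the image resolution.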

3.2.2. Emotion Embeddings

To obtain the emotion label of each segment, we use the BERT model fine-tuned with GoEmotions. As a taxonomy of emotions, we adopt seven emotions (anger, disgust, fear, joy, sadness, surprise + neutral) based on Ekman’s six basic emotions.

To input the emotion arc into the encoder, the proposed approach uses a special [EMOTION SEP] token to concatenate the words denoting each emotion (e.g., joy, [EMOTION SEP], fear, [EMOTION SEP], surprise). We used the learned word embeddings from BERT (Devlin et al., 2019) to convert the emotion-word tokens into emotion vectors. We add positional embeddings, similar to UNITER, to indicate which word corresponds to which part of the story.
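A minimal sketch of how the emotion arc could be turned into the token sequence described above is given below. The helper name is hypothetical, and pairing each token with its sequence index as a position id is an assumption; the embedding lookup itself is omitted.

```python
EMOTION_SEP = "[EMOTION SEP]"


def emotion_arc_tokens(arc):
    """Interleave the emotion words of an arc with the separator token.

    Returns the token list and one position id per token.
    """
    tokens = []
    for index, emotion in enumerate(arc):
        if index > 0:
            tokens.append(EMOTION_SEP)
        tokens.append(emotion)
    position_ids = list(range(len(tokens)))  # assumption: plain sequence index
    return tokens, position_ids


tokens, position_ids = emotion_arc_tokens(("joy", "fear", "surprise"))
```

For a three-segment arc this yields five tokens, which are then embedded and concatenated with the visual embeddings before entering the encoder.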

3.2.3. Encoder Architecture

Having obtained both visual and emotional embeddings, we concatenate them and input them into the Transformer encoder. We adopt UNITER, as a single-stream Transformer, to encode both visual and emotional embeddings. Each Transformer block consists of a stack of self-attention layers.

3.3. Decoder

The decoder of ViNTER follows the decoder architecture of BART (Lewis et al., 2020). Our decoder is composed of a stack of Transformer blocks. Each block includes multi-headed self-attention and cross-attention layers, and two fully-connected layers. To avoid seeing the information of a future word, we apply a causal mask to the input of the self-attention layer. Finally, we employ a fully-connected layer to compute the probability distribution over the entire vocabulary. We trained our model parameters $\theta$ by minimizing the negative log-likelihood over the probability distribution of the ground-truth narrative $y = (y_1, \dots, y_T)$, given the image features $v$ and the emotion arc $e$, as follows.

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}(y_t \mid y_{<t}, v, e)$$
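The two ingredients of decoder training described here, the causal mask and the negative log-likelihood objective, can be sketched in a few lines. This is an illustration under simplifying assumptions: `step_probs` is a hypothetical list holding, for each time step, the decoder's predicted distribution over a small vocabulary, and batching is ignored.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def negative_log_likelihood(step_probs, target_ids):
    """Sum of -log p(y_t | y_<t, ...) over the ground-truth tokens."""
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))


mask = causal_mask(3)

# Two decoding steps over a two-word vocabulary; the targets are ids 0 then 1.
loss = negative_log_likelihood([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

The loss is zero only when the model assigns probability 1 to every ground-truth token, and it grows as the assigned probabilities shrink.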


4. Experiment

4.1. Dataset

We used the Image Narrative dataset (Shin et al., 2018) to conduct our experiments to validate the performance of the proposed approach. This dataset contains 11,001 images drawn from MS COCO (Lin et al., 2014) and 12,245 image narratives annotated through crowdsourcing. Each narrative consists of five sentences. The crowd workers were instructed to not only accurately describe the contents of the images, but also to include local elements and sentiments that could be inferred or imagined from the images. Because the test set of the Image Narrative dataset is not publicly available, we further split the training data of the original dataset into training and evaluation data, taking care to avoid overlapping images. Consequently, we obtained a dataset with 5,398 narratives allocated to training and 1,813 to evaluation.
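One way to guarantee that no image appears in both splits is to partition by image identifier rather than by narrative, as in the sketch below. The record layout (dictionaries with an `image_id` key), the evaluation fraction, and the random seed are assumptions for illustration and do not reproduce the actual split used in the experiments.

```python
import random


def split_by_image(narratives, eval_fraction, seed=0):
    """Split narratives so that all narratives of an image share one split."""
    image_ids = sorted({n["image_id"] for n in narratives})
    rng = random.Random(seed)
    rng.shuffle(image_ids)
    n_eval = int(len(image_ids) * eval_fraction)
    eval_ids = set(image_ids[:n_eval])
    train = [n for n in narratives if n["image_id"] not in eval_ids]
    evaluation = [n for n in narratives if n["image_id"] in eval_ids]
    return train, evaluation


# Toy data: 25 narratives spread over 10 images.
toy = [{"image_id": i % 10, "text": f"narrative {i}"} for i in range(25)]
train_set, eval_set = split_by_image(toy, eval_fraction=0.3)
```

Splitting at the image level matters here because some images have more than one narrative.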

The narratives in this dataset contain noisy sentences; for example, some do not start with a capital letter or have extra or missing spaces around the period. Therefore, we preprocess the sentences by removing or adding spaces around periods as needed, capitalizing the beginning of each sentence, and ensuring that the remainder is written in lowercase.
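A rough sketch of such a cleaning step is shown below. It is an assumption-laden illustration, not the actual preprocessing script: it treats every period as a sentence boundary and lowercases everything after the first letter of each sentence, so abbreviations, decimals, and proper nouns would need extra handling.

```python
import re


def clean_narrative(text):
    """Normalize spacing around periods and the casing of each sentence."""
    # No space before a period, exactly one space after it.
    text = re.sub(r"\s*\.\s*", ". ", text.strip())
    sentences = [part.strip() for part in text.split(".") if part.strip()]
    # Capitalize the first character and lowercase the remainder.
    cleaned = [s[0].upper() + s[1:].lower() for s in sentences]
    if not cleaned:
        return ""
    return ". ".join(cleaned) + "."


cleaned_example = clean_narrative("a dog runs .he is Happy.  the end")
```

Note that this sketch also appends a final period when the input lacks one.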

4.2. Implementation Details

We set the image-feature dimension to 2,048 and the number of object regions to 36. The number of hidden units of the Transformer blocks in the encoder and the decoder was set to 768. The number of Transformer blocks in the encoder and decoder was set to 12. We used the AdamW optimizer (Loshchilov and Hutter, 2019) and applied a cosine annealing schedule with 1,000 warmup steps. We trained the model for 5,000 steps with a batch size of 32. The training took about two hours using a single Tesla A-100 GPU.
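The learning-rate schedule can be sketched as a function of the training step. The warmup length (1,000) and total number of steps (5,000) follow the text; the linear shape of the warmup, decay to zero, and the `base_lr` argument are assumptions, and the value 1.0 used in the example is a placeholder rather than the learning rate used in the experiments.

```python
import math


def learning_rate(step, base_lr, warmup_steps=1000, total_steps=5000):
    """Linear warmup (assumed) followed by cosine annealing to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Placeholder base learning rate of 1.0, so values read as fractions of the peak.
schedule = [learning_rate(s, base_lr=1.0) for s in (0, 500, 1000, 3000, 5000)]
```

With these settings the rate peaks at the end of warmup and reaches half of its peak value midway through the annealing phase.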

4.3. Baselines

We compared our method with the following baselines. DenseCap (Johnson et al., 2016) is a captioning model that generates captions focusing on a specific region. Shin et al. (2018) generates narratives based on the user’s interests through their answers to questions about the image. For the above methods, we list the values as they appear in (Shin et al., 2018), which proposed the Image Narrative dataset. Note that their results are evaluated on the test set of the dataset, which is not publicly available. Hence, their scores are not directly comparable to ours; however, we include them for reference. Image Only has the same model architecture as the proposed model, but uses only image region features as input. ViNTER (begin), ViNTER (body), and ViNTER (end) are variants designed to control emotions in narratives in a simpler manner, using only a single emotion as input: the emotion corresponding to the beginning, body, or end segment, respectively.

4.4. Evaluation Metrics

4.4.1. Automatic

For evaluation by comparison with the ground-truth narratives, we use BLEU (Papineni et al., 2002) and the F1 score of BERTScore (Zhang et al., 2020). BLEU is a widely used metric in image captioning calculated based on n-gram matching between the generated and reference texts. In recent years, research has been conducted on metrics that use text features obtained from a large-scale pre-trained language model. BERTScore, which uses BERT (Devlin et al., 2019) to calculate sentence embeddings, is a typical example of such a metric.

We also evaluated the accuracy of the emotions expressed in the generated narratives. We use Begin-acc, Body-acc, and End-acc as metrics to evaluate whether the emotions estimated for the begin, body, and end segments of the generated narratives, respectively, are consistent with the corresponding parts of the input emotion arc. Seg-acc is the fraction of matches between the correct emotion for each segment of the ground-truth and the estimated emotion for each segment of the generated narrative. Arc-acc similarly measures the agreement of the estimated emotions; however, here we measure the agreement of the entire emotion arc, not each segment.
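These emotion-accuracy metrics can be sketched as follows, given the target arcs and the arcs estimated from the generated narratives. The function name and the representation of arcs as 3-tuples are assumptions; Seg-acc is computed here as the fraction of matching segments over all narratives, which equals the mean of the three per-segment accuracies and is consistent with the values reported in Table 2.

```python
def emotion_accuracies(target_arcs, predicted_arcs):
    """Return Begin-acc, Body-acc, End-acc, Seg-acc, and Arc-acc as fractions."""
    if len(target_arcs) != len(predicted_arcs) or not target_arcs:
        raise ValueError("need two non-empty lists of equal length")
    n = len(target_arcs)
    pairs = list(zip(target_arcs, predicted_arcs))
    # Per-segment accuracy: index 0 = begin, 1 = body, 2 = end.
    per_segment = [sum(t[i] == p[i] for t, p in pairs) / n for i in range(3)]
    seg_acc = sum(per_segment) / 3
    # An arc counts as correct only if all three segments match.
    arc_acc = sum(tuple(t) == tuple(p) for t, p in pairs) / n
    return {
        "begin_acc": per_segment[0],
        "body_acc": per_segment[1],
        "end_acc": per_segment[2],
        "seg_acc": seg_acc,
        "arc_acc": arc_acc,
    }


targets = [("joy", "fear", "surprise"), ("neutral", "sadness", "neutral")]
predictions = [("joy", "fear", "joy"), ("neutral", "joy", "neutral")]
scores = emotion_accuracies(targets, predictions)
```

In this toy case each narrative has exactly one mismatched segment, so Arc-acc is zero even though two thirds of the segments match.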

Models                   BLEU-4   BERTScore
DenseCap (†)             1.90     -
Shin et al. (2018) (†)   1.41     -
Image Only               7.288    88.40
ViNTER (begin)           7.292    88.46
ViNTER (body)            7.702    88.44
ViNTER (end)             7.316    88.46
ViNTER                   7.684    88.50
Table 1. Automatic evaluation results in terms of the quality of the generated narratives. For the models marked with (†), we put the values provided in (Shin et al., 2018). Note that these models are evaluated on the test data of the dataset and cannot be directly compared to the results of the other models. The best values are shown in bold and the second best values are underlined.
Models           Begin-acc   Body-acc   End-acc   Seg-acc   Arc-acc
Image Only       92.06       48.65      55.38     65.36     26.31
ViNTER (begin)   93.16       51.85      53.83     66.28     27.25
ViNTER (body)    91.23       66.46      55.43     71.04     33.65
ViNTER (end)     91.73       52.73      61.89     68.78     30.72
ViNTER           93.05       65.97      60.89     73.30     38.61
Table 2. Automatic evaluation results in terms of emotional accuracy. The best values are shown in bold and the second best values are underlined.

4.4.2. Human Evaluation

Figure 4. Screenshot of the actual evaluation interface presented to AMT workers. The workers are asked to evaluate the quality of the narratives generated by ViNTER and the narratives generated by one of the comparison methods.

We conducted the human evaluation with Amazon Mechanical Turk. For comparison, we used four methods from the aforementioned baselines: Image Only and the three single-emotion variants ViNTER (begin), ViNTER (body), and ViNTER (end). We paired the proposed method with each of the four comparison methods and conducted an evaluation experiment using 100 images. Three workers were asked to evaluate a pair of image narratives generated by our proposed method and one of the comparison methods. In order to compare how the change in emotion arc affects the methods, we also prepared generated narratives in which the target emotion arcs were limited to those that changed in the middle of a narrative, i.e., always included two or more emotions. We distinguish these methods as “w/ change,” i.e., with change. For each of the proposed and comparison methods, we also evaluated the generation of w/ change. Hence, there are a total of 800 narratives and 13 workers for each.

For each question, a worker was shown an image and two story-like texts associated with the image. One text has the “with emotion” setting, and the other has the comparison method setting. These texts were created by referring to the images and considering the flow of emotions. The worker was also presented with the emotion arc that these texts targeted.

Referring to the image and the emotion arc, the worker was required to score each story-like text.

The evaluation criteria are as follows:

  • Does this text relate to the reference image?

  • Does this text match the target emotion arc?

The criteria are designed to evaluate whether the generated text is suitable as an image narrative and matches the required emotion arc. A five-point scale evaluation was used for each criterion: higher is better (5: Excellent, 4: Good, 3: Fair, 2: Poor, 1: Bad). An example screenshot of the work is shown in Figure 4.

5. Results and Discussion

Better / Tie / Worse (%)
Comparison                  Image                   Emotion                 Image (w/ change)       Emotion (w/ change)
ViNTER vs. Image Only       27.38 / 46.38 / 26.23   27.54 / 44.08 / 28.38   22.62 / 54.92 / 22.46   24.62 / 55.38 / 20.00
ViNTER vs. ViNTER (begin)   23.23 / 45.38 / 31.38   27.54 / 45.23 / 27.23   26.08 / 49.92 / 24.00   27.38 / 46.31 / 26.31
ViNTER vs. ViNTER (body)    25.23 / 47.62 / 27.15   27.23 / 45.54 / 27.23   31.69 / 39.85 / 28.46   26.54 / 42.15 / 31.31
ViNTER vs. ViNTER (end)     28.38 / 44.46 / 27.15   27.08 / 47.54 / 25.38   29.15 / 41.46 / 29.38   33.77 / 37.77 / 28.46
Table 3. Human evaluation results of the generated narratives in terms of the relevance to the image and emotion arc. Here, “Better” means the percentage of cases where the proposed method was evaluated as better quality than the baseline, “Worse” means the opposite, and “Tie” means the percentage of cases where the two methods were evaluated as equal.
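The Better / Tie / Worse percentages can be derived from paired five-point ratings as sketched below. How the ratings were actually aggregated is not specified in the text, so the rule used here (comparing the two scores a worker gave to the proposed method and the baseline for the same image and criterion) is an assumption, and the ratings in the example are made up for illustration.

```python
def better_tie_worse(score_pairs):
    """score_pairs: (proposed_score, baseline_score) per worker judgment.

    Returns the percentages of Better, Tie, and Worse judgments.
    """
    if not score_pairs:
        raise ValueError("no judgments given")
    n = len(score_pairs)
    better = sum(a > b for a, b in score_pairs)
    tie = sum(a == b for a, b in score_pairs)
    worse = sum(a < b for a, b in score_pairs)
    return tuple(round(100.0 * count / n, 2) for count in (better, tie, worse))


# Four illustrative judgments on the 1-5 scale (not real data).
percentages = better_tie_worse([(5, 3), (4, 4), (2, 3), (4, 4)])
```

The three percentages always sum to 100 up to rounding, as they do in each cell of Table 3.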

5.1. Automatic Evaluation

The evaluation results of the automatic metrics in terms of the quality of the generated narratives are presented in Table 1.

In BLEU and BERTScore, which are evaluations based on comparisons with ground-truth narratives, ViNTER performs better than, or as well as, the other methods. ViNTER outperforms the existing methods, DenseCap and Shin et al. (2018), by a large margin. This suggests that our method is superior to conventional methods, even allowing for the fact that their scores were obtained on test data that is not publicly available. ViNTER also performs considerably better than the Image Only model. However, when comparing the proposed method to models with a single emotion as input, in particular the variant that takes only the body emotion, the difference in scores becomes smaller, or the comparison method scores slightly better. This may be due to the limitations of metrics based on comparisons with ground-truth data in the evaluation of narrative generation.

In the case where there is a change in the emotion arc, the superiority of ViNTER in terms of the quality of emotion is even more pronounced. This is an expected result, considering that when there is no change in the emotion arc, there is less likely to be a large difference in the generated narratives compared to the other methods. It is difficult to accurately evaluate the performance of models in reference-based metrics, when a variety of outputs are expected for the same input, which is exactly the case for narrative generation.

We also present the results in terms of the accuracy of the emotions of the generated narratives in Table 2. The model with the emotion input for each segment has the best score for the corresponding segment (the begin-only variant for Begin-acc, the body-only variant for Body-acc, and the end-only variant for End-acc). However, models that input the emotion for a single segment have much lower accuracy on the other segments. These results are in contrast to the results of ViNTER with an emotion arc as input, which scored the second-best for all segments. In addition, ViNTER with an emotion arc input performs significantly better than the other methods in evaluating the accuracy of the emotions across the entire generated narrative (Seg-acc, Arc-acc). These results show that our proposed method is capable of controlling emotional changes, not only in a part of the narrative but also in the complete narrative.

5.2. Human Evaluation

The results of the human evaluation are shown in Table 3. First, looking at the case of no change in emotion (left half of the table), ViNTER tends to be rated as having equal or slightly higher quality of emotional control compared to the baselines. In some cases, ViNTER is rated as worse in terms of the match between the image and the content of the generated narrative. However, this is not particularly surprising given the fact that ViNTER is trained to generate narratives that follow not only the content of the image but also the emotion arc. This suggests that ViNTER may have prioritized the emotion arc-based narrative generation over the image content in some cases. This is an expected result, considering that when there is no change in the emotion arc, there is less likely to be a large difference in the generated narratives compared to the other methods. From these results, we can conclude that ViNTER is able to generate narratives that better capture the changes in emotion arc than the baseline method.

Figure 5. Qualitative examples for the same image with two different emotion arc inputs.

5.3. Case Studies

To demonstrate the controllability of the narratives generated by the proposed model, we show examples of the narratives generated from the same image with different emotion arcs in Figure 5. Overall, our ViNTER could generate different narratives for the same image, depending on the input emotion arc. For example, in the top left image, when describing an animal, if “joy” is input as the middle emotion, a positive sentence, such as “He looks happy,” will be generated, whereas if “sadness” is input as the middle emotion, a negative sentence, such as “He seems to be sad,” will be generated. However, in some cases, such as when “joy” is followed by a negative emotion (e.g., “disgust” or “anger”), the output may be inconsistent with the previous sentence.

6. Conclusion

In this study, we have addressed the task of generating narrative sentences from an image using a Transformer-based method to process multimodal vision and language data. Our proposed method has shown promising results on both automatic and human evaluation, and case studies showed that our method could control generation with an emotion arc.

However, some room for improvement remains in terms of the way these emotions are modeled. The emotion arc only considers emotions from one perspective, but multiple emotions are typically associated with a given narrative. Iglesias (2005) insisted on the importance of distinguishing character and reader emotions. Although the handling of emotions in this study was simplistic, we believe that this work is an important step toward utilizing more diverse and difficult-to-handle emotions in story generation.

For further discussion on the importance of “creativity” in image narrative generation for the social implementation of creative AI, we refer to the idea of creative systems. Colton (2008) introduced the notion of the creative tripod, whose three legs represent the three behaviors required of a creative system: skill, appreciation, and imagination. Later, Elgammal et al. (2017) rephrased this notion as the ability to produce novel artifacts (imagination), the ability to generate high-quality artifacts (skill), and the ability of a model to assess its own creations. Among these three goals of creative systems, our proposed method has achieved “imagination” and “skill.” However, the ability to assess creativity has not been established. To achieve this, a system must be developed that can evaluate the image narratives it generates by itself; that is, the system must be able to learn criteria for a good image narrative. As may be noted from the discussions of the results of the human evaluation, the assessment of creativity is not an easy task, even for humans. We expect that the development of methods to automatically evaluate narratives would be of considerable benefit in enabling great strides towards the realization of creative AI.


This work was partially supported by JST AIP Acceleration Research JPMJCR20U3, Moonshot R&D Grant Number JPMJPS2011, JSPS KAKENHI Grant Numbers JP19H01115 and JP20H05556, and the Basic Research Grant (Super AI) of the Institute for AI and Beyond of the University of Tokyo.


  • Ackerman and Puglisi (2012) Angela Ackerman and Becca Puglisi. 2012. The Emotion Thesaurus: A Writer’s Guide to Character Expression. JADD Publishing.
  • Anderson and McMaster (1982) C. W. Anderson and G. E. McMaster. 1982. Computer assisted modeling of affective tone in written documents. Computers and the Humanities 16, 1 (01 Sep 1982), 1–9.
  • Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.).
  • Brahman and Chaturvedi (2020) Faeze Brahman and Snigdha Chaturvedi. 2020. Modeling Protagonist Emotions for Emotion-Aware Storytelling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 5277–5294.
  • Chandu et al. (2019) Khyathi Chandu, Shrimai Prabhumoye, Ruslan Salakhutdinov, and Alan W Black. 2019. “My Way of Telling a Story”: Persona based Grounded Story Generation. In Proceedings of the Second Workshop on Storytelling. Association for Computational Linguistics, Florence, Italy, 11–21.
  • Chen et al. (2020) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text representation learning. In ECCV.
  • Chu and Roy (2017) E. Chu and D. Roy. 2017. Audio-Visual Sentiment Analysis for Learning Emotional Arcs in Movies. In 2017 IEEE International Conference on Data Mining (ICDM). 829–834.
  • Colton (2008) S. Colton. 2008. Creativity Versus the Perception of Creativity in Computational Systems. In AAAI Spring Symposium: Creative Intelligent Systems.
  • Dathathri et al. (2020) Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2020. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. In International Conference on Learning Representations.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186.
  • Ekman (1993) Paul Ekman. 1993. Facial expression and emotion. American Psychologist 48, 4 (1993), 384–392.
  • Elgammal et al. (2017) Ahmed Elgammal, Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone. 2017. CAN: Creative adversarial networks, generating “art” by learning about styles and deviating from style norms. arXiv preprint arXiv:1706.07068 (2017).
  • Heise (1965) David R. Heise. 1965. Semantic differential profiles for 1,000 most frequent English words. Psychological Monographs: General and Applied 79, 8 (1965), 1 – 31.
  • Hogan (2006) Patrick Colm Hogan. 2006. Narrative Universals, Heroic Tragi-Comedy, and Shakespeare’s Political Ambivalence. College Literature 33, 1 (2006), 34–66.
  • Hogan (2010) Patrick Colm Hogan. 2010. A Passion for Plot: Prolegomena to Affective Narratology. symplokē 18, 1-2 (2010), 65–81.
  • Hogan (2019) Patrick Colm Hogan. 2019. Description, Explanation, and the Meanings of “Narrative”. Evolutionary Studies in Imaginative Culture 3 (2019), 45+.
  • Huang et al. (2016a) Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016a. Visual Storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, 1233–1239.
  • Huang et al. (2016b) Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016b. Visual Storytelling. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016).
  • Iglesias (2005) Karl Iglesias. 2005. Writing for emotional impact : advanced dramatic techniques to attract, engage, and fascinate the reader from beginning to end. WingSpan Press.
  • Johnson et al. (2016) Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully Convolutional Localization Networks for Dense Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Johnson-Laird and Oatley (2008) R. N. Johnson-Laird and Keith Oatley. 2008. Emotions, music, and literature. The Guilford Press, New York, NY, US, 102–113.
  • Kim and Klinger (2019a) Evgeny Kim and Roman Klinger. 2019a. An Analysis of Emotion Communication Channels in Fan-Fiction: Towards Emotional Storytelling. In Proceedings of the Second Workshop on Storytelling. Association for Computational Linguistics, Florence, Italy, 56–64.
  • Kim and Klinger (2019b) Evgeny Kim and Roman Klinger. 2019b. A Survey on Sentiment and Emotion Analysis for Computational Literary Studies. Zeitschrift fuer Digitale Geisteswissenschaften 4 (2019).
  • Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123, 1 (2017), 32.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 7871–7880.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European conference on computer vision. Springer, 740–755.
  • Liu et al. (2018) Bei Liu, Jianlong Fu, Makoto P Kato, and Masatoshi Yoshikawa. 2018. Beyond narrative description: Generating poetry from images by multi-adversarial training. In Proceedings of the 26th ACM international conference on Multimedia. 783–791.
  • Liu et al. (2017) Yu Liu, Jianlong Fu, Tao Mei, and Chang Wen Chen. 2017. Let Your Photos Talk: Generating Narrative Paragraph for Photo Stream via Bidirectional Attention Recurrent Neural Networks. AAAI Conference on Artificial Intelligence.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
  • Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc.
  • Lugmayr et al. (2017) Artur Lugmayr, Erkki Sutinen, Jarkko Suhonen, Carolina Islas Sedano, Helmut Hlavacs, and Calkin Suero Montero. 2017. Serious storytelling - a first definition and review. Multimedia Tools and Applications 76 (07 2017), 15707–15733.
  • Luo et al. (2019) Fuli Luo, Damai Dai, Pengcheng Yang, Tianyu Liu, Baobao Chang, Zhifang Sui, and Xu Sun. 2019. Learning to Control the Fine-grained Sentiment for Story Ending Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 6020–6026.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 1412–1421.
  • Pandit and Hogan (2006) Lalita Pandit and Patrick Colm Hogan. 2006. Introduction: morsels and modules: on embodying cognition in Shakespeare’s plays (1). College Literature 33 (2006), 1+.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318.
  • Plutchik (1980) R. Plutchik. 1980. Emotion, a Psychoevolutionary Synthesis. Harper & Row.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018).
  • Reagan et al. (2016) Andrew J. Reagan, Lewis Mitchell, Dilan Kiley, Christopher M. Danforth, and Peter Sheridan Dodds. 2016. The emotional arcs of stories are dominated by six basic shapes. EPJ Data Science 5, 1 (04 Nov 2016), 31.
  • Ren et al. (2017) S. Ren, K. He, R. Girshick, and J. Sun. 2017. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 6 (2017), 1137–1149.
  • Russell (1980) James A. Russell. 1980. A circumplex model of affect. Journal of personality and social psychology 39 (1980), 1161–1178.
  • Shin et al. (2018) Andrew Shin, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Customized Image Narrative Generation via Interactive Visual Question Generation and Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Somasundaran et al. (2020) Swapna Somasundaran, Xianyang Chen, and Michael Flor. 2020. Emotion Arcs of Student Narratives. In Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events. Association for Computational Linguistics, Online, 97–107.
  • Su et al. (2020) Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In Eighth International Conference on Learning Representations (ICLR).
  • Tan and Bansal (2019) Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 5100–5111.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc., 5998–6008.
  • Vecchio et al. (2020) Marco Del Vecchio, Alexander Kharlamov, Glenn Parry, and Ganna Pogrebna. 2020. Improving productivity in Hollywood with data science: Using emotional arcs of movies to drive product and service innovation in entertainment industries. Journal of the Operational Research Society 0, 0 (2020), 1–28.
  • Vonnegut (1995) Kurt Vonnegut. 1995. Kurt Vonnegut on the Shapes of Stories. Video. Accessed: September 3, 2021.
  • Xu et al. (2020) Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Anima Anandkumar, and Bryan Catanzaro. 2020. MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Punta Cana, Dominican Republic. arXiv:2010.00840
  • Yu et al. (2021) Youngjae Yu, Jiwan Chung, Heeseung Yun, Jongseok Kim, and Gunhee Kim. 2021. Transitional Adaptation of Pretrained Models for Visual Storytelling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12658–12668.
  • Zhang et al. (2020) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.