Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

11/24/2019 ∙ by Shizhe Chen, et al. ∙ Microsoft

A storyboard is a sequence of images to illustrate a story containing multiple sentences, which has been a key process to create different story products. In this paper, we tackle a new multimedia task of automatic storyboard creation to facilitate this process and inspire human artists. Inspired by the fact that our understanding of languages is based on our past experience, we propose a novel inspire-and-create framework with a story-to-image retriever that selects relevant cinematic images for inspiration and a storyboard creator that further refines and renders images to improve the relevancy and visual consistency. The proposed retriever dynamically employs contextual information in the story with hierarchical attentions and applies dense visual-semantic matching to accurately retrieve and ground images. The creator then employs three rendering steps to increase the flexibility of retrieved images, which include erasing irrelevant regions, unifying styles of images and substituting consistent characters. We carry out extensive experiments on both in-domain and out-of-domain visual story datasets. The proposed model achieves better quantitative performance than the state-of-the-art baselines for storyboard creation. Qualitative visualizations and user studies further verify that our approach can create high-quality storyboards even for stories in the wild.




1. Introduction

A storyboard is a sequence of images that visualizes a story with multiple sentences and vividly conveys the story content shot by shot. The storyboarding process is one of the most important stages in creating story products such as movies and animations. It not only simplifies the understanding of textual stories with visual aids, but also makes the following steps in story production go more smoothly by planning key images in advance. In this research, we explore how to automatically create a storyboard given a story.

Storyboard creation is nevertheless a challenging task even for professional artists, and requires taking many factors into consideration. Firstly, images in professional storyboards are supposed to be cinematic in terms of framing, structure, view and so on. Secondly, each visualized image should contain sufficient relevant details to convey the story, such as scenes, characters and actions. Last but not least, the storyboard should look visually consistent, with coherent styles and characters across all images.

Figure 1. The proposed inspire-and-create framework for storyboard creation: a story-to-image retriever is firstly utilized to retrieve relevant image sequences from cinematic image set for inspiration; then a storyboard creator simulates the retrieved images to create more flexible story visualizations.

There are mainly two directions in the literature that can be adopted to address the automatic storyboard creation task, namely generation-based and retrieval-based methods. The generation-based method directly generates images conditioned on texts (reed2016generative, ; zhang2017stackgan, ; xu2018attngan, ; pan2017create, ; li2018storygan, ) with generative adversarial networks (GANs), which gives it the flexibility to generate novel images. However, it struggles to generate high-quality, diverse and relevant images due to the well-known training difficulties (goodfellow2014generative, ; salimans2016improved, ; pan2017create, ; li2018storygan, ). The retrieval-based method sidesteps the difficulty of image generation by retrieving existing high-quality images with texts (kiros2014unifying, ; faghri2018vse++, ; nam2017dual, ; karpathy2015deep, ; ravi2018show, ; sun2019videobert, ). Nevertheless, current retrieval-based methods have three main limitations for storyboard creation. Firstly, most previous endeavors utilize a single sentence to retrieve images without considering context. However, the story context plays an important role in understanding each constituent sentence and maintaining semantic coherence of retrieved images. Secondly, the retrieval-based method lacks flexibility since it cannot guarantee that an existing image is precisely relevant to a novel input story. Thirdly, retrieved images can come from different sources, which makes the whole storyboard sequence visually inconsistent in styles and characters.

To overcome limitations of both generation- and retrieval-based methods, we propose a novel inspire-and-create framework for automatic storyboard creation. Motivated by the embodied simulation hypothesis (lakoff2012explaining, ) in psychology which considers human’s understanding of language as the simulation of past experience about vision, sound etc., the proposed framework first takes inspirations from existing well-designed visual images via a story-to-image retriever, and then simulates the acquired images to create flexible story visualizations by a novel storyboard creator. Figure 1 illustrates the overall structure of the inspire-and-create framework. To be specific, we propose a contextual-aware dense visual-semantic matching model as the story-to-image retriever, which has three special characteristics for storyboard inspiration compared with previous retrieval works (kiros2014unifying, ; faghri2018vse++, ; nam2017dual, ; karpathy2015deep, ; ravi2018show, ; sun2019videobert, ): 1) dynamically encode sentences with relevant story contexts; 2) ground each word on image regions to enable future rendering; 3) visualize one sentence with multiple images if necessary. For the storyboard creator, we propose three steps for image rendering, including relevant region segmentation to erase irrelevant parts in the original retrieved image, style unification and 3D character substitution to improve the visual consistency on style and characters respectively. Extensive experimental results on in-domain and out-of-domain datasets demonstrate the effectiveness of the proposed inspire-and-create model. Our approach achieves better performance on both objective metrics and subjective human evaluations than the state-of-the-art retrieval based methods for storyboard creation.

The main contributions of this paper are as follows:

  • We propose a novel inspire-and-create framework for the challenging storyboard creation task. To the best of our knowledge, this is the first work focusing on automatic storyboard creation for stories in the wild.

  • We propose a contextual-aware dense visual-semantic matching model as story-to-image retriever for inspiration, which not only achieves accurate retrieval but also enables one sentence visualized with multiple complementary images.

  • The proposed storyboard creator consists of three rendering steps to simulate the retrieved images, which overcomes the inflexibility in retrieval based models and improves relevancy and visual consistency of generated storyboards.

  • Both objective and subjective evaluations show that our approach obtains significant improvements over previous methods. Our proposed model can create cinematic, relevant and consistent storyboards even for out-of-domain stories.

2. Related Work

Previous works on text visualization can be broadly divided into two types: generation-based and retrieval-based methods.

2.1. Generation-based Methods

Generation-based methods (goodfellow2014generative, ) have the flexibility to generate novel outputs, which have been exploited in different tasks such as text generation (liu2018beyond, ; li2019emotion, ), image generation (ma2018gan, ) and so on. Recent works also explore generating images conditioned on input texts (reed2016generative, ; zhang2018photographic, ; pan2017create, ; li2018storygan, ). Most of them focus on single sentence to single image generation (reed2016generative, ; zhang2017stackgan, ; xu2018attngan, ). Reed et al. (reed2016generative, ) propose to use conditional GAN with adversarial training of a generator and a discriminator to improve text-to-image generation ability. Zhang et al. (zhang2017stackgan, ) propose Stacked GAN to generate larger size images via a sketch-refinement process in two stages. Xu et al. (xu2018attngan, ) employ an attention mechanism to attend to relevant words when synthesizing different regions of the image. Recently, some efforts have been made to generate image sequences given texts. Pan et al. (pan2017create, ) utilize GAN to create a short video based on a single sentence, which improves motion smoothness of consecutive frames. Storyboard creation, however, is different from short video generation: it emphasizes semantic coherency rather than low-level motion smoothness, since a story can contain different scene changes. To tackle this challenge, Li et al. (li2018storygan, ) propose StoryGAN for story visualization, which employs a context encoder to track the story flow and two discriminators at the story and image level to enhance quality and consistency of generated images. However, due to the well-known difficulties of training generative models (goodfellow2014generative, ; salimans2016improved, ), these works are limited to specific domains such as birds (zhang2017stackgan, ), flowers (xu2018attngan, ), numbers (pan2017create, ) and cartoon characters (li2018storygan, ), where the structures are much simpler, and the quality of generated images is usually unstable. Therefore, it is hard to directly apply generative models in complex scenarios such as storyboard creation for stories in the wild.

2.2. Retrieval-based Methods

Retrieval-based methods guarantee high-quality images but lack flexibility. Most text-to-image retrieval works focus on the matching of a single sentence and a single image, which can be classified into global (frome2013devise, ; kiros2014unifying, ; wang2016learning, ; faghri2018vse++, ; huang2018learning, ) and dense visual semantic matching models (nam2017dual, ; karpathy2014deep, ; karpathy2015deep, ; lee2018stacked, ). The former employs a fixed-dimensional vector as the global visual and textual representation. Kiros et al. (kiros2014unifying, ) firstly propose to use CNN to encode images and RNN to encode sentences. Faghri et al. (faghri2018vse++, ) propose to mine hard negatives for training. Sun et al. (sun2019videobert, ) utilize self-supervision from massive instructional videos to learn global visual semantic matching. Their main limitation is that global vectors can hardly capture fine-grained information. The dense matching models address this problem by representing the image and the sentence as sets of fine-grained components. Nam et al. (nam2017dual, ) sequentially generate image and sentence features, while Karpathy et al. (karpathy2015deep, ) propose to decompose the image into a set of regions and the sentence into a set of words. The alignment of words and image regions not only improves retrieval accuracy, but also makes results more interpretable. However, few retrieval works have explored retrieving image sequences given a story with multiple sentences. Kim et al. (kim2015ranking, ) deal with long paragraphs but only require short visual summaries corresponding to a common topic. Ravi et al. (ravi2018show, ) is the closest work to ours in visualizing a story with image sequences. They propose a coherency model that enhances the single sentence representation with a global hand-crafted coherence vector, and apply global matching to retrieve an image for each sentence on the visual storytelling dataset VIST (huang2016visual, ). In this work, we not only improve the story-to-image retrieval model via dynamic contextual learning and more interpretable dense visual-semantic matching, but also propose an inspire-and-create framework (weston2018retrieve, ; hashimoto2018retrieve, ) to improve the flexibility of retrieval-based methods.

3. Overview of Storyboard Creation

In this section, we firstly introduce the storyboard creation problem in Section 3.1, and then describe overall structure of the proposed inspire-and-create framework in Section 3.2. Finally, we present our efforts for cinematic image collection in Section 3.3 which is the foundation to support the inspire-and-create model.

3.1. Problem Definition

Assume $S = (s_1, \dots, s_N)$ is a story consisting of $N$ sentences, where each sentence $s_i = (w_{i,1}, \dots, w_{i,n_i})$ is a word sequence. The goal of storyboard creation is to generate a sequence of images $I = (I_1, \dots, I_M)$ to visualize the story $S$. The number of images $M$ does not necessarily have to be equal to $N$.

There are two types of training data for the task, called description in isolation (DII) and story in sequence (SIS) respectively. The DII data only contain pairs of a single sentence and a single image for training, while the SIS data contain pairs of a story and an image sequence for training. However, due to annotation limitations, the number of images is equal to the number of sentences in SIS training pairs, which is not always desired in the testing phase.

3.2. Inspire-and-Create Framework

According to the embodied simulation hypothesis (lakoff2012explaining, ), our understanding of language is a simulation based on past experience. Mimicking such human storyboard creation process, we propose an inspire-and-create framework as presented in Figure 1. It consists of a story-to-image retriever to retrieve existing cinematic images for visual inspiration and a storyboard creator to avoid the inflexibility of using original images via recreating novel storyboard based on the acquired illuminating images.

Specifically, the retriever first selects a sequence of relevant images from the existing candidate image set. These images are of high quality and cover many details of the story, and they serve as inspiration for the subsequent creator. Since the candidate images are not specially designed to describe the story, some regions of an image may be relevant to the story while other regions are irrelevant and should not be presented. Moreover, images that are highly relevant to the story might not be visually coherent in styles or characters, which can greatly harm human perception of the generated storyboard. Therefore, the second module, the storyboard creator, renders the retrieved images in order to improve visual-semantic relevancy and visual consistency.

3.3. Cinematic Image Collection

In order to provide high-quality candidate images for retrieval, we collect a large-scale cinematic image dataset called GraphMovie, which is crawled from a popular movie plot explanatory website. On the website, human annotators sequentially explain events in a movie with a sequence of sentences. Each sentence is aligned with one image extracted from the corresponding period of the event in the movie. However, the semantic content of the sentence and the image might not be strongly correlated visually, because the sentence is an abstract description of the event rather than a description of the visual content in the image. We crawled 8,405 movie explanatory articles from the website, which include 3,089,271 aligned sentence and image pairs. The whole cinematic image set is our retrieval candidate set. However, it is very inefficient to apply detailed and accurate story-to-image retrieval to such a large number of candidate images. In order to speed up retrieval, we utilize the aligned sentence to build an initial index for each image. Given an input sentence query, we first use the whole query or keywords extracted from the query to retrieve the top 100 images via text-text similarity based on this index, which dramatically reduces the number of candidate images for each sentence. Then the story-to-image retrieval model is applied to the top 100 images ranked by the text-based retrieval.
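The two-stage candidate pruning described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the text-text similarity, so a token-overlap (Jaccard) score is used as a stand-in, and `rerank_score` is a hypothetical placeholder for the story-to-image retrieval model.

```python
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for the unspecified text-text similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def two_stage_retrieve(
    query: str,
    index: Dict[str, str],                      # image id -> aligned sentence
    rerank_score: Callable[[str, str], float],  # hypothetical visual retrieval model
    top_n: int = 100,
) -> List[str]:
    # Stage 1: cheap text-text retrieval over the aligned sentences.
    shortlist = sorted(
        index, key=lambda img_id: jaccard(query, index[img_id]), reverse=True
    )[:top_n]
    # Stage 2: the expensive model only scores the shortlisted images.
    return sorted(
        shortlist, key=lambda img_id: rerank_score(query, img_id), reverse=True
    )
```

With `top_n=100` the expensive model scores at most 100 images per sentence instead of the full candidate set.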

4. Story-to-Image Retriever

There are mainly three challenges to retrieve a sequence of images to visualize a story containing a sequence of sentences. Firstly, sentences in a story are not isolated. The contextual information from other sentences is meaningful to understand a single sentence. For example, to visualize the following story “Mom decided to take her daughter to the carnival. They rode a lot of rides. It was a great day!”, contexts from the first sentence are required to understand the pronoun “they” in the second sentence as “mom and daughter”. And overall contexts are beneficial to visualize the third sentence that almost omits visual contents. Secondly, grounding words in image regions is preferred. The visual grounding not only can improve the retrieval performance via attending to relevant image regions for each word, but also enables future image rendering to erase irrelevant image regions for the story. Finally, the mapping from one sentence to multiple images can be necessary. Due to constraints of candidate images, sometimes it is hard to employ only one image to represent a detailed sentence, thus retrieving complementary multiple images is necessary.

In order to address the above challenges, we propose a Contextual-Aware Dense Matching model (CADM) as the story-to-image retriever. The contextual-aware story encoding is proposed in subsection 4.1 to dynamically employ contexts to understand each word in the story. In subsection 4.2, we describe the training and inference of dense matching which implicitly learns visual grounding. Further in subsection 4.3, we propose a decoding algorithm to retrieve multiple images for one sentence if necessary.

4.1. Contextual-Aware Story Encoding

Figure 2. The contextual-aware story encoding module. Each word is dynamically enhanced with relevant contexts in the story via a hierarchical attention mechanism.

The contextual-aware story encoding dynamically equips each word with the necessary contexts within and across sentences in the story. As shown in Figure 2, it contains four encoding layers and a hierarchical attention mechanism.

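The first layer is the word encoding layer, which converts the one-hot word $w_{i,j}$ (the $j$-th word of the $i$-th sentence) into a distributional vector via the word embedding matrix $W_e$. Then we utilize a bidirectional LSTM (hochreiter1997long, ) as the second layer, the sentence encoding layer, to capture contextual information within a single sentence for word $w_{i,j}$ as follows:

$$
\begin{aligned}
\overrightarrow{h}_{i,j} &= \mathrm{LSTM}(W_e w_{i,j}, \overrightarrow{h}_{i,j-1}; \overrightarrow{\theta}), \\
\overleftarrow{h}_{i,j} &= \mathrm{LSTM}(W_e w_{i,j}, \overleftarrow{h}_{i,j+1}; \overleftarrow{\theta}), \\
x^1_{i,j} &= [\overrightarrow{h}_{i,j}; \overleftarrow{h}_{i,j}]
\end{aligned}
\tag{1}
$$

where $[\cdot;\cdot]$ is the vector concatenation, $\overrightarrow{\theta}, \overleftarrow{\theta}$ are parameters to learn, and $x^1_{i,j}$ is the word representation equipped with contexts in the sentence for word $w_{i,j}$. We can also obtain the representation $x^1_i$ for each single sentence by averaging its $n_i$ constituent word representations:

$$
x^1_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x^1_{i,j}
\tag{2}
$$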
Since the cross-sentence context for each word varies and the contribution of such context for understanding each word is also different, we propose a hierarchical attention mechanism to capture cross-sentence context. The first attention dynamically selects relevant cross-sentence contexts for each word as follows:

$$
a_{i,j,k} = \sigma(W_q x^1_{i,j})^{\top} \sigma(W_k x^1_k), \qquad
\alpha_{i,j,k} = \frac{\exp(a_{i,j,k})}{\sum_{k'=1}^{N} \exp(a_{i,j,k'})}, \qquad
c_{i,j} = \sum_{k=1}^{N} \alpha_{i,j,k}\, x^1_k
\tag{3}
$$

where $\sigma$ is the nonlinear ReLU function and $W_q, W_k$ are parameters. Given the word representation $x^1_{i,j}$ from the second layer and its cross-sentence context $c_{i,j}$, the second attention adaptively weights the importance of the cross-sentence context for each word:

$$
g_{i,j} = \delta(W_g [x^1_{i,j}; c_{i,j}]), \qquad
x^2_{i,j} = g_{i,j} \odot x^1_{i,j} + (1 - g_{i,j}) \odot c_{i,j}
\tag{4}
$$

where $\delta$ is the sigmoid function and $W_g$ are parameters. Therefore, $x^2_{i,j}$ is the word representation equipped with relevant cross-sentence contexts. To further distribute the updated word representation within a single sentence, we utilize a bidirectional LSTM similar to Eq (1) in the third layer, called story encoding, which generates the contextual representation $x^3_{i,j}$ for each word. Finally, in the last layer, we convert $x^3_{i,j}$ into the joint visual-semantic embedding space via a linear transformation:

$$
x_{i,j} = W_x x^3_{i,j} + b_x
\tag{5}
$$

where $W_x, b_x$ are parameters for the linear mapping. In this way, $x_{i,j}$ is encoded with both single-sentence and cross-sentence context to retrieve images.
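The two attention steps described above (select cross-sentence context for a word, then gate how much of it to mix in) can be sketched in plain Python. This is a simplified illustration, not the model's actual parameterization: it assumes a parameter-free dot-product score instead of learned ReLU projections, and a single scalar gate. The names `cross_sentence_context` and `gated_fusion` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_sentence_context(word: Vector, sentences: Sequence[Vector]) -> Vector:
    """First attention: weight sentence vectors by their relevance to the word."""
    weights = softmax([dot(word, s) for s in sentences])
    return [
        sum(w * s[d] for w, s in zip(weights, sentences)) for d in range(len(word))
    ]


def gated_fusion(
    word: Vector, context: Vector, gate_weights: Vector, gate_bias: float = 0.0
) -> Vector:
    """Second attention: a sigmoid gate decides how much context each word takes."""
    gate = 1.0 / (1.0 + math.exp(-(dot(gate_weights, word + context) + gate_bias)))
    return [gate * w + (1.0 - gate) * c for w, c in zip(word, context)]
```

A gate close to 1 keeps the sentence-level word representation almost unchanged, while a gate close to 0 replaces it with the cross-sentence context, which is the behavior one would want for a pronoun such as "they".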

4.2. Dense Visual-Semantic Matching

In subsection 4.1, we represent sentence $s_i$ as a set of fine-grained word representations $\{x_{i,1}, \dots, x_{i,n_i}\}$. Similarly, we can represent image $I_j$ as a set of fine-grained region representations $\{r_{j,1}, \dots, r_{j,m_j}\}$ in the common visual-semantic space. In this work, the image regions are detected via a bottom-up attention network (anderson2018bottom, ) pretrained on the VisualGenome dataset (krishna2017visual, ), so that each region represents an object, a relation between objects or a scene.

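Based on the dense representations of sentence $s_i$ and image $I_j$ and the similarity $f(x_{i,t}, r_{j,k})$ of each fine-grained cross-modal pair, we apply dense matching to compute the global sentence-image similarity $F(s_i, I_j)$ as follows:

$$
F(s_i, I_j) = \frac{1}{n_i} \sum_{t=1}^{n_i} \max_{1 \le k \le m_j} f(x_{i,t}, r_{j,k})
\tag{6}
$$

where $x_{i,t}$ is the $t$-th of the $n_i$ word representations of $s_i$, $r_{j,k}$ is the $k$-th of the $m_j$ region representations of $I_j$, and $f(\cdot,\cdot)$ is implemented as the cosine similarity in this work. The dense matching firstly grounds each word to the most similar image region and then averages the resulting word-region similarities over all words to obtain the global similarity.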
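The dense matching rule described above (ground each word to its most similar region, then average over words) can be sketched in a few lines. This is a minimal illustration assuming word and region embeddings are already available as plain lists of floats; the function name `dense_match` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def dense_match(
    words: Sequence[Sequence[float]], regions: Sequence[Sequence[float]]
) -> Tuple[float, List[int]]:
    """Return the sentence-image similarity and the region index grounded to each word."""
    if not words or not regions:
        raise ValueError("need at least one word and one region")
    grounding: List[int] = []
    total = 0.0
    for word in words:
        sims = [cosine(word, region) for region in regions]
        best = max(range(len(regions)), key=sims.__getitem__)  # most similar region
        grounding.append(best)
        total += sims[best]
    return total / len(words), grounding
```

The returned grounding is the by-product that later rendering steps can use to decide which regions to keep.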

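We employ a contrastive loss to train the dense matching model, which is:

$$
L(s_i, I_i) = [\Delta - F(s_i, I_i) + F(s_i, I_j)]_+ + [\Delta - F(s_i, I_i) + F(s_j, I_i)]_+
\tag{7}
$$

where $[\cdot]_+ = \max(0, \cdot)$, $\Delta$ is the margin, $(s_i, I_i)$ is a matched pair while $(s_i, I_j)$ and $(s_j, I_i)$ are mismatched pairs. The overall loss function is the average of $L(s_i, I_i)$ on all pairs in the training dataset.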
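For a single matched pair, a margin-based contrastive loss of this kind can be computed as in the sketch below. It assumes the standard bidirectional hinge form with one mismatched image and one mismatched sentence; the margin value 0.2 is an illustrative assumption, not a value stated in the text.

```python
def pair_contrastive_loss(
    sim_matched: float,
    sim_wrong_image: float,
    sim_wrong_sentence: float,
    margin: float = 0.2,  # assumed value for illustration only
) -> float:
    """Hinge loss that asks the matched pair to beat both mismatched pairs by `margin`."""
    image_term = max(0.0, margin - sim_matched + sim_wrong_image)
    sentence_term = max(0.0, margin - sim_matched + sim_wrong_sentence)
    return image_term + sentence_term
```

The loss is zero once the matched similarity exceeds both mismatched similarities by at least the margin, so well-separated pairs stop contributing gradients.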

After training, the dense matching model not only can retrieve relevant images for each sentence, but also can ground each word in the sentence to the most relevant image regions, which provides useful clues for the following rendering.

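Input: sentence $s$; candidate image set $C$;
Output: selected image sequence $I'$;
  Divide $s$ into phrase chunks $p_1, \dots, p_m$ via constituency parsing, where each $p_i$ is composed of sequential words;
  Compute the phrase-image similarity $F(p_i, I)$ for each chunk $p_i$ and each image $I \in C$ based on the dense matching model;
  $I' \leftarrow \emptyset$ and $R \leftarrow \emptyset$;
  for $i = 1$; $i \le m$; $i{+}{+}$ do
     $T_i \leftarrow$ the top-K images in $C$ ranked by $F(p_i, \cdot)$;
     if $T_i \cap R = \emptyset$ then
        append the top-1 image of $T_i$ to $I'$;
        $R \leftarrow R \cup T_i$;
     end if
  end for
  return $I'$;
Algorithm 1 Retrieving a complementary sequence of images to visualize one sentence in a greedy way.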

4.3. One-to-Many Coverage Enhancement

In order to cover as many details of the story as possible, it is sometimes insufficient to retrieve only one image, especially when the sentence is long. The challenge for such one-to-many retrieval is that we do not have corresponding training data, and whether multiple images are required depends on the candidate images. Therefore, we propose a greedy decoding algorithm to automatically retrieve multiple complementary images to enhance the coverage of story contents.

It firstly segments the sentence into multiple phrase chunks via constituency parsing. The dense matching model can be used to compute phrase-image similarities for each chunk. Then we greedily select top-K images for each chunk because the top-K results are usually similar. If the top-K retrieved images for a new chunk haven’t been retrieved in previous chunks, it is necessary to visualize the chunk with additional images to cover more details in the sentence. Otherwise using multiple images can be redundant. In this greedy decoding way, we can automatically detect the necessity of one-to-many mapping and retrieve complementary images. The detailed algorithm is provided in Algorithm 1.
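The greedy decoding described above can be sketched as follows. The sketch assumes that constituency parsing and phrase-image scoring have already been done, so that each chunk arrives as a list of image ids ranked best first; `greedy_one_to_many` and the default `k=3` are hypothetical choices for illustration.

```python
from typing import List, Sequence, Set


def greedy_one_to_many(chunk_rankings: Sequence[Sequence[str]], k: int = 3) -> List[str]:
    """Pick one image per phrase chunk, but only when the chunk's top-k is new.

    chunk_rankings[i] lists image ids for the i-th phrase chunk, best first.
    """
    selected: List[str] = []
    seen: Set[str] = set()  # images already covered by earlier chunks
    for ranking in chunk_rankings:
        top_k = list(ranking[:k])
        if not top_k:
            continue
        # A chunk whose top-k overlaps earlier results is treated as already visualized.
        if seen.isdisjoint(top_k):
            selected.append(top_k[0])
            seen.update(top_k)
    return selected
```

Because the first chunk always finds an empty `seen` set, at least one image is returned for any non-empty ranking; further images are added only when a later chunk retrieves entirely new candidates.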

5. Storyboard Creator

Though retrieved image sequences are cinematic and able to cover most details in the story, they have the following three limitations against high-quality storyboards: 1) there might exist irrelevant objects or scenes in the image that hinders overall perception of visual-semantic relevancy; 2) images are from different sources and differ in styles which greatly influences the visual consistency of the sequence; and 3) it is hard to maintain characters in the storyboard consistent due to limited candidate images.

In order to alleviate above limitations, we propose the storyboard creator to further refine retrieved images to improve relevancy and consistency. The creator consists of three modules: 1) automatic relevant region segmentation to erase irrelevant regions in the retrieved image; 2) automatic style unification to improve visual consistency on image styles; and 3) a semi-manual 3D model substitution to improve visual consistency on characters.

(a) Original image.
(b) Dense match.
(c) Mask R-CNN.
(d) Fusion result.
Figure 3. The dense matching and Mask R-CNN models are complementary for relevant region segmentation.

5.1. Relevant Region Segmentation

Since the dense visual-semantic matching model grounds each word with a corresponding image region, a naive approach to erase irrelevant regions is to only keep grounded regions. However, as shown in Figure 3(b), although grounded regions are correct, they might not precisely cover the whole object because the bottom-up attention (anderson2018bottom, ) is not especially designed to achieve high segmentation quality. The current Mask R-CNN model (he2017mask, ) is able to obtain better object segmentation results. However, it cannot distinguish whether objects are relevant to the story, as shown in Figure 3(c), and it also cannot detect scenes. Since these two methods are complementary to each other, we propose a heuristic algorithm to fuse the two approaches to segment relevant regions precisely.

For each region grounded by the dense matching model, we align it with an object segmentation mask from the Mask R-CNN model. If the overlap between the grounded region and the aligned mask is below a certain threshold, the grounded region is likely to be a relevant scene. In such a case, we keep the grounded region. Otherwise the grounded region belongs to an object, and we utilize the precise object boundary mask from Mask R-CNN to erase irrelevant backgrounds and complete relevant parts. As shown in Figure 3(d), the proposed fusion method improves over each separate model and improves overall image relevancy.
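The fusion heuristic above can be sketched with regions and masks represented as sets of pixel coordinates. The text does not specify the overlap measure or the threshold, so intersection-over-union and a threshold of 0.5 are assumptions made for illustration, and `fuse_region` is a hypothetical name.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]


def box(row0: int, col0: int, row1: int, col1: int) -> Set[Pixel]:
    """Pixels of a rectangle; row1 and col1 are exclusive."""
    return {(r, c) for r in range(row0, row1) for c in range(col0, col1)}


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse_region(
    grounded: Set[Pixel], masks: Iterable[Set[Pixel]], threshold: float = 0.5
) -> Set[Pixel]:
    """Keep the grounded region for scenes, or swap in the best object mask."""
    best = max(masks, key=lambda mask: iou(grounded, mask), default=None)
    if best is None or iou(grounded, best) < threshold:
        return set(grounded)  # low overlap: likely a scene, keep the grounded region
    return set(best)          # high overlap: use the precise object boundary
```

Applying `fuse_region` to every grounded region and taking the union of the results gives the set of pixels to keep; everything else would be erased.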

5.2. Style Unification

We explore style transfer techniques to unify the styles of retrieved image sequences. Specifically, we convert the real-world images into cartoon style images. On one hand, the cartoon style images maintain the original structures, textures and basic colors, which ensures the advantage of being cinematic and relevant. On the other hand, the cartoon style is more simplified and abstract than the realistic style, so that such cartoon sketch can improve visual consistency of the storyboard. In this work, we utilize a pretrained CartoonGAN (chen2018cartoongan, ) for the cartoon style transfer.

5.3. 3D Character Substitution

It is very challenging to maintain visual consistency on characters of the automatically created storyboard. Therefore, we propose a semi-manual way to address this problem, which involves manual assistance to improve the character coherency.

Since the storyboard created above already presents a good structure and organization of characters and scenes, we resort to manual effort to mimic the automatically created storyboard by changing the scenes and characters into predefined templates. The templates are from the 3D software Autodesk Maya, and can be easily dragged and organized in Maya to construct a 3D picture. The human storyboard artist is asked to select proper templates to replace the original ones in the retrieved image. Each character is placed according to its action and location in the retrieved image, which can greatly reduce costly human design. After the placement of all characters and scenes, we can select a camera view to render the 3D model to a 2D image. Although this step currently still requires some human effort, we will make it more automatic in our future work.

6. Experiment

6.1. Training Datasets

MSCOCO: The MSCOCO (lin2014microsoft, ) dataset belongs to the DII type of training data. It consists of 123,287 images and each image is annotated with 5 independent single sentences. Since the MSCOCO cannot be used to evaluate story visualization performance, we utilize the whole dataset for training.

VIST: The VIST dataset is the only currently available SIS type of dataset. Following the standard dataset split (huang2016visual, ), we utilize 40,071 stories for training, 4,988 stories for validation and 5,055 stories for testing. Each story contains 5 sentences as well as the corresponding ground-truth images.

6.2. Testing Stories in the Wild

To evaluate the storyboard creation performance for stories in the wild, we collect 200 stories from three different sources including Chinese idioms, movie scripts and sentences in the GraphMovie plot explanations, which contain 579 sentences in total. For each sentence, we manually annotate its relevancy with the top 100 candidate images from the text-based retrieval as explained in subsection 3.3. An image is relevant to a sentence if it can visualize some parts of the sentence. We find that only 62.9% (364 of 579) of the sentences have relevant images in the top 100 candidates by text retrieval, which demonstrates the difficulty of the task. For those 62.9% of sentences, 10.9% of the candidate images are relevant on average. We have released the testing dataset.

6.3. Experimental Setups

We combine the MSCOCO dataset and VIST training set for training, and evaluate the story-to-image retrieval model on the VIST testing set and our GraphMovie testing set respectively. Since the GraphMovie dataset is in Chinese, we translate the training set from English to Chinese via an automatic machine translation API. In order to cover out-of-vocabulary words beyond the training set, we fix the word embeddings with pretrained word vectors in the retrieval model, with GloVe embeddings (pennington2014glove, ) for English and fastText embeddings (grave2018learning, ) for Chinese. The dimensionality of the joint embedding space is set as 1024. Hard negative mining is applied to select negative training pairs within a mini-batch (faghri2018vse++, ) in Eq (7) for the MSCOCO dataset. However, since the sentences in the VIST stories are more abstract than caption descriptions in MSCOCO, one story sentence can be suitable for different images, which makes the selected “hard negatives” noisy. Therefore, we average over all negatives for the VIST dataset to alleviate noisy negatives. The model is trained with the Adam algorithm with a learning rate of 0.0001 and a batch size of 256. We train the model for 100 epochs and select the best one according to the performance on the VIST validation set.
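The difference between hard negative mining and averaging over all in-batch negatives can be illustrated with a small sketch. It assumes a square similarity matrix for a mini-batch in which entry `sim[i][j]` scores sentence `i` against image `j` and the diagonal holds the matched pairs; the margin of 0.2 and the function name are assumptions for illustration.

```python
from typing import Sequence


def batch_contrastive_loss(
    sim: Sequence[Sequence[float]], margin: float = 0.2, hard_negatives: bool = True
) -> float:
    """Average hinge loss over a mini-batch with in-batch negatives.

    hard_negatives=True keeps only the most violating negative per direction;
    hard_negatives=False averages the hinge cost over all negatives instead.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two pairs to form negatives")
    reduce = max if hard_negatives else (lambda costs: sum(costs) / len(costs))
    total = 0.0
    for i in range(n):
        matched = sim[i][i]
        # Mismatched images for sentence i, and mismatched sentences for image i.
        image_costs = [max(0.0, margin - matched + sim[i][j]) for j in range(n) if j != i]
        sentence_costs = [max(0.0, margin - matched + sim[j][i]) for j in range(n) if j != i]
        total += reduce(image_costs) + reduce(sentence_costs)
    return total / n
```

Averaging spreads the penalty across all negatives, so a single "negative" image that actually suits an abstract story sentence has less influence than it would under hard negative mining.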

To make a fair comparison with the previous work (ravi2018show, ), we utilize Recall@K (R@K) as our evaluation metric on the VIST dataset, which measures the percentage of sentences whose ground-truth images are in the top-K of retrieved images. We evaluate under K = 10, 50 and 100 as in (ravi2018show, ). For the GraphMovie testing dataset, since the number of candidate images for each sentence is at most 100, we evaluate R@K with K = 1, 5 and 10. We also evaluate on common retrieval metrics including median rank (MedR), mean rank (MeanR) and mean average precision (MAP).
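The retrieval metrics named above can be computed as in the following sketch. It assumes each query comes with a ranked list of binary relevance labels (1 for a relevant image), that the rank used for R@K, MedR and MeanR is the rank of the first relevant image, and that every query has at least one relevant image in its list; these conventions are assumptions, since the text does not spell them out.

```python
from statistics import mean, median
from typing import Dict, List, Sequence


def first_relevant_rank(relevance: Sequence[int]) -> int:
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return rank
    raise ValueError("ranking contains no relevant image")


def average_precision(relevance: Sequence[int]) -> float:
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant position
    return mean(precisions) if precisions else 0.0


def retrieval_metrics(
    rankings: Sequence[Sequence[int]], ks: Sequence[int] = (1, 5, 10)
) -> Dict[str, float]:
    ranks: List[int] = [first_relevant_rank(r) for r in rankings]
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MedR"] = float(median(ranks))
    metrics["MeanR"] = float(mean(ranks))
    metrics["MAP"] = 100.0 * mean([average_precision(r) for r in rankings])
    return metrics
```

Higher is better for R@K and MAP, while lower is better for MedR and MeanR.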

Model                             R@10    R@50    R@100
CNSI (ravi2018show, )             0       1.5     4.5
No Context (karpathy2015deep, )   11.24   28.38   39.15
CADM w/o attn                     12.98   32.84   44.47
CADM                              13.65   33.91   45.53
Table 1. Story-to-image retrieval performance on the VIST testing set. All scores are reported as percentage (%).

6.4. Quantitative Results

We first evaluate the story-to-image retrieval performance on the in-domain dataset VIST. We compare our CADM model with two state-of-the-art baselines and one variant:

  • CNSI (ravi2018show, ): global visual semantic matching model which utilizes hand-crafted coherence feature as encoder.

  • No Context (karpathy2015deep, ): the state-of-the-art dense visual semantic matching model for text-to-image retrieval.

  • CADM w/o attn: variant of CADM model, which does not use attention to dynamically compute context but averages representations of other sentences as fixed context.

Table 1 presents the story-to-image retrieval performance of the four models on the VIST testing set. The “No Context” model achieves significant improvements over the previous CNSI (ravi2018show, ) method, which is mainly attributed to the dense visual semantic matching with bottom-up region features instead of global matching. The CADM model without attention, which uses a fixed context, improves over the “No Context” model, which demonstrates the importance of contextual information for story-to-image retrieval. Our proposed CADM model further achieves the best retrieval performance because it can dynamically attend to relevant story context and ignore noise from the context.

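| Model | R@1 | R@5 | R@10 | MedR | MeanR | MAP |
| --- | --- | --- | --- | --- | --- | --- |
| Text | 26.37 | 49.73 | 62.64 | 5.0 | 13.6 | 24.6 |
| No Context | 25.00 | 53.30 | 65.66 | 4.0 | 11.1 | 26.4 |
| CADM | 26.65 | 54.95 | 67.03 | 3.0 | 10.4 | 27.6 |
| CADM+Text | 32.97 | 61.26 | 74.73 | 2.0 | 7.9 | 31.8 |

Table 2. Story-to-image retrieval performance on the GraphMovie testing set. R@K and MAP scores are reported as percentages (%); MedR and MeanR are ranks.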

Then we explore how well our retriever generalizes to out-of-domain stories in the constructed GraphMovie testing set. We compare the CADM model with text retrieval based on the paired sentence annotations of the GraphMovie testing set and with the state-of-the-art “No Context” model. As shown in Table 2, the purely visual-based retrieval models (No Context and CADM) outperform text retrieval on most metrics, since the annotated texts are noisy descriptions of the image content. Among the visual-based retrieval models, the proposed CADM model also outperforms the “No Context” model on out-of-domain stories, which further demonstrates its effectiveness even in more difficult scenarios. It achieves a median rank of 3, which indicates that a relevant image appears among the top 3 retrieved images for at least half of the story sentences. Our CADM model is also complementary to text retrieval, as seen in the last row of Table 2. When we fuse the visual retrieval scores and text retrieval scores with weights of 0.9 and 0.1 respectively, we achieve the best performance on the GraphMovie testing set.
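The score fusion described above can be sketched as a weighted sum over candidate images. The function names, the dictionary-based score format, and the example scores are hypothetical, and the sketch assumes that the visual and text scores are already on comparable scales; only the 0.9 and 0.1 weights come from the text.

```python
def fuse_scores(visual_scores, text_scores, w_visual=0.9, w_text=0.1):
    """Weighted sum of per-candidate scores from the two retrievers."""
    return {
        image_id: w_visual * visual_scores[image_id]
        # Assumption: a candidate missing from the text index scores 0.
        + w_text * text_scores.get(image_id, 0.0)
        for image_id in visual_scores
    }


def rank_candidates(scores):
    """Return candidate ids sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical similarity scores (illustrative only, not paper data).
visual = {"img_a": 0.60, "img_b": 0.62, "img_c": 0.10}
text = {"img_a": 0.90, "img_b": 0.20, "img_c": 0.30}
ranking = rank_candidates(fuse_scores(visual, text))
```

In this toy example the visual scores alone would place img_b first, while the small text weight is enough to move img_a to the top, which illustrates how the two signals can complement each other.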

| Model | R@1 | R@5 | R@10 | MedR | MeanR | MAP |
| --- | --- | --- | --- | --- | --- | --- |
| Text | 14.01 | 36.71 | 51.21 | 9.0 | 18.2 | 17.4 |
| No Context | 21.26 | 48.31 | 61.35 | 5.0 | 12.5 | 22.9 |
| CADM | 22.22 | 51.21 | 65.22 | 4.0 | 10.5 | 25.5 |
| CADM+Text | 25.60 | 51.21 | 67.15 | 4.0 | 9.9 | 26.5 |

Table 3. Story-to-image retrieval performance on the subset of the GraphMovie testing set that only includes Chinese idioms or movie scripts. R@K and MAP scores are reported as percentages (%); MedR and MeanR are ranks.

Since the GraphMovie testing set contains sentences from the text retrieval indexes, it can exaggerate the contribution of text retrieval. Therefore, in Table 3 we remove this type of testing story from the evaluation, so that the testing stories only include Chinese idioms or movie scripts that do not overlap with the text indexes. We can see that the text retrieval performance decreases significantly compared with Table 2. Nevertheless, our visual retrieval performance is almost comparable across different story types, which indicates that the proposed visual-based story-to-image retriever generalizes to different types of stories.

| Model | good | so-so | bad |
| --- | --- | --- | --- |
| No Context | 38.4 | 36.0 | 25.6 |
| CADM | 52.2 | 25.4 | 22.4 |

Table 4. Human evaluation performance on 30 selected Chinese idioms. All scores are reported as percentages (%).

Figure 4. Example of one story visualization of different retrieval models and rendering steps for an out-of-domain story. Images of different colors correspond to sentences with the same color in the story. Best viewed in color.

6.5. Qualitative Results

Due to the subjectivity of the storyboard creation task, we further conduct a human evaluation of the created storyboards in addition to the quantitative evaluation. We randomly select 30 Chinese idioms, comprising 157 sentences, for the human evaluation. The “No Context” retrieval model and the proposed CADM model are each used to select the top 3 images for each sentence. We ask human annotators to classify the best of the top 3 retrieved images for each sentence into three levels based on the visualization quality: good, so-so, and bad. Each sentence is annotated by three annotators and the majority vote is used as the final annotation. Table 4 presents the human evaluation results of the two models. The CADM model receives significantly better human ratings than the baseline model. According to the user study, 77.6% of sentences are visualized reasonably (good or so-so).
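The majority-vote aggregation can be illustrated with the short sketch below. The function names and example annotations are hypothetical. With three annotators and three labels, a three-way split is possible; the text does not say how such cases were resolved, so the fallback to "so-so" here is purely an assumption.

```python
from collections import Counter


def majority_label(votes, fallback="so-so"):
    """Return the label chosen by more than half of the annotators.

    Assumption: a three-way split falls back to `fallback`; the source
    does not specify how such ties were handled.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else fallback


def label_distribution(all_votes):
    """Percentage of sentences assigned to each final label."""
    finals = Counter(majority_label(v) for v in all_votes)
    total = len(all_votes)
    return {label: 100.0 * n / total for label, n in finals.items()}


# Hypothetical annotations for four sentences, three annotators each.
annotations = [
    ("good", "good", "so-so"),
    ("bad", "bad", "bad"),
    ("good", "so-so", "bad"),  # three-way split, resolved by the fallback
    ("good", "good", "good"),
]
distribution = label_distribution(annotations)
```

The resulting per-label percentages have the same form as the rows of Table 4.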

In Figure 4, we visualize the outputs of the different retrievers and rendering steps of the proposed inspire-and-create framework for the Chinese idiom “Drawing cakes to fill the hunger”. The contextual-aware CADM retriever selects more accurate and consistent images than the baseline, especially for the third sentence in this case. The proposed greedy decoding algorithm further improves the coverage of long sentences by automatically retrieving multiple complementary images from the candidates. The relevant region segmentation rendering step erases irrelevant backgrounds or objects in the retrieved images, though there are minor mistakes in the automatic results. Through style unification we obtain a more visually consistent storyboard. The last row shows the manually assisted 3D model substitution rendering step, which mainly borrows the composition of the automatically created storyboard but replaces the main characters and scenes with templates. The final human-assisted version is nearly comparable to a professional storyboard. We will make this step more automatic in future work. Overall, we have integrated the whole system into a commercial product that tells visual stories to children.

7. Conclusion

In this work, we focus on a new multimedia task of storyboard creation, which aims to generate a sequence of images to illustrate a story containing multiple sentences. We tackle the problem with a novel inspire-and-create framework, which includes a story-to-image retriever that selects relevant cinematic images for visual inspiration and a creator that further refines the images to improve relevancy and visual consistency. For the retriever, we propose a contextual-aware dense matching model (CADM), which dynamically employs contextual information in the story with hierarchical attentions and applies dense visual-semantic matching to accurately retrieve and ground images. For the creator, we propose two fully automatic rendering steps, relevant region segmentation and style unification, and one semi-manual step to substitute coherent characters. Extensive experimental results on in-domain and out-of-domain visual story datasets demonstrate the effectiveness of the proposed inspire-and-create model. We achieve better performance than the state-of-the-art baselines for storyboard creation in both objective and subjective evaluations, and the qualitative visualization further verifies that our approach is able to create high-quality storyboards even for stories in the wild. In the future, we plan to recognize the picturing angles of images, learn how they are placed in movies, and guide the neural storyboard artist to create more professional and intelligent storyboards.

The authors would like to thank Qingcai Cui for cinematic image collection, Yahui Chen and Huayong Zhang for their efforts in 3D character substitution. This work was supported by National Natural Science Foundation of China (No. 61772535), Beijing Natural Science Foundation (No. 4192028), and National Key Research and Development Plan (No. 2016YFB1001202).


  • [1] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pages 1060–1069, 2016.
  • [2] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
  • [3] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
  • [4] Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. To create what you tell: Generating videos from captions. In Proceedings of the 25th ACM international conference on Multimedia, pages 1789–1798. ACM, 2017.
  • [5] Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization. 2019.
  • [6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [7] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
  • [8] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
  • [9] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic embeddings with hard negatives. 2018.
  • [10] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 299–307, 2017.
  • [11] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
  • [12] Hareesh Ravi, Lezi Wang, Carlos Muniz, Leonid Sigal, Dimitris Metaxas, and Mubbasir Kapadia. Show me a story: Towards coherent neural story illustration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7613–7621, 2018.
  • [13] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766, 2019.
  • [14] George Lakoff. Explaining embodied cognition results. Topics in cognitive science, 4(4):773–785, 2012.
  • [15] Bei Liu, Jianlong Fu, Makoto P Kato, and Masatoshi Yoshikawa. Beyond narrative description: Generating poetry from images by multi-adversarial training. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 783–791. ACM, 2018.
  • [16] Nanxing Li, Bei Liu, Zhizhong Han, Yu-Shen Liu, and Jianlong Fu. Emotion reinforced visual storytelling. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pages 297–305. ACM, 2019.
  • [17] Shuang Ma, Jianlong Fu, Chang Wen Chen, and Tao Mei. Da-gan: Instance-level image translation by deep attention generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5657–5666, 2018.
  • [18] Zizhao Zhang, Yuanpu Xie, and Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6199–6208, 2018.
  • [19] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.
  • [20] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5005–5013, 2016.
  • [21] Yan Huang, Qi Wu, Chunfeng Song, and Liang Wang. Learning semantic concepts and order for image and sentence matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6163–6171, 2018.
  • [22] Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems, pages 1889–1897, 2014.
  • [23] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–216, 2018.
  • [24] Gunhee Kim, Seungwhan Moon, and Leonid Sigal. Ranking and retrieval of image sequences from multiple paragraph queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1993–2001, 2015.
  • [25] Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1233–1239, 2016.
  • [26] Jason Weston, Emily Dinan, and Alexander H Miller. Retrieve and refine: Improved sequence generation models for dialogue. arXiv preprint arXiv:1808.04776, 2018.
  • [27] Tatsunori B Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S Liang. A retrieve-and-edit framework for predicting structured outputs. In Advances in Neural Information Processing Systems, pages 10052–10062, 2018.
  • [28] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [29] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
  • [30] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [31] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [32] Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. Cartoongan: Generative adversarial networks for photo cartoonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9465–9474, 2018.
  • [33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [34] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [35] Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.