StoryGAN: A Sequential Conditional GAN for Story Visualization

12/06/2018 ∙ by Yitong Li, et al.

In this work, we propose a new task called Story Visualization. Given a multi-sentence paragraph, the story is visualized by generating a sequence of images, one for each sentence. In contrast to video generation, story visualization focuses less on the continuity of the generated images (frames) and more on the global consistency across dynamic scenes and characters -- a challenge that has not been addressed by any single-image or video generation method. We therefore propose a new story-to-image-sequence generation model, StoryGAN, based on the sequential conditional GAN framework. Our model is unique in that it consists of a deep Context Encoder that dynamically tracks the story flow, and two discriminators, at the story and image levels respectively, that enhance the image quality and the consistency of the generated sequences. To evaluate the model, we modified existing datasets to create the CLEVR-SV and Pororo-SV datasets. Empirically, StoryGAN outperforms state-of-the-art models on image quality, contextual consistency metrics, and human evaluation.


1 Introduction

Learning to generate meaningful and coherent sequences of images from a natural language story is a challenging task that requires understanding and reasoning on both natural language and images. In this work, we propose a new Story Visualization task. Specifically, the goal is to generate a sequence of images to describe a story written in a multi-sentence paragraph as shown in Figure 1.

Figure 1: The input story is “Pororo and Crong are fishing together. Crong is looking at the bucket. Pororo has a fish on his fishing rod.” Each sentence is visualized with one image. In this work, the image generation for each sentence is enriched with contextual information from the Context Encoder. Two discriminators at different levels guide the generation process.

There are two main challenges in this task. First, the sequence of images must consistently and coherently depict the whole story. This task is highly related to text-to-image generation [33, 26, 15, 34, 32], where an image is generated from a short description. However, sequentially applying text-to-image methods to a story does not produce a coherent image sequence and thus fails on the story visualization task. For instance, consider the story “A red metallic cylinder cube is at the center. Then add a green rubber cube at the right.” The second sentence alone does not capture the entire scene, so an image generated from it alone would miss the object introduced by the first sentence.

The second challenge is how to display the logic of the storyline. Specifically, the appearance of objects and the layout of the background must evolve coherently as the story progresses. This is similar to video generation. However, story visualization and video generation differ in two ways: (i) video clips are continuous with smooth motion transitions, so video generation models focus on extracting dynamic features to maintain realistic motion [30, 29], whereas story visualization aims to generate a sequence of key static frames that present the correct story plot, and motion features are less important; (ii) video clips are often based on a simple sentence input and typically have a static background, while complex stories require the model to capture the scenery changes necessary for the plot line. In that sense, story visualization can also be viewed as a critical step towards real-world long-video generation that captures sharp scene changes. To tackle these challenges, we propose the StoryGAN framework, inspired by Generative Adversarial Nets (GANs) [8], a two-player game between a generator and a discriminator. To take into account the contextual information in the sequence of sentence inputs, StoryGAN is designed as a sequential conditional GAN model.

Given a multi-sentence paragraph (story), StoryGAN uses a recurrent neural network (RNN) to incorporate the previously generated images into the current sentence’s image generation. Contextual information is extracted with our Context Encoder module, which stacks a GRU cell and our newly proposed Text2Gist cell. The Context Encoder transforms the current sentence and a story encoding vector into a high-dimensional feature vector (the Gist) for further image generation. As the story proceeds, the Gist is dynamically updated to reflect the change of objects and scenes in the story flow. In the Text2Gist component, the sentence description is transformed into a filter and adapted to the story, so that we can optimize the mixing process by tweaking the filter. Similar ideas are also used in dynamic filtering [16], attention models [32] and meta-learning [25].

To ensure consistency across the sequence of the generated images, we adopt a two-level GAN framework. We use an image-level discriminator to measure the relevance of a sentence and its generated image, and a story-level discriminator to measure the global coherence between the generated image sequence and the whole story.

We created two datasets from the existing CLEVR [17] and Pororo [19] datasets for our story visualization task, called CLEVR-SV and Pororo-SV, respectively. Empirically, StoryGAN captures the full picture of the story and how it evolves more effectively than existing baselines [34, 22]. Equipped with the deep Context Encoder module and the two-level discriminators, StoryGAN significantly outperforms previous state-of-the-art models, generating higher-quality image sequences that are coherent with the story, as measured by image quality and global consistency metrics as well as human evaluation.

2 Related Work

Variational AutoEncoders (VAEs) [21], Generative Adversarial Nets (GANs) [8], and flow-based generative models [6, 7] have been applied to a wide range of generation tasks, including text-to-image generation, video generation, style transfer, and image editing. Story visualization falls into this broad category of generative tasks, but has several distinct aspects.

Most relevant to the story visualization task is conditional text-to-image generation [26, 15, 36, 33], which can now produce high-resolution, realistic images [34, 32]. An important direction in text-to-image generation is understanding longer and more complex input texts. For example, this has been explored in dialogue-to-image generation, where the input is a complete dialogue session rather than a single sentence [27]. Another related task is textual image editing, which edits an input image according to a textual editing query [3, 28]; this task requires consistency between the original image and the output image. Finally, there is the task of placing pre-specified images and objects in a picture based on a text description [18]. This task also relates text to a consistent image, but does not require a full image generation procedure.

A second task closely related to story visualization is video generation, especially text-to-video [22, 11] or image-to-video generation [1, 29, 30]. Existing approaches only generate short video clips [11, 4, 10] without scene changes. The biggest challenge in video generation is ensuring a smooth motion transition across successive video frames. To this end, researchers disentangle dynamic and static features for motion and background, respectively [30, 22, 29, 5]. In our modeling of story visualization, the whole story sets the static features and each input sentence encodes dynamic features. However, there are several differences: conditional video generation has only one input, while our task has a sequential, evolving input; and the motion in video clips is continuous within a single scene, while the generated images that visualize a story are discrete and often represent different scenes. Moreover, the conditioning information is not limited to text: trajectory-, skeleton- and landmark-based video generation has also been studied in several existing works [10, 35, 31].

There are also several other related tasks in the literature. For example, story illustration retrieves images from a pre-collected training set rather than generating them [24], and cartoon generation has been explored with a “cut and paste” technique [9]. However, both of these techniques require large amounts of labeled training data. An inverse task to story visualization is visual storytelling, where the output is a paragraph describing a sequence of input images; text generation models and reinforcement learning are often used for visual storytelling [14, 23, 13].

3 StoryGAN

Figure 2: The framework of StoryGAN. The variables in gray solid circles are the input story $S$ and the individual sentences $s_1, \dots, s_T$, together with the random noise $\epsilon_1, \dots, \epsilon_T$. The generator network contains the Story Encoder, the Context Encoder and the image generator. The proposed Text2Gist component is introduced in detail in Section 3.2. On top, two discriminators judge whether each image-sentence pair and each image-sequence-story pair is real or fake.

StoryGAN is designed to create a sequence of images that describes an input story $S$. The story consists of a sequence of sentences $(s_1, \dots, s_T)$, where the length $T$ may vary. One image is generated per sentence, denoted $(\hat{x}_1, \dots, \hat{x}_T)$; the sequence should be both locally (sentence-image) and globally (story-image-sequence) consistent. For training, the ground-truth images are denoted $(x_1, \dots, x_T)$. The image sequence is locally consistent if each image semantically matches its corresponding sentence, and globally consistent if all the images together cohere with the full story they visualize. In our approach, each sentence in story $S$ is encoded into an embedding vector using a pre-trained sentence encoder [2]. With a slight abuse of notation, we use $s_t$ to denote the encoded vector of the $t$-th sentence. In the following, we assume $s_t$ and $S$ are both encoded feature vectors rather than raw text.

The overall architecture of StoryGAN is presented in Figure 2. It is implemented as a sequential GAN model, which consists of (i) a Story Encoder that encodes $S$ into a low-dimensional vector $h_0$; (ii) a two-layer recurrent neural network (RNN) based Context Encoder that encodes the input sentence $s_t$ and its contextual information into a vector $o_t$ (the Gist) at each time point $t$; (iii) an image generator that generates image $\hat{x}_t$ based on $o_t$ for each time step $t$; and (iv) an image discriminator and a story discriminator that guide the image generation process so as to ensure that the generated image sequence is locally and globally consistent, respectively.

The image generator follows a standard deep CNN structure, so our focus is on the Story Encoder, the Context Encoder, the image and story discriminators, and the way StoryGAN is trained.

3.1 Story Encoder

The Story Encoder is shown in the dotted pink box of Figure 2. Following the conditioning mechanism in StackGAN [34], the Story Encoder learns a stochastic mapping from the story $S$ to a low-dimensional embedding vector $h_0$. $h_0$ encodes the whole story and serves as the initial hidden state of the Context Encoder. Specifically, the Story Encoder samples the embedding vector from a normal distribution $\mathcal{N}(\mu(S), \Sigma(S))$, with $\mu(\cdot)$ and $\Sigma(\cdot)$ implemented as two neural networks. In this work, we restrict $\Sigma(S)$ to a diagonal matrix for computational tractability. With the reparameterization trick, the encoded story can be written as $h_0 = \mu(S) + \Sigma(S)^{1/2} \odot \epsilon_S$, where $\epsilon_S \sim \mathcal{N}(0, I)$. Here $\odot$ represents elementwise multiplication, and the square root is also taken elementwise. $\mu(\cdot)$ and $\Sigma(\cdot)$ are parameterized as Multi-Layer Perceptrons (MLPs) with a single hidden layer; convolutional networks could also be used, depending on the structure of $S$. The sampled $h_0$ is provided to the RNN-based Context Encoder as the initial state vector.

By using stochastic sampling, the Story Encoder deals with the discontinuity problem in the original story space: it not only leads to a compact, semantic representation of $S$ for story visualization, but also adds randomness to the generation process. The encoder's parameters are optimized jointly with the other modules of StoryGAN via back-propagation. To enforce smoothness over the conditional manifold in the latent semantic space and to avoid collapsing to a single generative point rather than a distribution, we add the regularization term [34]

$$ \mathcal{L}_{KL} = \mathrm{KL}\big( \mathcal{N}(\mu(S), \Sigma(S)) \,\|\, \mathcal{N}(0, I) \big), \qquad (1) $$

which is the Kullback-Leibler (KL) divergence between the learned distribution and the standard Gaussian distribution.
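As a concrete illustration, below is a minimal PyTorch sketch of a stochastic story encoder of this kind, returning both $h_0$ and the KL term of Eq. (1). The layer sizes and module layout are assumptions made for illustration, not the configuration in Appendix A.

```python
import torch
import torch.nn as nn

class StoryEncoder(nn.Module):
    """Stochastic story encoder: maps the story embedding S to h_0 via the
    reparameterization trick and also returns the KL term of Eq. (1)."""

    def __init__(self, story_dim=128 * 5, hidden_dim=128):
        super().__init__()
        # Single-hidden-layer MLPs for the mean and (log-)variance, as in Sec. 3.1.
        self.mu = nn.Sequential(nn.Linear(story_dim, hidden_dim), nn.ReLU(),
                                nn.Linear(hidden_dim, hidden_dim))
        self.logvar = nn.Sequential(nn.Linear(story_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim))

    def forward(self, story):                      # story: (batch, story_dim)
        mu = self.mu(story)
        logvar = self.logvar(story)                # diagonal covariance, in log space
        eps = torch.randn_like(mu)
        h0 = mu + torch.exp(0.5 * logvar) * eps    # h_0 = mu(S) + sigma(S) ⊙ eps
        # KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions, averaged over the batch.
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=1).mean()
        return h0, kl
```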

3.2 Context Encoder

Video generation is closely related to story visualization and it typically assumes a static background with smooth motion transitions, requiring a disjoint embedding of static and dynamic features [30, 14, 29]. In story visualization, the challenge differs in that the characters, motion, and background often change from image to image, as illustrated in Figure 1. This requires us to address two problems: (i) how to update the contextual information to effectively capture background changes and (ii) how to combine new inputs and random noise when generating each image to visualize the change of characters, which may dramatically shift.

We address these issues by proposing a deep RNN based Context Encoder to capture contextual information during sequential image generation, shown in the red box in Figure 2. The context can be defined as any related information in the story that is useful for the current generation. The deep RNN consists of two hidden layers: the lower layer is implemented using standard GRU cells and the upper layer using the proposed Text2Gist cells, which are a variant of GRU cells and are detailed below. At time step $t$, the GRU layer takes as input the concatenation of the sentence $s_t$ and isotropic Gaussian noise $\epsilon_t$, and outputs the vector $i_t$. The Text2Gist cell combines $i_t$ with the story context $h_{t-1}$ (initialized by the Story Encoder) to generate $o_t$, which encodes all the information necessary to generate an image at time $t$. The context $h_t$ is updated by the Text2Gist cell to reflect the change of potential context information.

Let $g_t$ and $h_t$ denote the hidden vectors of the GRU and Text2Gist cells, respectively. The Context Encoder works in two steps to generate its output:

$$ i_t, g_t = \text{GRU}(s_t, \epsilon_t, g_{t-1}), \qquad (2) $$
$$ o_t, h_t = \text{Text2Gist}(i_t, h_{t-1}). \qquad (3) $$

We call $o_t$ the “Gist” vector, since it combines the global and the local context information, from $h_{t-1}$ and $s_t$ respectively, at time step $t$ (i.e., it captures the “gist” of the information). The Story Encoder initializes $h_0$, while $g_0$ is randomly sampled from an isotropic Gaussian distribution.

Next, we give the underlying updates of Text2Gist. Given $i_t$ and $h_{t-1}$ at time step $t$, Text2Gist generates a hidden vector $h_t$ and an output vector $o_t$ as follows:

$$ z_t = \sigma_z(W_z i_t + U_z h_{t-1} + b_z), \qquad (4) $$
$$ r_t = \sigma_r(W_r i_t + U_r h_{t-1} + b_r), \qquad (5) $$
$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \sigma_h\big(W_h i_t + U_h (r_t \odot h_{t-1}) + b_h\big), \qquad (6) $$
$$ o_t = \text{Filter}(i_t) * h_t, \qquad (7) $$

where $z_t$ and $r_t$ are the outputs of the update gate and reset gate, respectively, and $\odot$ is element-wise multiplication. The update gate decides how much information from the previous step should be kept, and the reset gate determines what to forget from $h_{t-1}$. $\sigma_z$, $\sigma_r$ and $\sigma_h$ are sigmoid non-linearity functions. In contrast to standard GRU cells, the output $o_t$ is the convolution between $\text{Filter}(i_t)$ and $h_t$. The filter is learned to adapt to $i_t$: $\text{Filter}(\cdot)$ transforms the vector $i_t$ into a multi-channel filter using a neural network, with $C_{out}$ output channels. Since $h_t$ is a vector, this filter is applied as a 1D convolution, as in a standard convolutional layer.

The convolution operator in Eq. (7) infuses the global contextual information from $h_t$ and the local information from $i_t$. $o_t$ is the output of the Text2Gist cell at time step $t$. Since $i_t$ encodes information from $s_t$ while $h_t$ encodes information from $S$, which reflects the whole picture of the story, the convolution in Eq. (7) can be seen as picking out the important parts of the story during generation. Empirically, we find that Text2Gist is more effective than traditional RNNs for story visualization.
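To make Eqs. (4)-(7) concrete, below is a minimal PyTorch sketch of a Text2Gist-style cell. The hidden size, the number of filter output channels, the kernel length and the padding are illustrative assumptions rather than the configuration in Appendix A, and the grouped 1D convolution is just one way to apply a per-sample predicted kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Text2GistCell(nn.Module):
    """GRU-like cell (Eqs. 4-6) whose output is h_t convolved with a 1D kernel
    predicted from the input i_t (Eq. 7)."""

    def __init__(self, input_dim=128, hidden_dim=128, out_channels=16, kernel_size=31):
        super().__init__()
        self.z_gate = nn.Linear(input_dim + hidden_dim, hidden_dim)   # update gate z_t
        self.r_gate = nn.Linear(input_dim + hidden_dim, hidden_dim)   # reset gate r_t
        self.h_cand = nn.Linear(input_dim + hidden_dim, hidden_dim)   # candidate state
        # Filter(.): maps i_t to a multi-channel 1D kernel adapted to the sentence.
        self.filter_net = nn.Linear(input_dim, out_channels * kernel_size)
        self.out_channels, self.kernel_size = out_channels, kernel_size

    def forward(self, i_t, h_prev):
        gate_in = torch.cat([i_t, h_prev], dim=1)
        z = torch.sigmoid(self.z_gate(gate_in))                       # Eq. (4)
        r = torch.sigmoid(self.r_gate(gate_in))                       # Eq. (5)
        h_tilde = torch.sigmoid(self.h_cand(torch.cat([i_t, r * h_prev], dim=1)))
        h_t = (1 - z) * h_prev + z * h_tilde                          # Eq. (6)
        # Eq. (7): per-sample 1D convolution of h_t with Filter(i_t),
        # implemented here as a grouped convolution over the batch.
        batch, hidden_dim = h_t.shape
        kernels = self.filter_net(i_t).view(batch * self.out_channels, 1, self.kernel_size)
        o_t = F.conv1d(h_t.view(1, batch, hidden_dim), kernels,
                       padding=self.kernel_size // 2, groups=batch)
        return o_t.view(batch, self.out_channels, hidden_dim), h_t    # Gist o_t and new hidden h_t
```

Unrolling the cell over a story then mirrors Eqs. (2)-(3): at each step, a GRU cell consumes the concatenation of $s_t$ and the noise $\epsilon_t$ to produce $i_t$, which is fed to the cell above together with the previous hidden state, initialized with $h_0$ from the Story Encoder.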

3.3 Discriminators

StoryGAN uses two discriminators, an image discriminator and a story discriminator, to ensure the local and global consistency of the story visualization, respectively. The image discriminator measures whether the generated image $\hat{x}_t$ matches the sentence $s_t$, given the initial context information encoded in $h_0$. It does this by comparing the generated triplet $\{s_t, h_0, \hat{x}_t\}$ to the real triplet $\{s_t, h_0, x_t\}$. In contrast to prior work on text-to-image generation [34, 26], the same sentence can lead to a drastically different generated image depending on the context, so it is important to give the encoded context information to the discriminator as well. For example, consider the example given in Section 1, “A red metallic cylinder cube is at the center. Then add a green rubber cube at the right of it.” The second image would vary wildly without the context (i.e., the first sentence).

Figure 3: Structure of the story discriminator. The feature vectors of the images/sentences in the story are concatenated; $\odot$ denotes the elementwise product. The product of the image and text features is fed into a fully connected layer with a sigmoid non-linearity to predict whether the story pair is real or fake.

The story discriminator helps enforce the global consistency of the generated image sequence given the story $S$. It differs from the discriminators used for video generation, which often use 3D convolution [30, 29, 22] to smooth the changes between frames. The overall architecture of the story discriminator is illustrated in Figure 3. The left part is an image encoder, which encodes an image sequence, either real ($X = (x_1, \dots, x_T)$) or generated ($\hat{X} = (\hat{x}_1, \dots, \hat{x}_T)$), into a sequence of feature vectors. These vectors are concatenated into a single vector, shown as the blue rectangle in Fig. 3. Similarly, the right part is a text encoder, which encodes the multi-sentence story $S$ into a sequence of feature vectors, likewise concatenated into one big vector, shown as the red rectangle in Fig. 3. The image encoder is implemented as a deep convolutional network and the text encoder as a multi-layer perceptron; both output vectors of the same dimension.

The global consistency score is computed as

$$ D_S = \sigma\!\left( w^\top \big( E_{img}(X) \odot E_{txt}(S) \big) + b \right), \qquad (8) $$

where $\odot$ is the element-wise product and $E_{img}(\cdot)$ and $E_{txt}(\cdot)$ denote the image and text encoders. The weights $w$ and bias $b$ are learned in the output layer, and $\sigma(\cdot)$ is a sigmoid function that normalizes the score to a value in $[0, 1]$. By pairing each sentence with its image, the story discriminator can consider local matching and global consistency jointly.
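For illustration, here is a minimal sketch of the scoring step in Eq. (8). The image and text encoders are passed in as black boxes (their architectures are listed in Appendix A), and this module interface is an assumption made for the sketch.

```python
import torch
import torch.nn as nn

class StoryDiscriminatorHead(nn.Module):
    """Combines image and text features with an element-wise product and a
    learned linear layer, then applies a sigmoid, as in Eq. (8)."""

    def __init__(self, image_encoder, text_encoder, feat_dim):
        super().__init__()
        self.image_encoder = image_encoder   # CNN: image sequence -> concatenated feature vector
        self.text_encoder = text_encoder     # MLP: story sentences -> concatenated feature vector
        self.out = nn.Linear(feat_dim, 1)    # learned weights w and bias b

    def forward(self, images, story):
        img_feat = self.image_encoder(images)      # (batch, feat_dim)
        txt_feat = self.text_encoder(story)        # (batch, feat_dim)
        logit = self.out(img_feat * txt_feat)      # element-wise product, then w^T(.) + b
        return torch.sigmoid(logit).squeeze(1)     # consistency score in [0, 1]
```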

Both image and story discriminators are trained on positive and negative pairs. The latter are generated by replacing the image (sequence) in the positive pairs with generated ones.

3.4 Algorithm Outlines

Let $\theta$, $\psi_I$ and $\psi_S$ denote the parameters of the whole generator $G$, the image discriminator and the story discriminator, respectively. The objective function for StoryGAN is

$$ \min_{\theta} \max_{\psi_I, \psi_S} \; \alpha \mathcal{L}_{Image} + \beta \mathcal{L}_{Story} + \mathcal{L}_{KL}, \qquad (9) $$

where $\alpha$ and $\beta$ balance the three loss terms and $\mathcal{L}_{KL}$ is the regularization term of the Story Encoder defined in (1). $\mathcal{L}_{Image}$ and $\mathcal{L}_{Story}$ are defined as

$$ \mathcal{L}_{Image} = \sum_{t=1}^{T} \Big( \mathbb{E}_{(x_t, s_t)} \big[ \log D_I(x_t, s_t, h_0; \psi_I) \big] + \mathbb{E}_{(\epsilon_t, s_t)} \big[ \log \big( 1 - D_I(G(\epsilon_t, s_t; \theta), s_t, h_0; \psi_I) \big) \big] \Big), \qquad (10) $$

$$ \mathcal{L}_{Story} = \mathbb{E}_{(X, S)} \big[ \log D_S(X, S; \psi_S) \big] + \mathbb{E}_{(\epsilon, S)} \big[ \log \big( 1 - D_S\big( [G(\epsilon_t, s_t; \theta)]_{t=1}^{T}, S; \psi_S \big) \big) \big], \qquad (11) $$

where $D_I$ and $D_S$ are the image and story discriminators, parameterized by $\psi_I$ and $\psi_S$, respectively.

The pseudo-code for training StoryGAN is given in Algorithm 1. The parameters of the image and story discriminators, $\psi_I$ and $\psi_S$, are updated in two separate loops, while the generator parameters $\theta$ are updated in both loops. The initial hidden state of the Text2Gist layer is the encoded story feature vector $h_0$ produced by the Story Encoder. The detailed configuration of the network is provided in Appendix A.

  Input: Encoded sentence vectors $s_t$ and the corresponding images $x_t$ for $t = 1, \dots, T$.
  Output: Generator parameters $\theta$ and discriminator parameters $\psi_I$ and $\psi_S$.
  for iteration = 1 to MaxIter do
     for $k_I$ = 1 to $K_I$ do
        Sample a mini-batch of story-sentence pairs $\{S, s_t\}$ from the training set.
        Compute $h_0$ as the initialization of the Text2Gist layer and the KL regularization term as in Eq. (1).
        Generate a single output image $\hat{x}_t$.
        Update $\theta$ and $\psi_I$.
     end for
     for $k_S$ = 1 to $K_S$ do
        Sample a mini-batch of story-image-sequence pairs $\{S, X\}$ from the training set.
        Compute $h_0$ and update $h_t$ and $o_t$ at each time step $t$.
        Generate the image sequence $\hat{X} = (\hat{x}_1, \dots, \hat{x}_T)$.
        Update $\theta$ and $\psi_S$.
     end for
  end for
Algorithm 1 Training Procedure of StoryGAN
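The following is a schematic PyTorch version of the two inner loops of Algorithm 1, using the binary cross-entropy form of Eqs. (10)-(11). The generator interface (story_encoder, generate_one, generate_sequence), the optimizers and the weights alpha and beta are assumptions introduced only for this illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def image_phase(gen, d_img, opt_g, opt_di, story, sents, images, alpha):
    """Image-level phase: update the image discriminator (psi_I), then the generator (theta)."""
    t = torch.randint(0, sents.size(1), (1,)).item()   # sample one time step
    h0, kl = gen.story_encoder(story)                  # h_0 and the KL term of Eq. (1)
    fake = gen.generate_one(sents, h0, t)              # hypothetical single-image call, \hat{x}_t

    # Discriminator update (generator graph detached).
    real_p = d_img(images[:, t], sents[:, t], h0.detach())
    fake_p = d_img(fake.detach(), sents[:, t], h0.detach())
    d_loss = F.binary_cross_entropy(real_p, torch.ones_like(real_p)) \
           + F.binary_cross_entropy(fake_p, torch.zeros_like(fake_p))
    opt_di.zero_grad(); d_loss.backward(); opt_di.step()

    # Generator update: fool D_I, plus the KL regularizer.
    fake_p = d_img(fake, sents[:, t], h0)
    g_loss = alpha * F.binary_cross_entropy(fake_p, torch.ones_like(fake_p)) + kl
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

def story_phase(gen, d_story, opt_g, opt_ds, story, sents, images, beta):
    """Story-level phase: update the story discriminator (psi_S), then the generator (theta)."""
    h0, kl = gen.story_encoder(story)
    fake_seq = gen.generate_sequence(sents, h0)        # hypothetical full-sequence call, \hat{x}_1..T

    real_p = d_story(images, story)
    fake_p = d_story(fake_seq.detach(), story)
    d_loss = F.binary_cross_entropy(real_p, torch.ones_like(real_p)) \
           + F.binary_cross_entropy(fake_p, torch.zeros_like(fake_p))
    opt_ds.zero_grad(); d_loss.backward(); opt_ds.step()

    fake_p = d_story(fake_seq, story)
    g_loss = beta * F.binary_cross_entropy(fake_p, torch.ones_like(fake_p)) + kl
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```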

In our experiments, we use Adam [20] for parameter updates. We also find that using different mini-batch sizes for the image and story discriminators may accelerate training convergence, and that it is beneficial to update the generator and the discriminators a different number of times within one epoch.

4 Experiment

In this section, we evaluate the StoryGAN model on one toy dataset and one cartoon dataset. To the best of our knowledge, there is no existing work on our proposed story visualization task. The closest alternative is conditional video generation [22], where the story is treated as a single input and a video is generated in lieu of the sequence of images. However, we empirically found that the video generation results are too blurry to be comparable to StoryGAN. Thus, our comparisons are mainly to ablated versions of our proposed model. For a fair comparison, all models use the same structure for the image generator, Context Encoder and discriminators where applicable. The compared baseline models are:

ImageGAN: ImageGAN follows the work in [26, 34] and does not use the story discriminator, the Story Encoder or the Context Encoder; each image is generated independently. However, for a reasonable comparison, we concatenate the sentence $s_t$, the encoded story $S$ and a noise term as the input; otherwise, the model fails on the task. This is the simplest version of StoryGAN.

SVC: In “Story Visualization by Concatenation” (SVC), the Text2Gist cell in StoryGAN is replaced by simple concatenation of the encoded story and description feature vectors [29]. Compared to ImageGAN, SVC includes the additional story discriminator, and is visualized in Figure 4.

Figure 4: The framework of the baseline model SVC, where the story and individual sentence are concatenated to form the input.

SVFN: In “Story Visualization by Filter Network” (SVFN), the concatenation in SVC is replaced by a filter network: the sentence $s_t$ is transformed into a filter and convolved with the encoded story. Specifically, the image generator input is the convolution of $\text{Filter}(s_t)$ with the encoded story, instead of the $o_t$ from Eq. (7).

4.1 CLEVR-SV Dataset

Figure 5: Comparison among different methods on CLEVR-SV dataset.

The CLEVR [17] dataset was originally used for visual question answering. We modified this data for story visualization by generating images from randomly assigned object layouts (examples in the top row of Figure 5). We name this dataset CLEVR-SV to distinguish it from the existing CLEVR dataset. Specifically, four rules were used to construct CLEVR-SV: (i) the maximum number of objects in one story is limited to four; (ii) objects are made of metal or rubber, with eight different colors and two different sizes; (iii) the object shape can be a cylinder, a cube or a sphere; (iv) objects are added one at a time, resulting in a four-image sequence per story. We generated separate sets of image sequences for training and testing. For our task, the story is the sequence of layout descriptions of the objects.

The input is the current object’s attribute and the relative position given by two real numbers indicating its coordinates. For instance, the first image of the left column of Fig. 5 is generated from “yellow, large, metal, sphere, (-2.1, 2.4).” The following objects are described in the same way. Given the description, the generated objects’ appearance should have little variation from the ground truth and their relative positions should be similar.

Figure 5 shows the generated results from the different competing methods. ImageGAN [26] fails to keep the ‘story’ consistent and mixes up the attributes as the number of objects increases. SVC solves this consistency problem by including the story discriminator and the GRU cell at the bottom, as the third row of Figure 5 shows consistent objects across the image sequence. However, SVC generates an implausible fourth image in the sequence. We hypothesize that simple vector concatenation cannot effectively balance the importance of the current description against the whole story. SVFN alleviates this problem to some extent, but not completely.

In contrast, StoryGAN generates more plausible images than the competitors. We attribute the performance improvement to three components: (i) the Text2Gist cell tracks the progress of the story; (ii) the story and image discriminators keep the objects consistent during generation; and (iii) using the Story Encoder to initialize the Text2Gist cell gives a better result on the first generated image. Greater empirical evidence for this final point appears in the cartoon dataset in Section 4.2.

In order to further validate the StoryGAN model, we designed a task to evaluate whether the model can generate consistent images by changing the first sentence description. Specifically, we randomly replaced the first object’s description while keeping the other three the same during generation, which we visualize in Supplemental Figure 8 in Appendix B. This comparison shows that only StoryGAN can keep the story consistency by correctly utilizing the attributes of the first object in later frames, as discussed above. In Supplemental Figure 9, we give additional examples on changing the initial attributes only using StoryGAN. Regardless of the initial attribute, StoryGAN is consistent between frames.

ImageGAN [26] SVC SVFN StoryGAN
SSIM 0.596 0.641 0.654 0.672
Table 1: SSIM comparison on CLEVR-SV dataset.

We also compare the Structural Similarity Index (SSIM) between the generated images and the ground truth [12]. SSIM was originally used to measure how well a distorted image is recovered; here, it is used to determine whether the generated images are aligned with the input description. Table 1 gives the SSIM metric for each method on the test set. Note that although this is a generative task, using SSIM to measure structural similarity is reasonable because there is little variation given the description. On this task, StoryGAN significantly outperforms the other baselines.
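For reference, SSIM between generated and ground-truth frames can be computed with scikit-image; averaging over all frames of the test sequences, as below, is a plausible reading of the evaluation rather than the authors' exact script.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_ssim(generated, ground_truth):
    """Average SSIM over a collection of image sequences.

    generated, ground_truth: lists of arrays of shape (T, H, W, 3), uint8 in [0, 255].
    """
    scores = []
    for gen_seq, gt_seq in zip(generated, ground_truth):
        for gen_img, gt_img in zip(gen_seq, gt_seq):
            # channel_axis=-1 for color images (older scikit-image versions use multichannel=True).
            scores.append(ssim(gen_img, gt_img, channel_axis=-1, data_range=255))
    return float(np.mean(scores))
```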

4.2 Cartoon Dataset

Figure 6: Two generated samples on the Pororo-SV dataset.

The Pororo dataset [19] was originally used for video question answering, where each one-second video clip is associated with more than one manually written description, and a number of consecutive clips form a complete story; each story has several QA pairs. The dataset contains a large collection of one-second video clips featuring a set of recurring characters, and each manually written description states what is happening and which characters appear in the clip. The video clips are grouped into movie stories [19].

We modified the Pororo dataset to fit the story visualization task by treating the description of each video clip as the story's text input. For each video clip, we randomly pick one frame (the sampling rate is 30 Hz) during training as the real image sample, and five consecutive images form a single story. The resulting description-story pairs are split into a training set and a test set. We call this dataset Pororo-SV to distinguish it from the original Pororo QA dataset [19].

The text encoder uses universal encoding [2] with fixed pre-trained parameters; training a new text encoder empirically gave little performance gain. Two visualized stories from the competing methods are given in Figure 6, with the text input shown at the top. ImageGAN does not generate consistent image sequences; for instance, the generated images switch from indoors to outdoors randomly, and the characters' appearance is inconsistent across the sequence (e.g., Pororo's hat). SVC and SVFN improve the consistency to some extent, but their limitations can be seen in the unsatisfactory first images. In contrast, StoryGAN's first image has much higher quality than the other baselines' because of the use of the Story Encoder to initialize the recurrent cell. This shows the advantage of using the output of the Story Encoder as the first hidden state over random initialization.

To explore how the different models represent the story, we ran experiments where only the character names in the story were changed, shown in Figure 7. Visually, StoryGAN outperforms the other baselines on the image quality and consistency.

Figure 7: Generation results obtained by changing the character names in the same story. The story template is given at the top; the two instances of the story, with different character names substituted, are each shown in a column.

In order to test the model quantitatively, we perform two distinct tasks. The first task is designed to determine whether the generation captures the relevant characters in the story. The nine most common characters are selected from the dataset; their names and pictures are provided in Supplemental Figure 10 in Appendix D. Next, a character image classifier is trained on real images from the training set and applied to both real and generated images from the test set. We compare the classification accuracy (only exact matches across all characters count as correct) on each image/story pair as an indicator of whether the generation is coherent with the story description. The classifier's accuracy on the real test images is 86%, which is considered an upper bound for the task. From the results in Table 2, it is clear that StoryGAN has increased character consistency compared to the baseline models. Note that there is a peculiarity in the labels, as the human-written description can sometimes mention characters not shown in the frame; however, this hurts all algorithms equally, so the comparison remains fair.

Upper Bound ImageGAN [26] SVC SVFN StoryGAN
Acc. 0.86 0.23 0.21 0.24 0.27
Table 2: Character classification accuracy (exact match ratio) comparison on Pororo-SV dataset. The upper bound is the classifier accuracy on the real images associated with the stories.
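To spell out the metric, a frame counts as correct only if the predicted character set matches the labeled set exactly. Below is a minimal sketch; the 0.5 threshold on the classifier's per-character probabilities is our assumption.

```python
import numpy as np

def exact_match_accuracy(pred_probs, labels, threshold=0.5):
    """pred_probs, labels: arrays of shape (num_images, num_characters), where labels
    are 0/1 indicators of which of the nine characters appear in each frame."""
    preds = (pred_probs >= threshold).astype(int)
    correct = np.all(preds == labels, axis=1)     # every character must be predicted correctly
    return float(correct.mean())
```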

Human Evaluation

Automatic metrics cannot fully evaluate the performance of StoryGAN. Therefore, we performed both pairwise and ranking-based human evaluation studies on Amazon Mechanical Turk on Pororo-SV. For both tasks, we use 170 generated image sequences sampled from the test set, each assigned to 5 workers to reduce human variance. The order of the options within each assignment is shuffled to make a fair comparison.

We first performed a pairwise comparison between StoryGAN and ImageGAN. For each input story, the worker is presented with two generated image sequences and asked to make decisions on three aspects: visual quality (the generated images look visually appealing, rather than blurry and difficult to understand), consistency (the generated images are consistent with each other, share a common underlying topic, and naturally form a story, rather than looking like five independent images), and relevance (the generated image sequence accurately reflects the input story and covers the main characters mentioned in it). Results are summarized in Table 3. The standard error on these estimates is small, demonstrating that StoryGAN drastically outperformed ImageGAN on this task.

We next performed ranking-based human evaluation. For each input story, the worker is asked to rank images generated from the four compared models on their overall quality. Results are summarized in Table 4. StoryGAN achieves the highest average rank, while ImageGAN performs the worst. Once again, there is little uncertainty in these estimates, so we are confident that humans prefer StoryGAN on average.

StoryGAN vs ImageGAN
Choice (%) StoryGAN ImageGAN Tie
Visual Quality 74.17 ± 1.38 18.60 ± 1.38 7.23
Consistency 79.15 ± 1.27 15.28 ± 1.27 5.57
Relevance 78.08 ± 1.34 17.65 ± 1.34 4.27
Table 3: Results of pairwise human evaluation. The ± denotes the standard error on the metrics.
Method ImageGAN SVC SVFN StoryGAN
Rank 2.91 ± 0.05 2.42 ± 0.04 2.77 ± 0.04 1.94 ± 0.05
Table 4: Results of ranking-based human evaluation. The ± denotes the standard error on the metrics.

5 Conclusion

We studied the story visualization task as a sequential conditional generation problem. The proposed StoryGAN model addresses the task by jointly considering the current input sentence and the contextual information. This is achieved with the proposed Text2Gist component in the Context Encoder, where the gist reflects the whole story. The ablation tests show that the two-level discriminators and the recurrent structure over the inputs help ensure consistency between the generated images and the story being visualized, while the Context Encoder efficiently provides the image generator with both local and global conditional information. Both quantitative and human evaluation studies show that StoryGAN improves generation compared to the baseline models. As image generators improve, so will the quality of story visualization.

References

Appendix A Network Configuration

This section gives the network structure used in StoryGAN. In the following, ‘CONV’ denotes a 2D convolutional layer, configured by output channel number ‘C’, kernel size ‘K’, stride ‘S’ and padding size ‘P’. ‘LINEAR’ is a fully connected layer, with input and output dimensions given in the parentheses. Note that the Filter Network is contained in the Text2Gist cell and transforms $i_t$ into a filter; this is introduced in detail in Section 3.2.

Layer Story Encoder
1 LINEAR-(128 × T, 128), BN, RELU

Layer Context Encoder
1 LINEAR-(NOISEDIM + TEXTDIM, 128), BN, RELU
2 GRU-(128, 128)
3 Text2Gist-(128, 128)
Layer Filter Network
1 LINEAR-(128, 1024), BN, TANH
2 RESHAPE(16, 1, 1, 64)
Layer Image Generator
1 CONV-(C512, K3, S1, P1), BN, RELU
2 UPSAMPLE-(2,2)
3 CONV-(C256, K3, S1, P1), BN, RELU
4 UPSAMPLE-(2,2)
5 CONV-(C128, K3, S1, P1), BN, RELU
6 UPSAMPLE-(2,2)
7 CONV-(C64, K3, S1, P1), BN, RELU
8 UPSAMPLE-(2,2)
9 CONV-(C3, K3, S1, P1), BN, TANH
Layer Image Discriminator
1 CONV-(C64, K4, S2, P1), BN, LEAKY RELU
2 CONV-(C128, K4, S2, P1), BN, LEAKY RELU
3 CONV-(C256, K4, S2, P1), BN, LEAKY RELU
4 CONV-(C512, K4, S2, P1),BN, LEAKY RELU
5* CONV-(C512, K3, S1, P1), BN, LEAKY RELU
6 CONV-(C1, K4, S4, P0), SIGMOID
Layer Story Discriminator (Image Encoder)
1 CONV-(C64, K4, S2, P1), BN, LEAKY RELU
2 CONV-(C128, K4, S2, P1), BN, LEAKY RELU
3 CONV-(C256, K4, S2, P1), BN, LEAKY RELU
4 CONV-(C512, K4, S2, P1),BN, LEAKY RELU
5 CONV-(C32, K4, S2, P1),BN, CONCAT
6 RESHAPE-(1, 32 × 4 × T)
Layer Story Discriminator (Text Encoder)
1 LINEAR-(128 × T, 32 × 4 × T), BN
Table 5: Network Structure used in StoryGAN. * This layer combines the conditional input and the encoded images.

Appendix B More Examples of CLEVR-SV Dataset

Here we perform the test using the same attributes for the first object, and test whether the models can keep the first object consistent throughout the following generations. Figure 8 compares the results from the different models; Figure 9 gives more samples using StoryGAN.

Figure 8: Method comparison on a task where the original story description is changed in the first sentence. Specifically, the first sentence is now “Large, Rubber, Cyan, Cylinder, at (-0.46, -0.36).” Each column corresponds to one layout of the following three objects. The first row is the original image that will be modified. Note that only StoryGAN keeps the story consistency amongst the compared methods.
Figure 9: An additional example using the same idea as Figure 8. The top row gives two initial setups. The next three rows correspond to StoryGAN generations with different first sentences. For the left column, the attributes of the first object are: ‘Small, Metal, Cyan, Cylinder, at (-2.00, 0.02)’ (original), ‘Small, Metal, Purple, Cylinder, at (-1.15, -0.26)’, ‘Small, Metal, Yellow, Cylinder, at (0.35, 2.00)’ and ‘Large, Rubber, Gray, Cylinder, at (-1.77, -0.07)’, respectively. For the right column, the attributes of the first object are: ‘Large, Metal, Red, Sphere, at (0.52, 0.56)’ (original), ‘Large, Rubber, Brown, Sphere, at (-1.54, 0.85)’, ‘Small, Metal, Yellow, Cylinder, at (-0.85, 2.29)’ and ‘Large, Rubber, Blue, Cube, at (0.15, -0.19)’, respectively. Again, we omit the attribute inputs of the second, third and fourth objects to save space. Note that regardless of the initial description, StoryGAN effectively maintains the story consistency.

Appendix C Significance Test on Pororo-SV Dataset

We perform pairwise t-tests on the human ranking evaluation. As Table 6 shows, StoryGAN's improvement over the other baseline models is statistically significant.

Method ImageGAN SVC SVFN StoryGAN
ImageGAN 1.0 5e-13 0.04 1e-40
SVC 5e-13 1.0 1e-8 4e-14
SVFN 0.04 1e-8 1.0 3e-36
StoryGAN 1e-40 4e-14 3e-36 1.0
Table 6: p-value on the human evaluated ranking test.
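A minimal SciPy sketch of such a pairwise test is given below, assuming the ranks of two models are paired per evaluation item (story × worker assignment).

```python
from scipy.stats import ttest_rel

def pairwise_rank_pvalue(ranks_a, ranks_b):
    """Paired t-test on the ranks two models received for the same evaluation items."""
    t_stat, p_value = ttest_rel(ranks_a, ranks_b)
    return p_value
```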

Appendix D Characters Photo and More Examples of Pororo-SV Dataset

For the classification accuracy compared in Table 2, nine characters are selected: ‘Pororo’, ‘Crong’, ‘Eddy’, ‘Poby’, ‘Loopy’, ‘Petty’, ‘Harry’, ‘Rody’ and ‘Tongtong’. Profile pictures of them are given in Figure 10.

Figure 10: Main character names and corresponding photos from the dataset.

More generated samples on the Pororo-SV dataset are given in Figure 11.

Figure 11: More samples on Pororo-SV test set. For simplicity, we give the ground truth story images instead of the raw story text. The left five columns are generated images. The right five columns are ground truth. Note that there is no need for the generation to exactly match the ground truth.