A Study of Cross-domain Generative Models applied to Cartoon Series

09/29/2017 · by Eman T. Hassan, et al. · Indiana University Bloomington

We investigate Generative Adversarial Networks (GANs) to model one particular kind of image: frames from TV cartoons. Cartoons are particularly interesting because their visual appearance emphasizes the important semantic information about a scene while abstracting out the less important details, but each cartoon series has a distinctive artistic style that performs this abstraction in different ways. We consider a dataset consisting of images from two popular television cartoon series, Family Guy and The Simpsons. We examine the ability of GANs to generate images from each of these two domains, when trained independently as well as on both domains jointly. We find that generative models may be capable of finding semantic-level correspondences between these two image domains, even though training is unsupervised and the data provides no labeled alignments between them.


1 Introduction

Filmmakers and authors may not wish to admit it, but almost every story – and almost every work of literature and art in general – borrows heavily from those that came before it. Sometimes this is explicit: the 2004 movie Phantom of the Opera is a film remake of the famous Andrew Lloyd Webber musical, which was based on an earlier 1976 musical, which in turn was inspired by the 1925 silent film, all of which are based on the 1910 novel by Gaston Leroux. Sometimes the borrowing is for humor – the TV sitcom Modern Family’s episode Fulgencio was a clear parody of The Godfather, for example – or for political expression, such as The Onion’s satirical versions of news stories. Even highly original stories still share common themes and ingredients, like archetypes for characters [10] (“the tragic hero,” “the femme fatale,” etc.) and plot lines [2] (“rags to riches,” “the quest,” etc.).

Training (actual) frames | Novel frames
Figure 1: We apply Generative Adversarial Networks (GANs) to model the styles of two different cartoon series, Family Guy (top) and The Simpsons (bottom), exploring their ability to generate novel frames and find semantic relationships between the two domains.

The fact that stories inspire one another means that the canon of film is full of similarities and latent connections between different works. These connections are sometimes obvious and sometimes subtle and highly debatable. To what extent can computer vision find these connections automatically, based on visual features alone?

As a starting point, here we explore the ability of Generative Adversarial Networks (GANs) to model the style of long-running television series. GANs have shown impressive results in a wide range of image generation problems, including infilling [28], automatic colorization [3], image super-resolution [19], and video frame prediction [23, 37]. Despite this work, many questions remain about when GAN models work well. One problem is that it is difficult to evaluate the results of these techniques objectively, or to understand the mechanisms and failure modes under which they operate. Various threads of work are underway to explore this, including network visualization approaches (e.g. [41]) and attempts to connect deep networks with well-understood formalisms like probabilistic graphical models (e.g. [16]). Another approach is to simply apply GANs to various novel case studies that may give new insight into when they work and when they fail, as we do here. While predicting frames from individual videos has been studied [37], joint models of entire series may offer new insights. TV series usually include a set of key characters that are prominently featured in almost every episode, and a supporting set of characters that may appear only occasionally. Most series feature canonical recurring backgrounds or scenes (e.g. the coffee shop in Friends), along with others that occur rarely or even just once.

In this paper, we consider the specific case of television cartoon series. We use TV cartoons because they are more structured and constrained than natural images: they abstract out photo-realistic details in order to focus on high-level semantics, which may make it easier to understand the model learned by a network. Nevertheless, different cartoon series differ significantly in the appearance of characters, color schemes, background sets, artistic designs, etc. In particular, we consider two specific, well-known TV cartoon series: The Simpsons and Family Guy. As our dataset, we sampled about 32,000 frames from four TV seasons (about 80 videos) from each series (Figure 1).

We use these two different cartoon series to explore several questions. To what extent can a network generate novel frames in the style of a particular series? Can a network automatically discover high-level semantic mappings between our two cartoon series, finding similar styles, scenes, and themes? Given a frame in one series, can we find similar high-level scenes (e.g. “people shaking hands”) in the other? Can training from two different series be combined to generate better frames for each individual series? We find, for example, that training both domains together can generate better high-resolution images than training either independently. Our work follows others that have considered mapping problems like image style transfer [12, 45] and text-to-image mappings [32, 8]. Our three contributions are: (1) proposing cartoon series as a fun, useful test domain for GANs, (2) building a structured but highly nontrivial mapping problem that reveals interesting insights about the latent space produced by the GAN, and (3) presenting extensive experimentation in which we vary the training dataset composition and generation techniques in order to study what is captured by the underlying latent space.

2 Related work

Generative networks learn an unsupervised model of a domain such as images or videos, so that the model can generate samples from the latent representation that it learns. For example, Dosovitskiy et al. [9] proposed a deep architecture to generate images of chairs based on a latent representation that encodes chair appearance as a function of attributes and viewpoint. Generative Adversarial Networks (GANs) [13] are a particularly prominent example. GANs model general classification problems as finding an equilibrium point between two players, a generator and a discriminator, where the generator attempts to produce “confusing” examples and the discriminator attempts to correctly classify them. Radford et al. [30] combined CNNs with GANs in their Deep Convolutional GANs (DCGANs). GANs have been applied to domains including images [25, 6], videos [37], and even emojis [36], and to applications including face aging [1], robotic perception [39], colorization [3], color correction [20], editing [44], and in-painting [28].

Conditioned Generative Adversarial Networks (CGANs) [25] introduce a class label in both the generator and discriminator, allowing them to generate images with some specified properties, such as object and part attributes [31]. GAN-CLS [32] uses recurrent techniques to generate text encodings and DCGANs to generate higher-resolution images given input text descriptions, while Reed et al. [33] condition on text, segmentation masks, and part keypoints. Dong et al. [8] use an image captioning module for textual data augmentation to enhance the performance of GAN-CLS. Perarnau et al. [29] modify the GAN architecture to generate images conditioned on specific attributes by training an attribute predictor, while InfoGAN [5] learns a more interpretable latent representation.

Stacked GAN architecture. To generate higher-quality images, coarse-to-fine approaches called Stacked GANs have been proposed. Denton et al. [6] describe a Laplacian pyramid GAN that generates images hierarchically: the first level generates a low resolution image, which is then fed into the subsequent stage, and so on. Zhang et al. [42] propose a two-stage GAN in which the first generates a low resolution image given a text input encoding, while the second improves its quality. Huang et al. [14] propose a multiple-stage GAN in which each stage is responsible for inverting a discriminative bottom-up mapping that is pre-trained for classification. Wang et al. [38] describe two sequential GANs, one that generates surface normals and a second that transforms the surface normals into an indoor image. Yang et al. [40] describe a recursive GAN that generates image backgrounds and foregrounds separately, and then combines them together.

GAN Framework modifications. The original GANs proposed by Goodfellow et al. [13] use a binary cross-entropy loss to train the discriminator network. Zhao et al. [43] employ energy-based objectives such that the discriminator tries to assign high energy to generated samples, while the generator attempts to generate samples with minimal energy; they use an auto-encoder to enhance the stability of the GAN. Metz et al. [24] define the generator objective based on unrolled optimization of the discriminator, optimizing a surrogate loss function. Mao et al. [22] address the problem of vanishing discriminator gradients by proposing a least-squares loss for the discriminator instead of a sigmoid cross-entropy. Salimans et al. [34] propose several training strategies to improve the convergence of GANs, such as feature matching and mini-batch discrimination. Nowozin et al. [27] consider f-divergence measures for training generative models by regarding the GAN as a general variational divergence estimator. Tong et al. [4] address the instability of the network and the mode-collapse problem by introducing two types of regularizers: geometric metrics (e.g., the pixelwise distance between the discriminator features and VGG features) and mode regularizers that penalize missing modes.

Image-to-Image GANs.

Many researchers have proposed GAN-based image-to-image translation techniques, most of which require training data with correspondences between images. Isola et al. [15] propose conditional GANs that map a random vector and an input image from a source domain to an image in the target domain. Sangkloy et al. [35] synthesize images from rough sketches. Karacan et al. [17] generate realistic outdoor scenes based on input image layout and scene attributes. Other work has tried to train without image-to-image correspondences in the training dataset. For example, Taigman et al. [36] use a Domain Transfer Network (DTN) with a compound loss function for mapping between face emojis and face images. Kim et al. [18] propose DiscoGAN, which employs a generative model to learn a bijection between two domains based on loss functions that reduce the distance between the input image and the inverse mapping of the generated image. Zhu et al. [45] employ a cycle-consistency loss to train adversarial networks for image-to-image translation.

3 Approach

Our goal is to study GANs for image generation across two different image domains – in particular, two TV cartoon series. While the last section gave an overview of GANs, we now focus on the techniques to address this particular task.

(a) Coupled GANs (b) GANs with domain adaptation (c) Generator for high-resolution images
Figure 2: Various network architectures that we consider.

3.1 Generative Adversarial Networks

Goodfellow et al. [13] proposed a deep generative model as a min-max game between two agents: (1) a generative network $G$ that models a probability distribution $p_g$ of generated image samples $G(z)$, where the input $z$ has distribution $p_z(z)$, and (2) a discriminator network $D(x)$ which estimates the probability that the input sample $x$ is drawn from the real data distribution $p_{data}$ rather than from the generated distribution $p_g$. Ideally $D(x) = 1$ if $x \sim p_{data}$ and $D(x) = 0$ if $x \sim p_g$. The networks are trained by optimizing

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$

The optimization problem is solved by alternating between gradient ascent steps for the discriminator,

$$\max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],$$

and descent steps on the generator,

$$\min_G \; \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$
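As a minimal illustration of these alternating updates (not the architecture or hyperparameters used in our experiments), a PyTorch-style training step might look like the following; the small fully-connected networks, the 100-dimensional input, and the flattened 64×64 images are placeholders:

```python
import torch
import torch.nn as nn

# Placeholder generator and discriminator; any architectures with matching
# input/output shapes would do here.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 64 * 64 * 3), nn.Tanh())
D = nn.Sequential(nn.Linear(64 * 64 * 3, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

def train_step(real_images):
    """One alternating GAN update: ascend on D, then descend on G.
    `real_images` is a (batch, 64*64*3) tensor of flattened frames."""
    batch = real_images.size(0)
    z = torch.randn(batch, 100)
    fake_images = G(z)

    # Discriminator step: real images get label 1, generated images label 0.
    opt_d.zero_grad()
    loss_d = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake_images.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make D label generated images as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake_images), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```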

3.2 DCGAN

Deep Convolutional GANs (DCGANs) [30] include architectural constraints for deep unsupervised network training, many of which we follow here. Any deep architecture we employ consists of a set of modules (represented by rectangular blocks in our network architecture figures). Each module in the generator consists of fractional-strided convolutions, batch normalization, and ReLU activations for all layers except the output, which uses a hyperbolic tangent. In the discriminator, each block consists of strided convolutions, batch normalization, and LeakyReLU activations.
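For concreteness, here is a sketch of these two module types in PyTorch (our experiments used the publicly available Lua Torch DCGAN code; the kernel sizes and the choice of channel counts below are illustrative, not the exact configuration):

```python
import torch.nn as nn

def generator_block(in_ch, out_ch, final=False):
    """Fractional-strided (transposed) convolution + batch norm + ReLU;
    the output layer instead uses tanh and no batch norm."""
    if final:
        return nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.Tanh())
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

def discriminator_block(in_ch, out_ch):
    """Strided convolution + batch norm + LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True))
```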

3.3 Co-GAN

Coupled GANs (Co-GANs) [21] allow a network to model multiple image domains. A Co-GAN consists of two (or more) GANs, where $G_1, G_2$ and $D_1, D_2$ denote the generative and discriminative networks for the first and second domains, respectively (Figure 2(a)). Since the first layers of a discriminator encode low-level features while the later layers encode high-level features, we force the domains to share semantic information by tying the weights of their final discriminator layers together. Since the flow of information in generative networks is the opposite (the initial layers represent high-level concepts), we tie together the early layers of the generators. In learning, Co-GANs solve a constrained minimax game similar to that of GANs, $\min_{G_1, G_2} \max_{D_1, D_2} V_1(D_1, G_1) + V_2(D_2, G_2)$, subject to the weight-sharing constraints. We will use Co-GANs to explore semantic connections between Family Guy and The Simpsons and compare the results with the other methods described in the following sections.
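The weight-tying scheme can be sketched as follows; this is a minimal PyTorch illustration of the sharing pattern, not the exact Co-GAN architecture, and the layer sizes are placeholders:

```python
import torch.nn as nn

class CoupledGenerators(nn.Module):
    """Two generators that share their early (high-level) layers and keep
    separate output heads for each domain."""
    def __init__(self, z_dim=100):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(z_dim, 512), nn.ReLU())
        self.head1 = nn.Sequential(nn.Linear(512, 64 * 64 * 3), nn.Tanh())  # domain 1
        self.head2 = nn.Sequential(nn.Linear(512, 64 * 64 * 3), nn.Tanh())  # domain 2

    def forward(self, z):
        h = self.shared(z)
        return self.head1(h), self.head2(h)

class CoupledDiscriminators(nn.Module):
    """Two discriminators with separate low-level feature layers but a shared
    final (high-level) classification layer."""
    def __init__(self):
        super().__init__()
        self.feat1 = nn.Sequential(nn.Linear(64 * 64 * 3, 512), nn.LeakyReLU(0.2))
        self.feat2 = nn.Sequential(nn.Linear(64 * 64 * 3, 512), nn.LeakyReLU(0.2))
        self.shared_out = nn.Sequential(nn.Linear(512, 1), nn.Sigmoid())

    def forward(self, x1, x2):
        return self.shared_out(self.feat1(x1)), self.shared_out(self.feat2(x2))
```

Feeding the same random vector z to CoupledGenerators produces one frame per domain, which is how we later test whether the shared layers capture cross-domain semantics.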

3.4 Adversarial domain adaptation

In domain adaptation, we have an input space $X$ (images) and an output space $Y$ (class labels). The objective is to train a model on a source domain distribution $\mathcal{S}(x, y)$ so as to maximize its performance on a target domain distribution $\mathcal{T}(x, y)$. Both distributions are defined on $X \times Y$, where $\mathcal{T}$ is “shifted” from $\mathcal{S}$ by some domain offset.

The adversarial domain adaptation model [11] decomposes the network into three parts: a feature extractor $G_f(x; \theta_f)$, which extracts a feature representation $f$; a label predictor $G_y(f; \theta_y)$, which maps a feature vector to a label $y$; and a domain classifier $G_d(f; \theta_d)$, which maps a feature vector to either 0 or 1 according to whether it belongs to the source or target distribution, respectively. The objective is to obtain a domain-invariant feature representation: the feature extractor parameters $\theta_f$ are trained to maximize the loss of the domain classifier, while the domain classifier parameters $\theta_d$ are trained to minimize it. This is accomplished by a gradient reversal unit, which acts as the identity during forward propagation and negates gradients during backward propagation.
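In a modern framework, the gradient reversal unit can be implemented as a custom autograd function; the sketch below follows this standard recipe (the scaling factor `lam` is a common addition, not something specified above):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lam on the
    backward pass, so the feature extractor is pushed to *maximize* the
    domain classifier's loss."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input (x, lam); lam needs no gradient.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch: domain_logits = domain_classifier(grad_reverse(features, 0.5))
```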

3.5 Domain adaptation for generative models

Consider two distributions $p_1$ and $p_2$ corresponding to the two input domains, and a set of samples from each domain, $X_1 = \{x_1^{(i)}\}$ and $X_2 = \{x_2^{(i)}\}$. Each sample has a label indicating whether it is a fake or real image, and a domain label indicating whether it comes from the first or second domain. The parameters of the model are the generator parameters $\theta_{G_1}$ and $\theta_{G_2}$, the shared discriminator parameters $\theta_D$, and the domain classifier parameters $\theta_C$; the network architecture is shown in Figure 2(b). Training optimizes an energy function with two kinds of terms: a cross-entropy loss $L_{CE}$ on the discriminator’s real/fake predictions, and a binary log loss $L_d$ on the domain classifier’s predictions. Algorithm 1 shows the steps involved in this optimization. In the experimental results, we examine the results of applying this technique to finding high-level semantic alignments between the two domains of The Simpsons and Family Guy.

1: Given: minibatch size $m$, learning rate $\eta$, iteration count $T$, and randomly-initialized $\theta_{G_1}$, $\theta_{G_2}$, $\theta_D$, $\theta_C$.
2: for $t = 1$ to $T$ do
3:    Update the discriminator parameters $\theta_D$, where real images of both domains have label 1 and fake (generated) images have label 0,
      $\theta_D \leftarrow \theta_D - \eta \, \nabla_{\theta_D} \big[ L_{CE}(D(x_1), 1) + L_{CE}(D(x_2), 1) + L_{CE}(D(G_1(z)), 0) + L_{CE}(D(G_2(z)), 0) \big]$.    (1)
4:    Update the generative models for the two domains independently, by propagating the cross-entropy from the discriminative network and treating images produced by each generator as real,
      $\theta_{G_1} \leftarrow \theta_{G_1} - \eta \, \nabla_{\theta_{G_1}} L_{CE}(D(G_1(z)), 1)$,    (2)
      $\theta_{G_2} \leftarrow \theta_{G_2} - \eta \, \nabla_{\theta_{G_2}} L_{CE}(D(G_2(z)), 1)$.    (3)
5:    Update the classifier parameters $\theta_C$ with real samples from both domains, where class labels indicate domain, and propagate the error through the classifier and the shared discriminator parameters,
      $\theta_C \leftarrow \theta_C - \eta \, \nabla_{\theta_C} \big[ L_d(C(x_1), 1) + L_d(C(x_2), 2) \big]$.    (4)
6:    Repeat with fake generated images,
      $\theta_C \leftarrow \theta_C - \eta \, \nabla_{\theta_C} \big[ L_d(C(G_1(z)), 1) + L_d(C(G_2(z)), 2) \big]$.    (5)
Algorithm 1: Training with domain adaptation.
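A sketch of one iteration of Algorithm 1 in PyTorch follows. Here we assume D returns a pair (real/fake probability, shared features), C maps the shared features to a domain probability (domain 1 encoded as 0, domain 2 as 1), and `opt_c` is built over both the classifier and the shared discriminator layers so that the domain error propagates through both, as steps 5 and 6 describe; the loss weighting and optimizer settings are illustrative only.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_iteration(G1, G2, D, C, x1_real, x2_real,
                    opt_d, opt_g1, opt_g2, opt_c, z_dim=100):
    """One iteration of the domain-adaptation training loop (sketch)."""
    m = x1_real.size(0)
    ones, zeros = torch.ones(m, 1), torch.zeros(m, 1)
    x1_fake = G1(torch.randn(m, z_dim))
    x2_fake = G2(torch.randn(m, z_dim))

    # Step 3: discriminator -- real images of both domains labeled 1, fakes 0.
    opt_d.zero_grad()
    loss_d = bce(D(x1_real)[0], ones) + bce(D(x2_real)[0], ones) + \
             bce(D(x1_fake.detach())[0], zeros) + bce(D(x2_fake.detach())[0], zeros)
    loss_d.backward()
    opt_d.step()

    # Step 4: generators -- treat generated images as real.
    opt_g1.zero_grad()
    bce(D(x1_fake)[0], ones).backward()
    opt_g1.step()
    opt_g2.zero_grad()
    bce(D(x2_fake)[0], ones).backward()
    opt_g2.step()

    # Step 5: domain classifier on real samples (domain labels 0 and 1 here).
    opt_c.zero_grad()
    (bce(C(D(x1_real)[1]), zeros) + bce(C(D(x2_real)[1]), ones)).backward()
    opt_c.step()

    # Step 6: repeat with fake generated images.
    opt_c.zero_grad()
    (bce(C(D(x1_fake.detach())[1]), zeros) + bce(C(D(x2_fake.detach())[1]), ones)).backward()
    opt_c.step()
```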

4 Experimental Results

Simpsons Family Guy

Figure 3: Sample frames generated by two independently-trained cartoon series models, at two different resolutions.

We now present results on generating images across our two domains of interest: frames from the cartoon series The Simpsons and Family Guy.

4.1 Datasets

To build a large-scale dataset, we took about 25 hours of DVDs corresponding to every episode from four years of each program (The Simpsons seasons 7–10, Family Guy seasons 1–4). We performed screen captures at a fixed sampling rate and a resolution of 1280×800, yielding about 400 snapshots per episode, or about 30,000 frames per series in total.
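A frame sampler along the following lines can produce this kind of dataset (a sketch using OpenCV; the sampling interval and output format are illustrative, not the exact settings we used):

```python
import cv2
import os

def sample_frames(video_path, out_dir, every_n_seconds=3.0, size=(1280, 800)):
    """Grab one frame every `every_n_seconds` seconds from a video file and
    save it as a PNG; interval and resolution here are placeholders."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps * every_n_seconds)))
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.resize(frame, size)
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```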

4.2 Single domain training

We began by training a separate, independent network for each of our two domains, basing the network architecture and training procedure on the publicly-available Torch implementation by Radford et al. [30]. We generated new frames by passing independently-sampled random vectors into the network’s 1024-dimensional input. Figure 3(a) presents the results. Comparing these novel frames to those in the training set (Figure 1), we observe that the network seems to have captured the overall style and appearance of the original domains: the distinctive yellow color of the characters in The Simpsons, for example, versus the paler skin tones in Family Guy.

Simpsons
Family Guy
Figure 4: Nearest training neighbors for each of nine sample generated images (left image in each pane) for a GAN model trained on each independent cartoon series.

Nearest neighbors. To what extent are the generated frames really novel, and to what extent do they simply “copy” the training images? To help visualize this, for each generated frame we found the nearest neighbors in the training set, according to Euclidean distance between the activation values of the second-to-last layer of the discriminator. Figure 4 presents the results. Intuitively, the nearest neighbors give us some information about the “inspiration” that the network used in generating a new frame. We observe, for example, that the upper-left image in the Family Guy row looks like an image of the husband and wife talking in the bedroom, but the closest images retrieved from the training dataset show them talking in the kitchen; this is explained by the fact that the two rooms have very similar appearance in the series. In the top-middle example of the Simpsons row, the network appears to have generated a novel scene with some people appearing in a TV set, whereas the closest frames in the training set show the TV displaying various kinds of text.
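A sketch of this retrieval step, assuming `generated_feats` and `training_feats` hold the second-to-last-layer discriminator activations for the generated and training frames, respectively:

```python
import numpy as np

def nearest_neighbors(generated_feats, training_feats, k=9):
    """For each generated frame's feature vector, return the indices of the
    k closest training frames by Euclidean distance in feature space."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b.
    g2 = (generated_feats ** 2).sum(axis=1, keepdims=True)      # (G, 1)
    t2 = (training_feats ** 2).sum(axis=1, keepdims=True).T     # (1, T)
    d2 = g2 + t2 - 2.0 * generated_feats @ training_feats.T     # (G, T)
    return np.argsort(d2, axis=1)[:, :k]
```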

Generating higher-resolution images. To improve the quality of the frames, we tried to generate images at a higher resolution, as shown in Figure 3(b), using the network architecture of Figure 2(c). Ironically, this network produces frames that are subjectively worse: the resolution is higher, but the model seems not to have learned the important properties of the source domain, and the diversity of generated samples is low. This appears to be a case of mode collapse, in which the network has learned to generate nearly identical images.

4.3 Coupled Domain Training

Instead of training models for the two domains in isolation, we next consider various jointly-trained models. We hypothesize that such jointly-trained models could potentially overcome the problems with generating higher-resolution images (by doubling the number of training examples), and could find semantic correspondences between frames across the two series.

(a) Frames generated by a single network trained with a mixture of Simpsons and Family Guy frames, at two resolutions.

(b) Nearest training neighbors for each of nine sample frames (left image in each pane) generated by the combined model.

Figure 5: Sample results from a single network trained on an unstructured (unlabeled) mix of the Simpsons and Family Guy data, showing (a) sample generated frames, and (b) nearest training set neighbors for some sample generated frames.

Combining the datasets. We first tried retraining the same model as before but with a dataset consisting of both series mixed together; results are shown in Figure 5 for two different resolutions. Again, the lower-resolution images seem reasonable in both appearance and diversity, and for most frames we can identify characteristics of one or both of the two series. However, the combined dataset seems not to have helped the higher-resolution frame generator, as we still see evidence of mode collapse. The second row of the figure again presents nearest neighbors in the combined training set, showing that generated frames are sometimes most similar to one dataset, and sometimes seem to synthesize a combination of the two.

The Simpsons Family Guy

Figure 6: Sample frames generated by COGAN at different resolutions (rows) for each cartoon dataset (columns).

COGANs. We next test COGANs, which explicitly model that there are multiple image domains, as shown in Figure 6. We observe that the coupled training generated images with noticeably better quality at the higher resolution, compared to the results without coupled training (Figure 3) or the single model with combined training datasets (Figure 5).

Figure 7: Sample results of paired image generation with COGAN. In each pane, the top left and bottom right images were generated by the same random input vector passed to the Simpsons and Family Guy models, respectively. The remaining images show nearest neighbors in the corresponding dataset.
[Figure 8 grid: rows 1–4 show generated Family Guy frames and rows 5–8 generated Simpsons frames; for each frame, the grid shows its 10 nearest Family Guy and Simpsons training neighbors under discriminator similarity and then under classifier similarity.]
Figure 8: Sample results of the Full Domain Adaptation model. For each generated frame (left image of each row), we show the 10 nearest neighbors in each of the two training datasets, using two different definitions of similarity.

How well has the COGAN found and modeled semantic connections between the two domains? To test this, we fed the same random input vectors into both of the two models, as shown in each pane of Figure 7. Intuitively, the images generated from the same input vector should be topically similar if the COGAN has identified meaningful semantic connections; we observe, however, that this does not appear to be the case, since the images across domains in the figure are quite different. We also show the nearest neighbors of each generated frame in the corresponding training set.

Domain adaptation. Sample results for domain adaptation are shown in Figure 8. We noticed during training that only some iterations of the model were able to generate good images for both domains; here we selected iterations that worked well. The figure also shows the nearest neighbors in each domain’s training set for each sample generated frame. We use two different features for finding nearest neighbors: (1) the second-to-last layer of the discriminator, as before, and (2) the second-to-last layer of the classifier. The results suggest that the model managed to find correspondences in the main colors of the images in both domains.

Examining the results shown in Figure 8, we note that the model seemed to find some meaningful high-level semantic alignments. (Below we index the neighbor images in each row as row-position, with positions 1–10 the Family Guy neighbors and 11–20 the Simpsons neighbors under the similarity measure being discussed.) For discriminator similarity, for example, we see alignments in terms of people in theaters or stadiums (Family Guy images 1-5 and 1-9 with Simpsons images 1-11 and 1-16), houses (images 2-1 through 2-5 with 2-11 and 2-18), a person talking against a red background (row 3), cars in a parking lot (images 5-2 and 5-6 with 5-11 and 5-13), and groups of people indoors (7-4 and 7-5 with 7-11 and 7-13 through 7-20). With the similarity measure in the classifier’s feature space, a general theme seems to be people conversing in different indoor settings, including images 1-9 and 1-10 with 1-11, 1-12, and 1-16. Images 2-4 and 2-9 seem to relate to image 2-13 in that both have people talking against a green background, and in general the green background seems to be a theme of the whole row. The third row appears to roughly correspond with conversations between pairs of people, as in images 3-1, 3-3, and 3-5 with images 3-13 through 3-16, whereas the fifth row features single characters in the scene (e.g. images 5-1, 5-4, and 5-9 with 5-11 through 5-14). Other themes include square-framed scenes (image 1-6 with 1-13 and 1-17 through 1-20) and scenes with prominent buildings (images 6-5 and 6-7 with 6-11 through 6-13).

The Simpsons Family Guy

Full Domain Adaptation

NoClassifierTraining

NoFakeClassifierTraining

NoRealClassifierTraining

LazyFakeClassifierTraining

Figure 9: Sample frames generated under different variants of the domain adaptation models.
[Figure 10 grid: rows 1–4 show generated Family Guy frames and rows 5–8 generated Simpsons frames; for each frame, the grid shows its 10 nearest Family Guy and Simpsons training neighbors under discriminator similarity and then under classifier similarity.]
Figure 10: Sample results of the NoClassifierTraining model. For each generated frame (left image of each row), we show the 10 nearest neighbors in each of the two training datasets, using two different definitions of similarity.

Domain adaptation variants. To better understand the importance of different parts of the domain adaptation technique in Algorithm 1, we tried several variants, with results in Figure 9. First, our NoClassifierTraining variant skips steps 5 and 6 to test whether the shared discriminator alone (without the classifier) is enough to achieve good mappings between domains. Figure 10 shows nearest neighbors for some sample frames generated by this variant of the model. For the discriminator similarities, we see much less evidence of semantic correspondence between frames than with the full model, although there is some in terms of simple features like overall color. For example, row 1 has a yellow theme (e.g. images 1-7 and 1-8 with 1-12, 1-13, 1-19, and 1-20), row 2 has a violet theme (images 2-1 through 2-3 and 2-5 through 2-6 with 2-11, 2-17, and 2-20), rows 4 and 8 feature dark gray backgrounds, and row 5 has a combination of yellow and blue colors. The classifier similarities are not useful here, of course, since the classifier has not been trained, and they resemble clusterings of random vectors.

[Figure 11 grid: rows 1–4 show generated Family Guy frames and rows 5–8 generated Simpsons frames; for each frame, the grid shows its 10 nearest Family Guy and Simpsons training neighbors under discriminator similarity and then under classifier similarity.]
Figure 11: Sample results of the NoFakeClassifierTraining model. For each generated frame (left image of each row), we show the 10 nearest neighbors in each of the two training datasets, using two different definitions of similarity.

NoFakeClassifierTraining includes step 5 but skips step 6, so that the classifier is trained only with real images; results are shown in Figure 11. These results suggest the technique has once again found some higher-level semantic alignments in both similarity spaces, including the clouds of dust or smoke in images 1-8, 1-12, and 1-14, the large groups of people in 1-1, 1-6, 1-9, and 1-10 with 1-20, the paper documents in images 5-1 through 5-10 with 5-11, 5-16, and 5-19, the single characters against a blue background in row 6, and the sky-colored background with white foreground objects like clouds in rows 7 and 8. Under classifier similarities, we also see some semantic themes, including several rows that seem to be cuing on certain facial reactions (e.g. images 5-1, 5-7, 5-9, and 5-10 with 5-12, 5-16, 5-17, and 5-19).

[Figure 12 grid: rows 1–4 show generated Family Guy frames and rows 5–8 generated Simpsons frames; for each frame, the grid shows its 10 nearest Family Guy and Simpsons training neighbors under discriminator similarity and then under classifier similarity.]
Figure 12: Sample results of the NoRealClassifierTraining model. For each generated frame (left image of each row), we show the 10 nearest neighbors in each of the two training datasets, using two different definitions of similarity.

NoRealClassifierTraining skips step 5 but not step 6, so that the classifier is trained only with synthetic images. As shown in Figure 12, the model generated images with good correspondences under both similarity measures, which suggests that the mapping is many-to-many rather than bijective. Examples of semantic-level alignments include frames with single humans against a blue background (images 2-1 through 2-8 with 2-13, 2-14, 2-16, and 2-20), two characters talking (images 3-1 through 3-10 with 3-13, 3-15, and 3-20), large crowds of people in row 4, and similar color themes in the remaining rows. The classifier similarity results seem to find mostly alignments based on similar character configurations and activities.

[Figure 13 grid: rows 1–4 show generated Family Guy frames and rows 5–8 generated Simpsons frames; for each frame, the grid shows its 10 nearest Family Guy and Simpsons training neighbors under discriminator similarity and then under classifier similarity.]
Figure 13: Sample results of the LazyFakeClassifierTraining model. For each generated frame (left image of each row), we show the 10 nearest neighbors in each of the two training datasets, using two different definitions of similarity.

Finally, LazyFakeClassifierTraining skips step 6 during the initial few iterations, so that the classifier only trains with fake images once they start becoming realistic. Examining the results shown in Figure 13, we again see relatively good high-level semantic alignments, including interactions between two main characters (images 1-5 and 1-10 with 1-11 through 1-20), blue-green color schemes (row 2), large groups of people indoors (row 3), and two characters interacting (images 8-2 and 8-3 with 8-11 through 8-20). Particularly interesting are rows 5 and 6, where many images correspond with scenes appearing on a TV screen (and thus framed by a rectangular “viewing window”). Correspondences in the classifier similarity space are also readily apparent.

We note that when the two datasets are trained jointly but kept distinct (as opposed to mixing them into a single unlabeled dataset), the model manages to generate samples that preserve the different color styles of the two series, as shown in Figure 14. When trained on the single mixed dataset, the model finds it hard to build a joint color space for both domains and so alternates between them. The first row of Figure 14 shows samples generated after the first training epoch of the Co-GAN model for Family Guy and The Simpsons, respectively, while the second row shows samples from the domain adaptation model. Both rows show that the network distinguishes the two domains from the first epoch. The third row of the figure shows the first and second epochs of image generation for the combined dataset; here the model tries to find a common color space between the two domains, and the results change drastically between the two epochs.

Simpsons Family Guy
First COGAN iteration
First DA iteration
First iteration Second iteration
Simpsons + Family Guy
Figure 14: Samples generated by various training iterations.

4.4 An application

A potential direct application of our model is cross-domain image retrieval; in this case, finding semantically-similar episodes across two different cartoon series. We test both the discriminator and classifier feature-based similarities for the FullDomainAdaptation model. We view each episode as a bag of frames, and then, given an episode in one domain and an episode in the other, we calculate a distance measure: for every frame in the first episode, we find the closest frame in the second episode, with each frame allowed to be paired to at most one frame from the other episode. The distance between the two episodes is the minimum over these closest-frame distances.
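One reading of this measure can be sketched as follows, where `feats_a` and `feats_b` are arrays of per-frame feature vectors (from either the discriminator or classifier space) for the two episodes; the greedy pairing order is our own choice and is not specified above:

```python
import numpy as np

def episode_distance(feats_a, feats_b):
    """Distance between two episodes, each given as an (n_frames, dim) array.
    Each frame of episode A is greedily paired with its closest unused frame
    of episode B, and the episode distance is the minimum matched-pair distance."""
    d = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=2)  # (A, B)
    used = set()
    pair_dists = []
    # Process A-frames in order of their best match so cheap pairs are assigned first.
    for i in np.argsort(d.min(axis=1)):
        candidates = [(d[i, j], j) for j in range(d.shape[1]) if j not in used]
        if not candidates:
            break
        dist, j = min(candidates)
        used.add(j)
        pair_dists.append(dist)
    return min(pair_dists) if pair_dists else np.inf
```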

Figure 15 shows four sample retrieval results for both the discriminator similarity measure (top) and the classifier similarity measure (bottom). Within each result, the top row shows sample frames from a Family Guy query video, and the bottom shows the matched frames from the most similar Simpsons episode. While we do not have ground truth to evaluate quantitatively, we see that the retrieved results do share similarities in terms of overall scene composition. Examining the results for the discriminator similarity, for example, the episodes in the first example both feature “synthetic”-looking 3D models of characters, while the episodes in the second example feature objects against a blue sky and close-ups of people. The third example shows indoor rooms from a specific viewpoint, and the fourth seems to match black-and-white scenes in Family Guy with cowboy scenes in The Simpsons. Meanwhile, the classifier similarity measure seems to have found similar episodes in terms of the appearance of main characters and similar background colors.

Discriminator Similarity
Classifier Similarity
Figure 15: Cross-domain similar episode retrieval examples using both discriminator (top) and classifier (bottom) similarity. In each panel, the top row shows a subset of frames from a query video, and the bottom shows the corresponding matched frames from the most similar video in the other domain.

5 Conclusion and Future Work

We have studied finding high-level semantic mappings between cartoon frames using GANs, as a first step towards finding general semantic connections between videos. We show that this problem is many-to-many rather than bijective: the same scene in one domain can map to different scenes in the other domain, and each mapping can carry its own high-level semantic meaning. We also show that some models can find reasonable high-level semantic alignments between the two domains. Our work also shows, however, that this is still an open research problem. Future work should consider the temporal dimension of video, as well as adding other modalities like subtitles and audio. Beyond the insight that our analysis gives about GANs, it also creates the opportunity for interesting applications in the specific domain of cartoons. For example, a common practice among fans is to create correspondences between live-action movies and TV cartoon series [7, 26], such as parodies. Our work raises the intriguing possibility that such mappings between domains could be created completely automatically by GAN models. By training on TV series as opposed to individual images, it may even be possible to sample entirely new story lines, generating new episodes that fit the stylistic conventions of a given series, completely automatically!

References