MemeFaceGenerator: Adversarial Synthesis of Chinese Meme-face from Natural Sentences

08/14/2019 ∙ by Yifu Chen, et al. ∙ Tencent ∙ Peking University

The Chinese meme-face is a special kind of internet subculture widely spread in Chinese Social Community Networks. It usually consists of a template image modified with some amusing details and a text caption. In this paper, we present MemeFaceGenerator, a Generative Adversarial Network that uses an attention module and template information as supplementary signals to automatically generate meme-faces from text inputs. We also develop a web service as a system demonstration of meme-face synthesis. MemeFaceGenerator is shown to be capable of generating high-quality meme-faces from arbitrary text inputs.




1 Introduction

In Chinese Social Community Networks (SNS), hand-made meme-faces have become a new fashion and even a special kind of pop culture. Unlike the classic emoji widely used in apps such as WhatsApp and Twitter Radpour and Bheda (2017); Cunha et al. (2018), the meme-faces used on Chinese SNS (e.g., WeChat, Weibo) can be derived from various meaningful images to express more abstract emotions.

Manually creating a meme-face is actually a re-creation process based on an existing image from the web, which is one of the reasons for its capacity to express emotion. Generally, as illustrated in Figure 1, creating a meme-face consists of the following operations: a) picking a meaningful image and changing its details to make it amusing (e.g., switching the face to that of a celebrity); and b) given the text to be presented, further adding or changing details to align the semantics of the image with the text. Throughout this procedure, the computer serves only as an image editing tool and contributes nothing to the creative inspiration.

Figure 1: Illustration of the creation of a hand-made meme-face, taking the input “Stop it I am blushing” (translated) as the example.

It should be noted that learning to bridge the semantics between images and natural language has recently become an active research area, driving studies on multi-modal learning over vision and natural language Antol et al. (2015); Reed et al. (2016c); Yang et al. (2016); Reed et al. (2016a). In particular, since the emergence and wide adoption of Generative Adversarial Networks (GANs) Goodfellow et al. (2014); Mirza and Osindero (2014), directly generating images or emojis from natural language descriptions via end-to-end learning has become achievable Reed et al. (2016b); Zhang et al. (2017a, b); Radpour and Bheda (2017); Xu et al. (2018).

Figure 2: The architecture of the proposed End-to-End Meme-Face Generator. Due to the generation difficulty brought by the implicit semantic relevance of the text and image within a Chinese meme-face, we propose to introduce templates into each generation step.

This paper aims at generating a meme-face directly from the given text input, which is in theory similar to the tasks discussed in Reed et al. (2016b); Zhang et al. (2017a, b); Dash et al. (2017); Xu et al. (2018), the only difference being the type of data. However, this difference poses a notable challenge to the model, since the image part of a hand-made Chinese meme-face is often related to the text part only at the level of overall semantics. For the images in classic datasets such as the widely used COCO dataset Lin et al. (2014) and the CUB dataset Wah et al. (2011), by contrast, the captions describe the images by maintaining lexicon-object matching relationships. Overall, the semantic relationship between the text and the image part of such a meme-face goes beyond lexicon-object correspondence.

For the Chinese meme-face generation task, this paper applies a GAN architecture with an attention module to capture the complicated relationships between texts and image regions Xu et al. (2018). In particular, to address the latent semantic relevance and improve generation, this paper proposes to adopt the basic patterns in meme-face templates as a supplementary signal. This forces the generator to focus on modifying the essential local regions so that they semantically match the given text caption, while the majority of the pattern stays fixed. A demonstration of our system is released online, with details presented in the remaining sections.

2 Approach

2.1 Model Overview

Based on the current progress on text-to-image generation models, we propose a GAN architecture with an attention module, named MemeFaceGenerator, to generate a meme-face from a pattern representing the semantics of the given text. As shown in Figure 2, the stacked attentional generative network (AttnGAN) Xu et al. (2018) is first employed to model the text into visual-semantic representations. It comprises the Conditioning Augmentation, an attention model that obtains a new sentence embedding at each stage, and the generators of the AttnGAN. Based on these representations and differently scaled patterns, the editing component is proposed to generate new meme-faces that are aligned with the text and close to the pattern. At each stage, the editing network down-samples the visual-semantic representation and the pattern of that stage into representations with the same dimension, then concatenates them and applies a Multilayer Perceptron (MLP) to integrate the information from both the text and the pattern. Finally, it utilizes a series of up-sampling blocks to generate images at small-to-large scales, one per stage. Besides the generator, the architectures of the discriminators and of the text-image matching network, the Deep Attentional Multimodal Similarity Model (DAMSM), are inherited from the AttnGAN.
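The data flow of the editing component (down-sample, concatenate, MLP, up-sample) can be sketched with a dependency-free toy. This is only an illustration under stated assumptions: the function names, grid sizes, pooling scheme and fixed weights below are all hypothetical and are not taken from the paper, whose networks are learned convolutional modules.

```python
# Toy sketch of the editing step. All names, sizes and weights are
# hypothetical; they illustrate the data flow only, not the paper's network.

def avg_pool(grid, out_size):
    """Down-sample a square 2-D grid to out_size x out_size by block averaging."""
    n = len(grid)
    assert n % out_size == 0, "toy pooling needs an integer block size"
    k = n // out_size
    return [
        [
            sum(grid[r * k + dr][c * k + dc] for dr in range(k) for dc in range(k)) / (k * k)
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]

def flatten(grid):
    """Turn a 2-D grid into a flat vector, row by row."""
    return [value for row in grid for value in row]

def mlp(vec, weights, bias):
    """One fully connected layer followed by ReLU."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
        for row, b in zip(weights, bias)
    ]

def upsample_nearest(grid, factor):
    """Nearest-neighbour up-sampling of a square 2-D grid."""
    n = len(grid) * factor
    return [[grid[r // factor][c // factor] for c in range(n)] for r in range(n)]

def edit_step(text_features, pattern, size=2, factor=2):
    """Fuse text-derived features with a template pattern, then up-sample."""
    t = flatten(avg_pool(text_features, size))  # same dimension as p
    p = flatten(avg_pool(pattern, size))
    fused_in = t + p                            # concatenation
    d = size * size
    # Hypothetical fixed weights: each output averages the text value and
    # the pattern value at the same spatial position.
    weights = [[0.5 if j in (i, i + d) else 0.0 for j in range(2 * d)] for i in range(d)]
    fused = mlp(fused_in, weights, [0.0] * d)
    grid = [fused[r * size:(r + 1) * size] for r in range(size)]
    return upsample_nearest(grid, factor)

text_features = [[1.0] * 4 for _ in range(4)]
pattern = [[3.0] * 4 for _ in range(4)]
result = edit_step(text_features, pattern)
```

In a real model the fixed averaging weights would be replaced by learned parameters, and the pattern would constrain most of the output while the text-derived features drive local changes.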

Correspondingly, the objective function of the whole generator is defined over the sentence vector, which is modeled by a biLSTM layer, together with a DAMSM loss. The DAMSM loss is computed by a DAMSM pretrained on the Chinese meme-face dataset and evaluates the matching degree between the given text and the generated images. In the overall objective function used for the training procedure, the real image is drawn from the true data distribution and belongs to the class represented by the pattern, and the noise vector is sampled from a uniform distribution.
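As a hedged sketch only, and assuming the generator objective keeps the form of the AttnGAN objective of Xu et al. (2018), from which the discriminators and the DAMSM are inherited, it could be written as below. The symbols ($L_{G_i}$, $D_i$, $\hat{x}_i$, $\bar{e}$, $\lambda$, $m$) follow AttnGAN's notation and are assumptions here. How the pattern enters each term is not specified in this section, so it is left out.

```latex
L = L_G + \lambda L_{DAMSM}, \qquad L_G = \sum_{i=0}^{m-1} L_{G_i}

L_{G_i} = -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log D_i(\hat{x}_i)\big]
          -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log D_i(\hat{x}_i, \bar{e})\big]
```

Here $\bar{e}$ is the sentence vector, $\hat{x}_i$ is the image generated at stage $i$, $m$ is the number of stages, and $\lambda$ weights the DAMSM loss. The first term of $L_{G_i}$ is the unconditional adversarial loss and the second is the text-conditioned one.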

2.2 Data Preparation

As part of the experiment preparation, 56,710 meme-faces of various kinds are first collected online, and their text captions are automatically extracted via an OCR engine. To obtain representative and popular meme-faces, we further apply a series of data-cleaning steps based on the categories of the meme-faces and the words in their text captions. More specifically:

  • Since most Chinese meme-faces are made from a small set of templates (such as the panda face in Figure 1), clustering is used to find meme-faces that share a template. Here we use a pre-trained Inception-v3 network Szegedy et al. (2016) to first compute image vectors for all meme-faces, and then use k-means for image clustering. In total, 33 representative categories of meme-faces are obtained after removing the outliers.

  • To refine the quality of the text captions in the dataset, a language model (trained on the text captions) is used to filter out those carrying too little information (e.g. Yeap, Hurry Up, What) by setting a proper range of perplexity. In addition, we constrain the length of the text captions, keeping only those whose length is between 3 and 12.
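The caption-cleaning step can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not say which language model is used, so an add-one-smoothed character unigram model stands in for it; the perplexity range is a hypothetical placeholder; and caption length is assumed to be counted in characters.

```python
import math
from collections import Counter

def train_char_unigram(captions):
    """Fit an add-one-smoothed character unigram model (a stand-in language model)."""
    counts = Counter(ch for caption in captions for ch in caption)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen characters
    return lambda ch: (counts.get(ch, 0) + 1) / (total + vocab)

def perplexity(caption, prob):
    """Per-character perplexity of a non-empty caption under the model."""
    log_sum = sum(math.log(prob(ch)) for ch in caption)
    return math.exp(-log_sum / len(caption))

def filter_captions(captions, prob, min_len=3, max_len=12, ppl_range=(1.0, 50.0)):
    """Keep captions whose length and perplexity both fall inside the given ranges.

    min_len and max_len follow the text (3 to 12); ppl_range is a hypothetical
    value, since the actual perplexity range is not given.
    """
    kept = []
    for caption in captions:
        # The length check runs first, so empty captions never reach perplexity().
        if not (min_len <= len(caption) <= max_len):
            continue
        if ppl_range[0] <= perplexity(caption, prob) <= ppl_range[1]:
            kept.append(caption)
    return kept

captions = ["好", "哇，不错哦", "a" * 13]
model = train_char_unigram(captions)
kept = filter_captions(captions, model, ppl_range=(1.0, 1e9))
```

With the wide perplexity range used in this example, only the length constraint is active, so the one-character and thirteen-character captions are dropped.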

Finally, we crop the meme-face pictures to trim away the text captions. The above data preparation process yields a total of 2,955 meaningful meme-faces for training (90%) and testing (10%).
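The 90/10 split can be sketched as below. The shuffling, the seed and the rounding rule are assumptions, as the text only gives the total and the percentages.

```python
import random

def split_dataset(items, train_percent=90, seed=0):
    """Shuffle and split items into train and test parts.

    Integer arithmetic (floor) is used for the train size; the paper does not
    state how the 90% boundary is rounded, so this is an assumption.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_percent // 100
    return items[:n_train], items[n_train:]

train, test = split_dataset(range(2955))
print(len(train), len(test))  # 2659 296
```

Under this rounding, the 2,955 meme-faces give 2,659 training and 296 test examples.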

2.3 Training Details

For model training, our MemeFaceGenerator is optimized using Adam Kingma and Ba (2014) with a learning rate of 0.0002 and a batch size of 14. The model is trained for 200 epochs in total. Throughout the training process, we update the generator every five epochs and the discriminator every epoch. All neural networks are constructed with PyTorch 0.3.0 Paszke et al. (2017).
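The update schedule described above can be written out as a small, framework-free sketch. The hyperparameter values come from the text; the dictionary keys, the function name and the choice to count epochs from 1 are assumptions, and no actual optimizer is invoked.

```python
# Values from the text; the key names are hypothetical.
HYPERPARAMS = {"optimizer": "Adam", "learning_rate": 0.0002, "batch_size": 14, "epochs": 200}

def update_schedule(total_epochs=200, generator_every=5, discriminator_every=1):
    """List, per epoch, which network is updated under the stated schedule.

    Epochs are counted from 1 here, which is an assumption.
    """
    schedule = []
    for epoch in range(1, total_epochs + 1):
        schedule.append({
            "epoch": epoch,
            "update_discriminator": epoch % discriminator_every == 0,
            "update_generator": epoch % generator_every == 0,
        })
    return schedule

schedule = update_schedule(HYPERPARAMS["epochs"])
generator_epochs = [s["epoch"] for s in schedule if s["update_generator"]]
discriminator_epochs = [s["epoch"] for s in schedule if s["update_discriminator"]]
print(len(generator_epochs), len(discriminator_epochs))  # 40 200
```

Under this reading, the discriminator is updated in all 200 epochs and the generator in 40 of them.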

Figure 3: Overview of the system demonstration. Its layout consists of an input box at the top of the page, the sequence of images generated every five epochs on the left side of the page, and the corresponding system log on the right side of the page.

3 System Demonstration

As depicted in Figure 3, the web server of our system demonstration is constructed using Vue and Tornado (a screen recording of the web page accompanies the demonstration). After the text is typed into the top input box, the web server first converts the input text from the front-end into JSON format and posts it to the back-end as the model input. The model loads the parameters saved every five epochs, up to the final round, and generates the corresponding meme-face for each loaded epoch using the text as input. We then post the generated meme-faces sequentially to the web front-end to demonstrate how an output meme-face varies throughout the training process, together with a system log containing the elapsed time and checkpoint information. All meme-faces are 256x256 pixels and encoded in base64.
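The message flow between front-end and back-end can be sketched with the Python standard library alone. This is a sketch under stated assumptions: the JSON field names (`text`, `epoch`, `image`, `log`), the log format and the helper names are hypothetical, the image bytes are a stand-in for real model output, and no Vue or Tornado code is shown.

```python
import base64
import json

def build_request(text):
    """Serialize the input text as the JSON body posted to the back-end."""
    return json.dumps({"text": text}, ensure_ascii=False)

def checkpoint_epochs(total_epochs=200, every=5):
    """Epochs whose saved parameters are loaded (every five epochs up to the last)."""
    return list(range(every, total_epochs + 1, every))

def build_response(epoch, image_bytes, elapsed_seconds):
    """Package one generated meme-face as base64 text plus a log line."""
    return json.dumps({
        "epoch": epoch,
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "log": f"checkpoint epoch {epoch}, elapsed {elapsed_seconds:.2f}s",
    })

request_body = build_request("哇，不错哦")
fake_image = b"\x89PNG placeholder bytes"  # stands in for a 256x256 image
responses = [build_response(epoch, fake_image, 0.1 * i)
             for i, epoch in enumerate(checkpoint_epochs(), start=1)]
decoded = base64.b64decode(json.loads(responses[0])["image"])
```

Encoding the image as base64 lets it travel inside a JSON string, at the cost of roughly a one-third increase in size.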

The generated image sequence on the left side of the web page gives a visual hint of how MemeFaceGenerator learns throughout training. Taking the case in Figure 3 as an example, after the template information of the panda face is introduced, the model rapidly learns the profile and general features of the meme-face. It then starts to focus on generating a more detailed facial expression, and gradually adjusts that expression to match the input text. In the end, MemeFaceGenerator synthesizes a smiling face that is semantically relevant to the input text “Wow, not bad”.

4 Analysis

4.1 Case Study

Figure 4: Six generated meme-faces from test set from text inputs (translated): (a) lil cutie I’m here.; (b) just go, dare you hit the road don’t come back.; (c) Best drama queen of the year.; (d) Nothing important, I’ll hang up first.; (e) listen to me, stop arguing and put up a fight.; (f) Wow, not bad.

Ideally, a meme-face generator should be capable of generating an image consistent with its text caption; in other words, the generated image is expected to be semantically relevant to the text caption. To verify this capability, several generated meme-faces from the test set are selected in this section to analyze how accurately and reasonably the proposed MemeFaceGenerator captures semantic relevance.

It can be observed from Figure 4(a) that, given the text input “lil cutie I’m here”, the GAN architecture indeed captures the semantics of “lil cutie”, reflected by the blushing cheeks in the image. Similarly, in Figure 4(d), our meme-face generator extracts the phrase “hang up” (the phone) from the text input and generates a telephone receiver in the image accordingly.

In addition, beyond capturing relatively straightforward semantics between the text inputs and the images, Figure 4(b) and Figure 4(e) suggest that the proposed meme-face generator also apprehends more latent semantics between inputs and outputs. Given the text inputs “just go, dare you hit the road don’t come back” and “listen to me, stop arguing and put up a fight”, the generator extracts the sentiments of sadness and provocation from the text and generates meme-faces with tears and a provoking (shouting) posture to represent them.

Apart from the face details, the meme-face generator also learns the relevance between the faces in different templates and the text sentiments through training. For example, Figure 4(c) and Figure 4(f) show the famous meme smiles of wrestler D’Angelo Dinero and actor Choi Sung-kook, which are consistent with the ironic and complimentary emotions in their corresponding text inputs “Best drama queen of the year” and “Wow, not bad”.

Overall, the synthesizing results indicate that the visual-semantic representations indeed capture various kinds of semantics from the text input. With the help of the stacked attentional generative network, MemeFaceGenerator is capable of generating images with details and emotions agreeing with the text inputs. Such synthesizing results can be directly used in daily communication in social community networks without further manual editing.

4.2 Numerical Evaluation

To further analyze the quality of the generated results quantitatively, the generated meme-faces in the test set are cross-evaluated by three annotators under the following labeling criteria:

  • 0: the quality of image is poor or inconsistent with the text caption.

  • 1: the quality of image is acceptable but the image itself is not closely relevant to its text caption.

  • 2: the image is interesting in terms of its content and matches with its text caption as well.

Overall, the ratios of images labelled as 2, 1 and 0 are 38.8%, 43.4% and 17.8% respectively. That is, 82.2% of the generated images in the test set are annotated as 1 or higher, which further supports the capability of MemeFaceGenerator to model the latent semantic relevance between the input text and the output image.
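The reported percentages can be checked, and the same kind of label distribution computed, with a few lines of Python. The helper name and the toy labels are hypothetical; how the three annotators' labels are aggregated into one label per image is not stated, so the function simply takes one label per image.

```python
from collections import Counter

def label_ratios(labels):
    """Percentage of images per label (0, 1, 2), given one label per image."""
    counts = Counter(labels)
    n = len(labels)
    return {label: 100.0 * counts.get(label, 0) / n for label in (0, 1, 2)}

# Percentages reported in the text.
reported = {2: 38.8, 1: 43.4, 0: 17.8}
acceptable_or_better = round(reported[2] + reported[1], 1)
print(acceptable_or_better)  # 82.2

# Toy example with made-up labels, only to show the function's output shape.
toy = label_ratios([2, 2, 1, 0])
```

The three reported ratios sum to 100%, and the share labelled 1 or higher is 38.8% + 43.4% = 82.2%.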

5 Conclusions

In this paper, we explore the Chinese meme-face generation task using a GAN architecture with an attention module. To improve the quality of the generated results, meme-face template information is utilized during model training, and a careful data-cleaning process is also implemented. Our proposed MemeFaceGenerator is shown to generate meme-faces consistent with the text inputs. It can capture multiple types of semantics between generated images and text captions, as reflected by the various details and facial expressions in the generated meme-faces.