In Chinese social networking services (SNS), hand-made meme-faces have become a new fashion and even a special kind of pop culture. Different from the classic emojis widely used in apps such as WhatsApp (https://www.whatsapp.com/) and Twitter (https://twitter.com/) Radpour and Bheda (2017); Cunha et al. (2018), the meme-faces used in Chinese SNS (e.g., WeChat, https://weixin.qq.com/; Weibo, https://www.weibo.com/) can be created from various meaningful images to express more abstract emotions.
Manually creating a meme-face is in fact a re-creation process based on an existing image from the web, which partly explains its capacity for presenting emotion. As illustrated in Figure 1, creating a meme-face generally consists of the following operations: a) picking a meaningful image and altering its details to make it amusing (e.g., swapping in a celebrity's face); and b) given a text to be presented, further adding or changing details so that the semantics of the image align with the text. Throughout this procedure, the computer serves only as an image-editing tool, without any inspiration-related function.
It should be noted that learning to bridge the semantics of images and natural language has recently become an active research area, driving studies on multi-modal learning over vision and language Antol et al. (2015); Reed et al. (2016c); Yang et al. (2016); Reed et al. (2016a). In particular, since the emergence and wide adoption of Generative Adversarial Networks (GANs) Goodfellow et al. (2014); Mirza and Osindero (2014), directly generating images or emojis from given natural language descriptions via end-to-end learning has become achievable Reed et al. (2016b); Zhang et al. (2017a, b); Radpour and Bheda (2017); Xu et al. (2018).
This paper aims at generating a meme-face directly from a given text input, which is theoretically similar to the tasks discussed in Reed et al. (2016b); Zhang et al. (2017a, b); Dash et al. (2017); Xu et al. (2018), differing only in the type of data. However, this difference poses a notable challenge to the model, since the image part of a hand-made Chinese meme-face is often related to its text part only in terms of abstract semantics. By contrast, for images in classic datasets such as the widely used COCO dataset Lin et al. (2014) and the CUB dataset Wah et al. (2011), the captions describe the images by maintaining lexicon-object matching relationships. Overall, the semantic relationship between the text and the image of such a meme-face goes beyond lexicon-object correspondence.
For the Chinese meme-face generation task, this paper applies a GAN architecture with an attention module to exploit and capture the complicated relationships between texts and image regions Xu et al. (2018). In particular, to address this latent semantic relevance and enhance generation, we propose adopting the basic patterns in meme-face templates as a supplementary signal, which forces the generator to focus on modifying the essential local regions so as to semantically match the given text caption, while the majority of the pattern stays fixed. A demonstration of our system is released online, with details presented in the remaining sections.
2.1 Model Overview
Based on current progress on text-to-image generation models, we propose a GAN architecture with an attention module, named MemeFaceGenerator, to generate a meme-face from a pattern representing the semantics of the given text. As shown in Figure 2, the stacked attentional generative network Xu et al. (2018) is first employed to model the text into visual-semantic representations $h_0, h_1, \ldots, h_{m-1}$:

$h_0 = F_0\big(z, F^{ca}(\bar{e})\big), \qquad h_i = F_i\big(h_{i-1}, F_i^{attn}(e, h_{i-1})\big), \qquad \hat{x}_i = G_i(h_i),$

where $F^{ca}$ is the Conditioning Augmentation, $F_i^{attn}$ stands for the attention model obtaining the new sentence embedding at the $i$-th stage, and $G_i$ represents the generators of the AttnGAN. Based on these representations and patterns at different scales, an editing component is proposed to generate new meme-faces that are aligned with the text and close to the pattern. Let $p_i$ stand for the pattern at stage $i$; the network down-samples $p_i$ and $h_i$ into representations of the same dimension, then concatenates them and applies a Multilayer Perceptron (MLP) to integrate the information from both the text and the pattern. Finally, $G_i$ utilizes a series of up-sampling blocks to generate images of small-to-large scales ($\hat{x}_0, \ldots, \hat{x}_{m-1}$, where $m$ is the number of stages). Besides the generators $G_i$, the architectures of the discriminators $D_i$ and the text-image matching network, the Deep Attentional Multimodal Similarity Model (DAMSM), are inherited from the AttnGAN.
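The pattern-conditioned editing step described above can be sketched as follows. This is a minimal PyTorch illustration, not the paper's exact architecture: all layer sizes are assumptions, and the MLP is realized with 1×1 convolutions so it applies per spatial position.

```python
import torch
import torch.nn as nn

class PatternEditBlock(nn.Module):
    """Sketch of the stage-i editing component: the pattern p_i and the
    hidden representation h_i are down-sampled to a common size,
    concatenated, fused by an MLP (1x1 convolutions here), and
    up-sampled back to the stage resolution. Layer widths are
    illustrative assumptions."""
    def __init__(self, h_dim=32, pat_ch=3, fused_dim=64):
        super().__init__()
        # down-sample the pattern image and the hidden representation
        self.down_pat = nn.Sequential(
            nn.Conv2d(pat_ch, h_dim, 4, stride=2, padding=1), nn.ReLU())
        self.down_h = nn.Sequential(
            nn.Conv2d(h_dim, h_dim, 4, stride=2, padding=1), nn.ReLU())
        # MLP over the concatenated text- and pattern-derived features
        self.mlp = nn.Sequential(
            nn.Conv2d(2 * h_dim, fused_dim, 1), nn.ReLU(),
            nn.Conv2d(fused_dim, h_dim, 1), nn.ReLU())
        # up-sampling block restoring the stage resolution
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(h_dim, h_dim, 3, padding=1), nn.ReLU())

    def forward(self, h, pattern):
        z = torch.cat([self.down_h(h), self.down_pat(pattern)], dim=1)
        return self.up(self.mlp(z))

# usage: a 64x64 stage with a 32-channel hidden representation
block = PatternEditBlock()
h = torch.randn(2, 32, 64, 64)
pattern = torch.randn(2, 3, 64, 64)
out = block(h, pattern)
```

The fused output keeps the spatial size of the input stage, so it can feed the next generator stage unchanged.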
Correspondingly, the objective function of the whole generator is defined as:

$\mathcal{L} = \mathcal{L}_G + \lambda \mathcal{L}_{DAMSM}, \qquad \mathcal{L}_G = \sum_{i=0}^{m-1} \mathcal{L}_{G_i}, \qquad \mathcal{L}_{G_i} = -\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log D_i(\hat{x}_i)\big] - \tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log D_i(\hat{x}_i, \bar{e})\big],$

where $\bar{e}$ is the sentence vector modeled by a biLSTM layer, and the $\mathcal{L}_{DAMSM}$ loss is computed by a DAMSM pretrained on the Chinese meme-face dataset to evaluate the matching degree between the given text and the generated images. Overall, the following objective function is optimized during the training procedure:

$\mathcal{L}_{D_i} = -\tfrac{1}{2}\,\mathbb{E}_{x_i \sim p_{data}}\big[\log D_i(x_i)\big] - \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_i(\hat{x}_i)\big)\big] - \tfrac{1}{2}\,\mathbb{E}_{x_i \sim p_{data}}\big[\log D_i(x_i, \bar{e})\big] - \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_i(\hat{x}_i, \bar{e})\big)\big],$

where $x_i$ is a real image from the true data distribution belonging to the class represented by the pattern $p_i$, $\hat{x}_i$ is the stage-$i$ generated image, and $z$ is the noise vector sampled from a uniform distribution.
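The generator-side objective can be sketched numerically as below. This is an illustrative computation under the AttnGAN formulation the model inherits; the DAMSM weight `lam=5.0` is an assumed value (AttnGAN's default), not necessarily this paper's setting.

```python
import torch

def generator_stage_loss(d_uncond, d_cond):
    """Stage-i generator loss: mean of the unconditional and the
    text-conditional adversarial terms. d_uncond / d_cond are the
    discriminator probabilities D_i(x_hat) and D_i(x_hat, e_bar)
    for generated images (values in (0, 1))."""
    return -0.5 * torch.log(d_uncond).mean() - 0.5 * torch.log(d_cond).mean()

def total_generator_loss(stage_losses, damsm_loss, lam=5.0):
    # L = sum_i L_{G_i} + lambda * L_DAMSM
    return sum(stage_losses) + lam * damsm_loss

# toy check with hand-picked discriminator outputs
d_u = torch.tensor([0.5, 0.5])
d_c = torch.tensor([0.5, 0.5])
l_g = generator_stage_loss(d_u, d_c)              # -log(0.5) ~ 0.6931
total = total_generator_loss([l_g], torch.tensor(0.1))
```

With both terms at probability 0.5, the stage loss reduces to -log(0.5), which is a convenient sanity check when wiring up the real networks.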
2.2 Data Preparation
As part of the experimental preparation, 56,710 meme-faces of various kinds are first collected online, and their text captions are automatically extracted via an OCR engine. To obtain representative and popular meme-faces, we further implement a series of data-cleaning steps based on the categories of the meme-faces and the words in the text captions. More specifically:
Since most Chinese meme-faces are made from a small set of templates (such as the panda face in Figure 1), clustering is used to find meme-faces derived from the same template. Here we use a pre-trained Inception-v3 network Szegedy et al. (2016) to extract image features for the clustering.
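The template grouping can be sketched with a minimal k-means over image features. In the real pipeline the features would come from the pre-trained Inception-v3 network; here any feature matrix works, and the farthest-point initialization is an assumption made to keep the sketch deterministic.

```python
import numpy as np

def kmeans(feats, k, iters=20):
    """Minimal k-means used to group meme-faces that share a template.
    `feats` is an (n, d) feature matrix (Inception-v3 pool features in
    the real pipeline)."""
    # farthest-point initialization keeps the sketch deterministic
    centers = [feats[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(feats - c, axis=1) for c in centers],
                   axis=0)
        centers.append(feats[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        # assign each feature vector to its nearest center
        d = np.linalg.norm(feats[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        # recompute centers (keep the old center if a cluster emptied)
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(axis=0)
    return labels

# two well-separated toy "template" groups
feats = np.vstack([np.zeros((5, 8)), np.ones((5, 8)) * 10.0])
labels = kmeans(feats, k=2)
```

Each resulting cluster collects the meme-faces built from one template, which is what the pattern signal is extracted from.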
To refine the quality of the text captions in the dataset, a language model (trained on the text captions) is used to filter out those with too little information (e.g. Yeap, Harry Up, What) by setting a proper perplexity range. In addition, we also constrain the length of the text captions, reserving only those with length between 3 and 12.
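The caption-filtering step can be sketched as follows. The add-one-smoothed unigram model, the token-level length measure, and the perplexity window bounds are all illustrative assumptions standing in for the actual caption language model and thresholds.

```python
import math
from collections import Counter

def unigram_perplexity(caption, counts, total, vocab):
    """Per-token perplexity under a unigram LM with add-one smoothing;
    a toy stand-in for the caption language model used for filtering."""
    toks = caption.split()
    logp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks)
    return math.exp(-logp / len(toks))

def keep_caption(caption, counts, total, vocab,
                 min_len=3, max_len=12, ppl_range=(1.0, 1e4)):
    # length constraint (3..12 tokens here) plus a perplexity window;
    # the window bounds are placeholders, not the paper's values
    toks = caption.split()
    if not (min_len <= len(toks) <= max_len):
        return False
    lo, hi = ppl_range
    return lo <= unigram_perplexity(caption, counts, total, vocab) <= hi

corpus = "not bad wow not bad lil cutie".split()
counts = Counter(corpus)
total, vocab = len(corpus), len(counts)
kept = keep_caption("wow not bad", counts, total, vocab)
dropped = keep_caption("ok", counts, total, vocab)  # too short
```

Captions that are too short, too long, or fall outside the perplexity window are discarded before training.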
Finally, we crop the meme-face pictures to trim away the text captions. The above data preparation process yields a total of 2,955 meaningful meme-faces, split into training (90%) and testing (10%) sets.
2.3 Training Details
For model training, our MemeFaceGenerator is optimized using Adam Kingma and Ba (2014) with a learning rate of 0.0002 and a batch size of 14. The model is trained for 200 epochs in total. Throughout the training process, we update the generator every five epochs and the discriminator every epoch. All neural networks are implemented in PyTorch 0.3.0 Paszke et al. (2017).
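The alternating update schedule above can be sketched as a small helper. The 1-indexed epoch convention is an assumption; only the "generator every five epochs, discriminator every epoch" rule comes from the text.

```python
def updates_for_epoch(epoch, g_every=5):
    """Which networks receive an optimizer step at a given epoch:
    the discriminators every epoch, the generator only every
    `g_every` epochs (epochs 1-indexed here by assumption)."""
    return {"discriminator": True, "generator": epoch % g_every == 0}

# over the first 10 of the 200 training epochs
schedule = [updates_for_epoch(e) for e in range(1, 11)]
g_updates = sum(s["generator"] for s in schedule)      # epochs 5 and 10
d_updates = sum(s["discriminator"] for s in schedule)  # every epoch
```

Updating the discriminator more often than the generator is a common way to keep the adversarial signal informative while the generator catches up.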
3 System Demonstration
As depicted in Figure 3, the web server of our system demonstration (a screen recording of the web page is available at https://drive.google.com/file/d/1ewCLGds681LNtRDdwxPkfApWaAmjF03x/view) is built with Vue and Tornado. After the user types text into the top input box, the web front-end transfers the input text into JSON format and posts it to the back-end as model input. The model loads the saved parameters at every fifth epoch up to the final round and, for each loaded checkpoint, generates a corresponding meme-face from the text. The generated meme-faces are then posted sequentially to the web front-end to demonstrate how an output meme-face varies throughout the training process, together with a system log containing the elapsed time and checkpoint information. All meme-faces are 256×256 pixels and encoded using the base64 scheme.
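The JSON round trip between front-end and back-end can be sketched with the standard library. The field names `text` and `images` are hypothetical; the base64 encoding of each generated image matches the scheme described above.

```python
import base64
import json

def build_request(text):
    """Front-end side: wrap the user's input text as the JSON body
    posted to the back-end (field name 'text' is an assumption)."""
    return json.dumps({"text": text})

def build_response(image_bytes_per_checkpoint):
    """Back-end side: base64-encode each generated 256x256 meme-face
    so the whole checkpoint sequence fits in one JSON payload."""
    images = [base64.b64encode(b).decode("ascii")
              for b in image_bytes_per_checkpoint]
    return json.dumps({"images": images})

# round trip: text in, base64-encoded image sequence out
req = json.loads(build_request("Wow, not bad"))
resp = json.loads(build_response([b"\x89PNG-fake-bytes"]))
decoded = base64.b64decode(resp["images"][0])
```

Base64 keeps the binary image data safe inside a JSON string at the cost of roughly a third more payload size.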
The generated image sequence on the left side of the web page offers a visual hint about the learning process of MemeFaceGenerator throughout training. Taking the case in Figure 3 as an example, after the template information of the panda face is introduced, the model rapidly learns the profile and general features of the meme-face. It then starts to focus on generating more detailed facial expressions, gradually adjusting the expression to match the input text. In the end, MemeFaceGenerator synthesizes a smiling face that is semantically relevant to the input text Wow, not bad.
4.1 Case Study
Ideally, the desired meme-face generator should be capable of generating images consistent with their text captions; in other words, the generated image is expected to be semantically relevant to the text caption. To verify this capability, in the following section several generated meme-faces from the test set are selected to analyze the acuity and rationality of our proposed MemeFaceGenerator in capturing semantic relevance.
It can be observed from Figure 4 that, given the text input lil cutie I’m here, the GAN architecture indeed captures the semantics of lil cutie, reflected by the blushing cheeks in the image. Similarly, in Figure 4, our meme-face generator extracts the phrase hang up (the phone) from the text input and accordingly generates a telephone receiver in the image.
In addition, beyond capturing relatively straightforward semantics between text inputs and meme-face images, Figure 4 shows signs that the proposed meme-face generator also apprehends more latent semantics between inputs and outputs. Given the text inputs just go, dare you hit the road don’t come back and listen to me, stop arguing and put up a fight, the generator extracts the sentiments of sadness and provocation from the texts and generates meme-faces with tears and a provoking (shouting) posture, respectively, to represent the extracted sentiments.
Apart from face details, the meme-face generator also learns, through training, the relevance between faces in different templates and text sentiments. For example, Figure 4 shows the famous meme smiles of wrestler D’Angelo Dinero and actor Choi Sung-kook, which are consistent with the ironic and complimentary emotions in their corresponding text inputs Best drama queen of the year and Wow, not bad.
Overall, the synthesized results indicate that the visual-semantic representations indeed capture various kinds of semantics from the text input. With the help of the stacked attentional generative network, MemeFaceGenerator is capable of generating images whose details and emotions agree with the text inputs. Such results can be used directly in daily communication on social networks without further manual editing.
4.2 Numerical Evaluation
To further analyze the quality of the generated results quantitatively, the generated meme-faces in the test set are cross-evaluated by 3 annotators, under the following labeling criteria:
0: the quality of the image is poor, or the image is inconsistent with the text caption.
1: the quality of the image is acceptable, but the image itself is not closely relevant to its text caption.
2: the image is interesting in terms of its content and also matches its text caption.
Overall, the ratios of images labelled 2, 1 and 0 are 38.8%, 43.4% and 17.8%, respectively. More than 80% of the generated images in the test set are annotated as 1 or higher, which further verifies the capability of MemeFaceGenerator in modeling the latent semantic relevance between input text and output image.
In this paper, we explored the Chinese meme-face generation task using a GAN architecture with an attention module. To improve the quality of the generated results, meme-face template information is utilized during model training, and a precise data-cleaning process is also implemented. Our proposed MemeFaceGenerator proves able to generate meme-faces consistent with text inputs; it can capture multiple types of semantics between generated images and text captions, as reflected by the varied details and facial expressions in the generated meme-faces.
- Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
- Cunha et al. (2018) João Miguel Cunha, Pedro Martins, and Penousal Machado. 2018. How shell and horn make a unicorn: Experimenting with visual blending in emoji. In ICCC, pages 145–152.
- Dash et al. (2017) Ayushman Dash, John Cristian Borges Gamboa, Sheraz Ahmed, Marcus Liwicki, and Muhammad Zeshan Afzal. 2017. Tac-gan-text conditioned auxiliary classifier generative adversarial network. arXiv preprint arXiv:1703.06412.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
- Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch.
- Radpour and Bheda (2017) Dianna Radpour and Vivek Bheda. 2017. Conditional generative adversarial networks for emoji synthesis with word embedding manipulation. arXiv preprint arXiv:1712.04421.
- Reed et al. (2016a) Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. 2016a. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 49–58.
- Reed et al. (2016b) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016b. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
- Reed et al. (2016c) Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. 2016c. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217–225.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826.
- Wah et al. (2011) Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. 2011. The caltech-ucsd birds-200-2011 dataset.
- Xu et al. (2018) Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324.
- Yang et al. (2016) Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 21–29.
- Zhang et al. (2017a) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. 2017a. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1710.10916.
- Zhang et al. (2017b) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. 2017b. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915.