Memeify: A Large-Scale Meme Generation System

10/27/2019 ∙ by Suryatej Reddy Vyalla, et al. ∙ 0

Interest in the research areas related to meme propagation and generation has been increasing rapidly in the last couple of years. Meme datasets available online are either specific to a context or contain no class information. Here, we prepare a large-scale dataset of memes with captions and class labels. The dataset consists of 1.1 million meme captions from 128 classes. We also provide reasoning for the existence of broad categories, called "themes", across the meme dataset; each theme consists of multiple meme classes. Our generation system uses a trained state-of-the-art transformer-based model for caption generation, employing an encoder-decoder architecture. We develop a web interface, called Memeify, for users to generate memes of their choice, and explain in detail the working of the individual components of the system. We also perform a qualitative evaluation of the generated memes by conducting a user study. A link to a demonstration of the Memeify system is also provided.







1. Introduction

Memes (Dawkins, 1976) are currently among the hottest trending ways of expressing ideas and opinions on social media. Ever since the social media boom of the late 2000s, memes have conquered the Internet landscape in ways no one could have imagined. An average social media user has access to hundreds of memes every day across various platforms. Creating a meme requires context and creativity (Gatt and Krahmer, 2018). Recent advances in deep learning and natural language processing have allowed neural networks to be used for generative tasks (Lehman et al., 2016; Colombo et al., 2017; Goodfellow et al., 2014).

Peirson et al. (2018) were among the first to propose a meme generation network, modeling the task of meme generation as an image captioning problem. Their dataset, however, does not contain class labels. They used standard GloVe embeddings to create word vectors for the captions and modelled the generation task using a simple encoder-decoder architecture.

There have been significant advances in deep natural language processing tasks since then (Liu et al., 2019b; Yang et al., 2019). The BERT model (Devlin et al., 2018) generates a vector for each word by training a transformer network to extract deep bidirectional representations. The MT-DNN model (Liu et al., 2019a) generalizes BERT's bidirectional language model by applying an effective regularization mechanism when creating the word vectors. Our model leverages these latest techniques for generating memes.

In this paper, we explain our proposed meme generation system, called Memeify, which can creatively generate captions given a context. We first build a large-scale dataset that is both reliable and robust. Further, we draw some inferences on the characteristics of memes based on their image backgrounds and captions. Finally, we propose an end-to-end meme generation system that allows users to generate memes on our Memeify web application.

The major contributions of the paper are two-fold: (i) generation of a large-scale meme dataset, which to our knowledge, is the first of its kind, and (ii) design of a novel web application to generate memes in real time.

For the sake of reproducible research, we have made the code and dataset publicly available.

2. Dataset Description

(a) t-SNE plot of word clusters
(b) t-SNE plot of image clusters
Figure 1. Depiction of the meme clusters formed when applying word vectorization and image vectorization.

2.1. Data Curation

We created our own dataset of meme images, their captions and corresponding class information. Every meme belongs to a particular class determined by the base image of the meme. The meme datasets currently available online have various limitations. The Reddit meme dataset on Kaggle (Goswami, 2018) contains neither captions for the memes nor class information. The meme generator dataset (Nath, 2018) contains 90k images with captions and class labels, but it suffers from a huge class imbalance, and we could not download most of the images because the links were broken.

To overcome these issues, we scraped data from QuickMeme (QuickMeme, 2016) and a few other sources to create a dataset of 1.1 million memes belonging to 128 classes. We have base images for all the classes and captions for each meme. Such a large dataset is required to train our meme generation model.

2.2. Dataset Analysis: Themes

To understand our data, we draw a stratified sample of 5,000 memes and, for each meme, create an average word embedding of its caption to perform clustering (Berkhin, 2006). A visualization of these clusters is shown in Figure 1(a). We observe that there are 5 distinct clusters in the data, with the remaining embeddings spread evenly across the space. The results were consistent across different samples.
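This clustering step can be sketched as follows. The word vectors here are random stand-ins for the pretrained embeddings the analysis would use, the k-means routine is a minimal hand-rolled version of a library implementation, and the captions, dimensions and cluster count are illustrative; the t-SNE projection behind Figure 1(a) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in word vectors: the real pipeline would use pretrained embeddings
# (e.g. GloVe or BERT vectors); random vectors keep the sketch self-contained.
VOCAB = {}
def word_vec(word, dim=50):
    if word not in VOCAB:
        VOCAB[word] = rng.standard_normal(dim)
    return VOCAB[word]

def caption_embedding(caption):
    # Average of the word vectors in the caption, as in Section 2.2.
    return np.mean([word_vec(w) for w in caption.lower().split()], axis=0)

def kmeans(X, k=6, iters=50, seed=0):
    # Minimal k-means; a real analysis would use e.g. sklearn.cluster.KMeans.
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

captions = ["not sure if trolling or stupid",
            "brace yourselves winter is coming",
            "one does not simply walk into mordor"] * 40
X = np.stack([caption_embedding(c) for c in captions])
labels = kmeans(X, k=6)
print(X.shape, len(labels))
```

In the actual analysis the cluster labels would then be projected to two dimensions with t-SNE to produce a plot like Figure 1(a).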

From this, we hypothesize that every meme can be associated with a broad category, which we call a "theme", based on the words in its caption. We segregate meme classes into the themes identified by the clustering algorithm and label them "Savage", "Depressing", "Unexpected", "Frustrated" and "Wholesome". The remaining classes are labeled "Normie". This labeling is done on the basis of captions and their usage on social media platforms. A class c is labelled with theme t if more than 90% of the memes from class c belong to cluster t. Using this condition, every class is assigned a cluster (theme) as shown in Table 1.
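The 90% assignment rule can be expressed as a small function. The cluster-to-theme mapping and the per-class cluster assignments below are hypothetical.

```python
from collections import Counter

def assign_theme(cluster_ids, themes, threshold=0.9):
    """Label a class with theme t if more than `threshold` of its memes
    fall in cluster t; otherwise label it "Normie"."""
    counts = Counter(cluster_ids)
    cluster, n = counts.most_common(1)[0]
    if n / len(cluster_ids) > threshold:
        return themes[cluster]
    return "Normie"

# Hypothetical mapping from cluster id to theme name.
themes = {0: "Savage", 1: "Depressing", 2: "Unexpected",
          3: "Frustrated", 4: "Wholesome"}

# Hypothetical cluster assignments for the memes of two classes:
print(assign_theme([0] * 95 + [1] * 5, themes))              # -> Savage
print(assign_theme([0] * 40 + [2] * 35 + [4] * 25, themes))  # -> Normie
```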

Theme Count
Normie 44
Savage 22
Depressing 18
Unexpected 20
Frustrated 14
Wholesome 10
Table 1. Assignment of classes to themes.

To confirm our hypothesis, we create image vectors for each of our memes using a VGG16 convnet (Simonyan and Zisserman, 2014). We then cluster these vectors and assign themes as labels for plotting. The result is shown in Figure 1(b). Since the image-based clusters do not align with the theme labels, we infer that the theme of a meme depends on the words in its caption and not on the background image.

3. MEMEIFY: Our Proposed Architecture

(a) Memeify web application landing page
(b) Memeify web application showing a meme
Figure 2. Memeify – demo system images.

3.1. Generation Model

We consider the problem of meme generation as a language modeling task: given an input image or a class label as a prompt, produce funny and apt captions. We train multiple deep learning models, i.e., LSTM networks (Hochreiter and Schmidhuber, 1997) and their variants, for modeling the text from our meme corpus. However, qualitative analysis of the generated captions reveals that most of the models have the following limitations:

  • They do not accurately capture the class information.

  • They are not able to reproduce humour well.

To mitigate these issues, we use the transformer-based GPT-2 architecture (Radford et al., 2019) as our base generative language model. We incorporate the class information for the different memes by prepending each meme caption with its class name. This helps us generate class-specific meme captions by enforcing the model to condition on the class information. The GPT-2 architecture also addresses the difficulty of expressing humour in text, owing to its self-attention mechanism (Vaswani et al., 2017), large-scale generative pre-training and adaptability to multiple tasks.
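A minimal sketch of this class-conditioning scheme: training captions are prepended with their class name, and at generation time the class name alone serves as the prompt. The separator string and the example class names are assumptions (the paper only states that the class name is prepended); fine-tuning GPT-2 on such lines could then be done with, e.g., the Hugging Face transformers library.

```python
SEP = " | "  # hypothetical separator; the paper only says the class is prepended

def to_training_example(meme_class, caption):
    # Prepend the class label so the language model learns
    # class-conditional caption distributions.
    return f"{meme_class}{SEP}{caption}"

def to_prompt(meme_class):
    # At generation time, the class alone seeds the model,
    # which completes the caption.
    return f"{meme_class}{SEP}"

corpus = [("futurama_fry", "not sure if trolling or stupid"),
          ("imminent_ned", "brace yourselves winter is coming")]
train_lines = [to_training_example(c, t) for c, t in corpus]
print(train_lines[0])              # futurama_fry | not sure if trolling or stupid
print(to_prompt("imminent_ned"))   # imminent_ned |
```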

We use this generative model trained with the class information as the caption generator in the Memeify system. A few examples of the generated captions are shown in Table 2.

The Memeify system enables the generation of memes in two specific ways:

  1. Randomization: As explained in Section 2, every meme that we generate has an associated class and theme. We enable the users to randomly pick a theme and a class, and we then use this class as a seed to the generative model which then produces an appropriate caption. The corresponding image for the meme is retrieved from a database of default meme class images.

  2. Customization: A user can upload a custom image into the system, and the system produces a caption fitting the image. To identify the theme and class of the required meme, we use a similarity matching algorithm that classifies the image into its correct contextual theme. For this, we use a VGG16 convnet (Simonyan and Zisserman, 2014) pretrained on ImageNet (Deng et al., 2009) as a feature extractor to convert images into their corresponding feature vectors. We extract feature vectors for each of the default class images in the database and then use a locality sensitive hashing algorithm (Gionis et al., 1999) to create a lookup table over these default images. Every new image uploaded into the system is converted into a feature vector, and its meme class is obtained from the lookup table. The obtained class is then used as a seed to the generative model for caption generation.
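The lookup step can be sketched with random-projection LSH, one common locality sensitive hashing family (the paper does not specify which scheme it uses). The feature vectors below are random stand-ins for VGG16 activations, and the class names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 512  # stand-in; real features would come from a VGG16 layer

# Random-projection LSH: hash a vector by the signs of its projections onto
# random hyperplanes; nearby vectors tend to share a bucket.
PLANES = rng.standard_normal((16, DIM))

def lsh_key(vec):
    bits = (PLANES @ vec) > 0
    return bits.tobytes()

# Build the lookup table from the default class images' feature vectors.
class_features = {"futurama_fry": rng.standard_normal(DIM),
                  "imminent_ned": rng.standard_normal(DIM)}
table = {lsh_key(v): name for name, v in class_features.items()}

def classify_upload(vec):
    # Exact-bucket lookup; a production system would also probe
    # neighbouring buckets to tolerate larger feature differences.
    return table.get(lsh_key(vec))

# An uploaded copy of a default class image maps back to its class.
print(classify_upload(class_features["futurama_fry"]))  # futurama_fry
```

The returned class name would then seed the caption generator, as described above.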

Figure 3. Architecture of Memeify.

3.2. Web Application

We develop a web interface for users to interact with our meme generation algorithm. The landing page of the website is shown in Figure 2. The users are provided with two options based on the model explained above:

  1. Random Image: The users can choose to generate a random meme. For this, they select a theme and then a class on the website. This data is sent to the backend, written in Flask (Projects, 2010), which uses the trained model to generate memes.

  2. Custom Image: The users can also generate a meme caption for an image of their choice. The user uploads an image onto the website. The image is sent to our server, and a caption is generated.

3.3. Engineering

  1. The generation model is the bottleneck in the pipeline. Therefore, we implement a Redis (Labs, 2009) cache on our backend. We cache the generated memes for a particular type of request to save computation, and we ensure that memes are not repeated for a single user by using web sessions. The cache is refreshed every 4 hours.
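The caching scheme can be sketched with an in-memory stand-in for Redis: cached captions expire after four hours, and per-session bookkeeping prevents repeats for a single user. Class names, keys and the generator stub are illustrative.

```python
import time

class MemeCache:
    """In-memory sketch of the caching scheme; the actual backend uses
    Redis (SETEX-style expiry) behind Flask."""
    TTL = 4 * 60 * 60  # cache refreshed every 4 hours

    def __init__(self):
        self._store = {}   # (theme, class) -> (timestamp, [captions])
        self._served = {}  # session_id -> captions already shown

    def get_or_generate(self, session_id, key, generate):
        entry = self._store.get(key)
        if entry is None or time.time() - entry[0] > self.TTL:
            # Expensive model call happens only on a miss or an expired entry.
            self._store[key] = (time.time(), generate(key))
        seen = self._served.setdefault(session_id, set())
        for caption in self._store[key][1]:
            if caption not in seen:  # never repeat a meme within a session
                seen.add(caption)
                return caption
        return None  # cache exhausted for this user; would trigger fresh generation

def fake_generate(key):  # stand-in for the GPT-2 caption generator
    return [f"{key[1]} caption {i}" for i in range(3)]

cache = MemeCache()
print(cache.get_or_generate("user1", ("Normie", "imminent_ned"), fake_generate))
print(cache.get_or_generate("user1", ("Normie", "imminent_ned"), fake_generate))
```

The second call for the same session returns a different caption from the cached batch without re-running the model.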

  2. All data transfers between the user's webpage and the server happen asynchronously using AJAX (Javascript, 1999). This ensures a seamless experience for the user with minimal delay; the page does not need to be refreshed to generate a new meme.

The complete pipeline is shown in Figure 3.

4. Evaluation and Analysis

Class: futurama_fry (Theme: Frustrated)
  Original caption: not sure if smart or just british
  Memeify-generated captions:
  • not sure if joking or serious
  • not sure if trolling or stupid
  • not sure if intelligent or surrounded by idiots

Class: imminent_ned (Theme: Normie)
  Original caption: brace yourselves winter is coming
  Memeify-generated captions:
  • brace yourselves shitty memes are coming
  • brace yourselves 9 AM snoozes are coming
  • brace yourselves email overload is coming

Table 2. Examples from the dataset and generation system, showing each default image's class and theme, its original caption, and captions generated by Memeify.

To effectively evaluate the quality of the generated memes and the robustness of the generative model, we performed a human evaluation by conducting a user study. Twenty human experts (social media experts, aged 25 to 40 years) volunteered for the study. We performed three different evaluation tasks to qualitatively understand the working of the generative model. We consider the Dank Learning meme generator proposed by Peirson et al. (2018) as a baseline model for comparative evaluation. The evaluation tasks are explained in detail in the rest of this section.

4.1. User Satisfaction

To evaluate user satisfaction levels, we conducted a rating study. We generated a batch of 100 memes for each theme by randomly picking classes within each theme, separately for the baseline model and our model. For each theme, we showed each volunteer a set of 5 memes generated by our model, 5 memes generated by the baseline model, and 5 original memes, in a mixed order (without revealing which memes were original and which were generated). We asked them to rate the memes, from the lowest to the highest rating, based on caption content, humour and originality. For each theme, we then averaged the results from the 20 volunteers, as shown in Table 3. We observe that the quality of our generated memes is almost on par with that of the original memes, and that our model outperforms the baseline model across all themes. On the whole, the volunteers were satisfied with the quality of the generated memes.

Theme ARO ARG ARB
Normie 3.1 2.9 2.7
Savage 3.6 3.6 3.4
Depressing 3.2 3.1 3.0
Unexpected 3.5 3.3 3.3
Frustrated 3.2 3.0 2.9
Wholesome 2.8 2.7 2.6
Overall 3.23 3.1 2.98
Table 3. Average ratings of volunteers for original and generated memes (baseline and ours) across themes. ARO represents average ratings for original memes, ARG represents average ratings for generated memes (our model), and ARB represents average ratings for generated memes (baseline model).

4.2. Differentiation Ability of Users

We wanted to understand whether the memes generated by our model were significantly different from the ones in the dataset. For this purpose, we generated a separate batch of 50 memes for each theme, randomly picking a class within each theme, for both the baseline model and our model (similar to the generation process in Section 4.1). For each theme, we showed 5 generated memes and 5 original memes to the volunteers and asked them to classify each meme as generated or original. We then analysed the results of this classification study, plotted confusion matrices for both the baseline model and our model as shown in Figure 4, and report the individual evaluation metrics in Table 4.

From the confusion matrix of our model, we observe that the volunteers frequently mis-classified generated memes as original memes. This vouches for the authenticity of our model and shows that the generated memes are qualitatively humorous and genuine enough to confuse the volunteers between original and generated memes.

(a) our model
(b) baseline model
Figure 4. Confusion matrices for the baseline model and our model.
Metric Baseline Our Model
Precision 64.28 56.52
Recall 90 86.66
Accuracy 70 60
F1-score 75.0 68.42
Table 4. Evaluation (based on precision, recall, accuracy and F1-score) of the classification study presented in Section 4.2.
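The metrics in Table 4 follow the standard definitions. The confusion-matrix counts below are hypothetical, but chosen so that they reproduce the baseline row of the table up to rounding, treating "generated" as the positive class.

```python
def metrics(tp, fp, fn, tn):
    # Standard precision/recall/accuracy/F1 from confusion-matrix counts,
    # reported as percentages rounded to two decimals.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * m, 2) for m in (precision, recall, accuracy, f1))

# Hypothetical counts consistent with the baseline row of Table 4:
# TP=9, FP=5, FN=1, TN=5.
print(metrics(9, 5, 1, 5))  # (64.29, 90.0, 70.0, 75.0)
```

Note that lower scores in this study indicate a harder-to-distinguish, i.e. better, generator.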

4.3. Theme Recovery

We also wanted to understand if the idea of themes as ‘broad categories’ (explained in Section 2.2) held true across volunteers. To understand this, we hypothesised that if volunteers were correctly able to recover themes from randomly sampled memes, then it would empirically prove the reasoning behind the formation of themes. To acquaint volunteers with the idea of a theme, we showed a set of 10 memes corresponding to each theme from our dataset.

We then generated a batch of memes (from our model) across themes, randomly picking classes (similar to the generation processes in Sections 4.1 and 4.2). We asked the volunteers to classify a sample of 20 generated memes into the 6 themes. We report the overall classification accuracy and the per-theme classification accuracy, averaged across all volunteers, in Table 5. The high overall classification accuracy confirms our hypothesis about the existence of themes across meme classes. On further examination, we notice that all themes individually have very high accuracies apart from the 'Normie' theme. We attribute the relatively low accuracy of the 'Normie' theme to its large number of classes (see Table 1), which have low intra-theme similarity compared to other themes (cf. Figure 1(a)).

Theme Accuracy
Normie 77.3
Savage 86.1
Depressing 84.6
Unexpected 90.2
Frustrated 87.7
Wholesome 86.8
Overall 85.5
Table 5. Overall accuracy and per-theme accuracy for the classification study of Section 4.3.

5. Conclusion

In this paper, we presented our meme generation system, Memeify. Memeify is capable of generating memes either from existing classes and themes or from custom images. We also created a large-scale meme dataset consisting of meme captions, classes and themes, and provided an in-depth qualitative analysis on the basis of a user study. Currently, our work is limited to memes whose captions have up to two parts; we are interested in extending the Memeify system to captions with multiple parts.


  • P. Berkhin (2006) A survey of clustering data mining techniques. In Grouping Multidimensional Data, Cited by: §2.2.
  • F. Colombo, A. Seeholzer, and W. Gerstner (2017) Deep artificial composer: a creative neural network model for automated melody generation. In International Conference on Evolutionary and Biologically Inspired Music and Art, pp. 81–96. Cited by: §1.
  • R. Dawkins (1976) The selfish gene. Oxford University Press, Oxford, UK. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. Cited by: item 2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In arXiv preprint arXiv:1810.04805, Cited by: §1.
  • A. Gatt and E. Krahmer (2018) Survey of the state of the art in natural language generation: core tasks, applications and evaluation. Journal of Artificial Intelligence Research. Cited by: §1.
  • A. Gionis, P. Indyk, and R. Motwani (1999) Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB ’99, San Francisco, CA, USA, pp. 518–529. External Links: ISBN 1-55860-615-7, Link Cited by: item 2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, Cited by: §1.
  • S. Goswami (2018) Reddit dank memes dataset. Note: Cited by: §2.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Link, Document Cited by: §3.1.
  • Javascript (1999) Asynchronous JavaScript and XML. Note: Cited by: item 2.
  • R. Labs (2009) Redis framework. Note: Cited by: item 1.
  • J. Lehman, S. Risi, and J. Clune (2016) Creative generation of 3d objects with deep learning and innovation engines. In Proceedings of the 7th International Conference on Computational Creativity, Cited by: §1.
  • X. Liu, P. He, W. Chen, and J. Gao (2019a) Multi-task Deep Neural Networks for Natural Language Understanding. In arXiv preprint arXiv:1901.11504, Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019b) RoBERTa: A robustly optimized BERT pretraining approach. CoRR. Cited by: §1.
  • S. Nath (2018) Meme generator dataset. Note: Cited by: §2.1.
  • A. L. Peirson V and E. M. Tolunay (2018) Dank learning: generating memes using deep neural networks. arXiv preprint arXiv:1806.04510. Cited by: §1, §4.
  • T. P. Projects (2010) Flask framework. Note: Cited by: item 1.
  • QuickMeme (2016) Quick meme website. Note: Cited by: §2.1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog. Cited by: §3.1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.2, item 2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR. Cited by: §3.1.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR. Cited by: §1.