Towards Expressive Communication with Internet Memes: A New Multimodal Conversation Dataset and Benchmark

09/04/2021 ∙ by Zhengcong Fei, et al. ∙ 0

As a new kind of expressive element, Internet memes are popular and extensively used in online chatting scenarios since they make dialogues vivid, moving, and interesting. However, most current dialogue research focuses on text-only dialogue tasks. In this paper, we propose a new task named Meme incorporated Open-domain Dialogue (MOD). Compared to previous dialogue tasks, MOD is much more challenging since it requires the model to understand the multimodal elements as well as the emotions behind them. To facilitate MOD research, we construct a large-scale open-domain multimodal dialogue dataset incorporating abundant Internet memes into utterances. The dataset consists of ∼45K Chinese conversations with ∼606K utterances. Each conversation contains about 13 utterances with about 4 Internet memes on average, and each utterance equipped with an Internet meme is annotated with the corresponding emotion. In addition, we present a simple and effective method, which utilizes a unified generation network to solve the MOD task. Experimental results demonstrate that our method trained on the proposed corpus is able to achieve expressive communication including texts and memes. The corpus and models have been made publicly available at




1 Introduction

Internet memes have become one of the most important means of expression and of conveying emotion in social media Wang et al. (2019). Compared with text-only communication, dialogues become more expressive and vivid when Internet memes are incorporated. Meanwhile, Internet memes deliver many implicit and strong emotions. As a result, the use of Internet memes in online chats has become increasingly popular Chen (2020); Beskow et al. (2020). Even though there is increasing interest in chatbots that can converse with humans using multiple modalities, incorporating contextualized Internet memes into multi-turn open-domain dialogues under diverse situations is still far from satisfactory.

Figure 1: Illustrations of meme incorporated open-domain dialogues. Both history and response can be in the form of text-only, meme-only, or a combination of both. Corresponding emotion is annotated for each used Internet meme in red.

A variety of related works have explored multimodal information in dialogues. The first group of work is emoji prediction according to the chatting context Xie et al. (2016); Barbieri et al. (2017, 2018). Emojis are limited in variety and small in size, while memes are more expressive and constantly evolving. The second group is multimodal dialogue in the retail domain, which is task-oriented and in which the images are the products on sale Liao et al. (2018); Nie et al. (2019); Cui et al. (2019). The models developed for this task often focus on specific aspects and on generating correct answers to questions, and thus fail to model other aspects that are important for dialogue, such as emotion analysis. Another group of research works investigates visually grounded dialogue Das et al. (2017); Haber et al. (2019), which uses natural language to talk about visual content given in advance. However, the mode of conversation there is limited to text only Le et al. (2019).

As a step towards expressive open-domain conversational AI, we introduce a new task, Meme incorporated Open-domain Dialogue (MOD), along with a large-scale Chinese multimodal conversation dataset, which facilitates multimodal conversation modeling and emotion analysis in multi-turn dialogues. Specifically, provided with a multimodal dialogue context, the MOD task aims to generate a vivid response that is text-only, meme-only, or a mixture of both, which can be considered a general paradigm compared with conventional dialogue tasks. Our dataset consists of 45K human-human open-domain Chinese conversations between two participants. Each meme in the conversation is annotated with the corresponding emotion, which can be used as supervision for emotion analysis modeling. In particular, we provide a hard version of the test set, which contains memes that do not appear during training, to comprehensively evaluate the generalization ability of the model. Examples of conversations in our dataset involving both text and Internet memes are shown in Figure 1.

To showcase how the new dataset may be exploited, we propose a simple and effective model, which utilizes a unified network to generate multimodal responses. Practically, we pool all sub-tasks, such as text generation and Internet meme prediction, into a general sequence generation procedure and solve them with a language model architecture. This avoids designing complex systems for individual components separately, and all resulting sub-tasks can be covered simultaneously. Experimental results highlight that our proposed model trained on the new corpus can create reasonable combined responses and show promise for developing better models. It is our hope that the introduction of this task will spark new interest in multimodal open-domain conversation modeling. The main contributions of this paper are summarized as follows:

  • We introduce a new task where a dialogue system is required to incorporate lively Internet memes into open-domain conversations towards expressive communication.

  • We build a large-scale Chinese dialogue dataset, which contains abundant Internet memes and emotion annotations. The corpus can support research on not only MOD generation but also emotion modeling.

  • To further illustrate the dataset’s potential, we propose a simple and effective model for reference and conduct extensive experiments. Results show that our method can generate rational multimodal responses, while there is still much room for further research.

2 Related Work

2.1 Internet Meme in Dialogue

Multiple expression forms, including emojis, stickers, and memes, have become prevalent along with the development of online chats. Among them, the meme is a type of content in the visual format of images, GIFs, or short videos, which can inject humor into conversations and create an emotional context Posey et al. (2010). Compared with emojis, which are restricted to a fixed size Barbieri et al. (2018, 2017), Internet memes are more expressive and come in great variety. One work similar to ours is sticker recommendation Jesus et al. (2019); Laddha et al. (2020); Gao et al. (2020), where suitable stickers are retrieved to match the text-only dialogue history, which can be regarded as a special case of MOD. Besides, the tasks of meme retrieval Milo et al. (2019); Perez-Martin et al. (2020); Sharma et al. (2020), detecting hate speech in memes, and clustering memes according to events Miliani et al. (2020); Kiela et al. (2020) have been proposed to help Internet meme modeling. In this work, we focus on a more challenging situation: generating utterances merged with Internet memes to make conversations more vivid and engaging.

2.2 Multimodal Dialogue Datasets

Research on dialogue systems has increasingly shifted towards integrating more modalities, such as image, audio, and video, to build informative conversational interaction. The datasets reported in Das et al. (2017); Mostafazadeh et al. (2017) contribute to bridging the gap between vision and language. In VisDial Das et al. (2017); Kamezawa et al. (2020), a system is required to answer questions about an input image given the dialogue history. The AVSD dataset AlAmri et al. (2019) has been used for response generation with audio and visual features Li et al. (2020). In our dataset, by contrast, the Internet memes are randomly distributed in conversations rather than given as background knowledge. The Multimodal Dialogue (MMD) dataset Saha et al. (2018), which covers the fashion domain with information from both images and texts, has further facilitated research on multimodal dialogue. While the datasets resulting from the above tasks provide opportunities to explore multimodal dialogues, they are more concerned with question answering and are limited to task-oriented scenarios. In contrast, MOD includes Internet meme usage in natural conversations and additional emotion modeling.

3 Task and Data

3.1 Meme incorporated Open-domain Dialogue (MOD) task

Provided with a dialogue history consisting of utterances filled with Internet memes, the dialogue system aims to build an interesting response in the form of text-only, meme-only, or a mixture of both. Formally, we use $U = \{u_1, \dots, u_K\}$ to denote $K$ turns of Internet meme incorporated dialogue, where utterance $u_k = (S_k, m_k)$ is a pair of the text context $S_k$ and the Internet meme $m_k$. Note that $m_k = \text{None}$ denotes that no Internet meme is incorporated in the $k$-th utterance, and $S_K = \text{None}$ denotes that no text is generated as the response. Therefore, the goal of MOD is to predict the target response $u_K$ for the given dialogue history $U_{<K} = \{u_1, \dots, u_{K-1}\}$ as:

$$\hat{u}_K = \arg\max_{u_K} P(u_K \mid U_{<K})$$

We further split the current scope of MOD into the following three consecutive sub-steps: (1) Text Response Modeling: given the multimodal history context $U_{<K}$, the task aims to generate a coherent and natural text response $S_K$. (2) Meme Usage Prediction: given a multimodal context of several dialogue turns $U_{<K}$ and the generated text response $S_K$, the task is to decide whether to include an Internet meme in the response, which can be considered a binary classification. (3) Meme Retrieval: given a multimodal historical context $U_{<K}$ and the generated text response $S_K$, the goal is to select a suitable meme $m_K$ as feedback.
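The three sub-steps can be read as a small pipeline. The sketch below is only an illustration of that control flow: the `Utterance` class, the `respond` function, and the three callables are hypothetical names introduced here, and the lambdas are toy stand-ins rather than trained components.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Utterance:
    """One dialogue turn: a (text, meme) pair in which either part may be None."""
    text: Optional[str] = None      # None means the turn has no text
    meme_id: Optional[int] = None   # None means the turn has no meme


def respond(
    history: List[Utterance],
    generate_text: Callable[[List[Utterance]], Optional[str]],
    use_meme: Callable[[List[Utterance], Optional[str]], bool],
    retrieve_meme: Callable[[List[Utterance], Optional[str]], int],
) -> Utterance:
    """Run the three sub-steps in order and assemble the response."""
    text = generate_text(history)                 # (1) text response modeling
    meme_id = None
    if use_meme(history, text):                   # (2) meme usage prediction
        meme_id = retrieve_meme(history, text)    # (3) meme retrieval
    return Utterance(text=text, meme_id=meme_id)


# Toy stand-ins for the three components (hypothetical, illustration only).
history = [Utterance(text="I passed the exam!"), Utterance(meme_id=12)]
reply = respond(
    history,
    generate_text=lambda h: "Congratulations!",
    use_meme=lambda h, t: True,
    retrieve_meme=lambda h, t: 7,
)
text_only = respond(
    history,
    generate_text=lambda h: "Well done.",
    use_meme=lambda h, t: False,
    retrieve_meme=lambda h, t: 7,
)
print(reply, text_only)
```

Keeping the meme decision separate from the meme choice mirrors the split between sub-steps (2) and (3): a text-only reply simply leaves `meme_id` as `None`.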

3.2 Data Collection

In this section, we describe the construction of our proposed multi-turn meme-incorporated dialogue dataset in detail.

Step 1: Pre-processing.

For the Internet meme set, the meme candidates are first collected from the Internet and then chosen carefully by annotators to maintain good quality. In addition, if textual information appears in a selected Internet meme, we also annotate it manually. To prevent the model from relying only on the textual information and ignoring visual features, we control the proportion of memes that contain no text in the final set to 40%. Meanwhile, to avoid multiple appropriate memes being selected under one dialogue condition, we filter out memes with highly similar or duplicate semantic content. Finally, we obtain a total of 307 Internet memes for the subsequent data annotation process. To facilitate arrangement and annotation, the Internet meme set is further split into four groups: atmosphere adjustment, basic expression, basic emotion, and common semantics.

For the initial conversation set, open-domain dialogue data with Internet memes is scarce, and collecting multi-turn conversations from scratch is costly and time-consuming. Thus, our annotation is based on the large version of an existing large-scale Chinese dialogue dataset Wang et al. (2020). To make each chatting session contain rich information, we remove the dialogues which have less than utterances.

Step 2: Internet meme incorporated response construction.

The annotators, who are well-educated and familiar with dialogue research, are asked to perform one of two operations using the prepared Internet meme candidates: use the most suitable Internet meme to replace part of the text conversation, or insert an Internet meme into the utterance to enhance the emotion of the current dialogue. In particular, we also ask annotators to label the emotional state when using each Internet meme. The annotators are specially instructed based on the following criteria: (i) behave naturally, so that the meme usage is in line with real daily chats; (ii) keep the number of different Internet memes in the dataset balanced to avoid meaningless clustering and biased data. Dialogues without any labeled Internet memes are discarded in the later data processing stage. Note that, different from previous works, our annotation procedure is conducted after the conversations took place, so it does not interfere with the human conversations, e.g., by prompting participants towards an overused Internet meme.

| Dataset statistics | Size |
| --- | --- |
| # dialogues (chat sessions) | 45,174 |
| # utterances | 606,014 |
| # tokens | 5,339 |
| # Internet memes | 307 |
| Avg. # of utterances in a dialogue | 13.42 |
| Avg. # of Internet memes in a dialogue | 4.06 |
| Avg. # of tokens in an utterance | 11.46 |

Table 1: Statistics of the MOD dataset.

| | Train | Valid | Easy test | Hard test |
| --- | --- | --- | --- | --- |
| # dialogues | 41,644 | 1,000 | 1,000 | 1,530 |
| # utterances | 558,181 | 13,666 | 13,999 | 20,358 |
| # tokens | 5,249 | 2,724 | 2,782 | 3,166 |
| # Internet memes | 274 | 274 | 274 | 307 |

Table 2: The split statistics of the MOD dataset.

Figure 2: Internet meme frequency in the dataset. The meme usage balances without significant bias. Meme ids greater than 274 only occur in hard test set.

Figure 3: Histogram of top-10 annotated emotions when memes are used. Positive emotions (pink) occur significantly more often than negative emotions (blue).

Step 3: Quality control.

Before formal annotation, annotators are asked to annotate training samples until their results pass our examination. During the annotation, to reduce subjective inconsistency and make the annotation reliable, several specialized workers continuously monitor the collected dialogue data and perform periodic quality checks on samples. After this check, we sample 10% of the data and manually check the samples ourselves. If errors are found in an annotation batch, we ask the corresponding annotators to self-check and re-annotate the whole batch. As a result, the annotated conversations are close to real-world natural conversations.

3.3 Dataset Statistics and Analysis

The detailed statistics of the MOD dataset are summarized in Table 1. A dialogue in the MOD dataset has 13.42 turns on average, and each turn contains 11.46 tokens on average. The text is tokenized by a Chinese BERT tokenizer Wang et al. (2020). We also plot the usage frequency of Internet memes and of the corresponding emotions in Figures 2 and 3, respectively. To validate MOD performance in this work, the final dialogue dataset is divided into training, validation, and test sets. The split is based on dialogues, not source-target pairs, and the split statistics are summarized in Table 2. In particular, the test set is divided into an easy version, in which all Internet memes are seen in the training set, and a hard version, which includes unseen Internet memes. The motivation for building a hard version is to evaluate whether a MOD model can be transferred to exploit new Internet memes. This situation is common in online chats because a limited candidate set cannot handle all cases in real situations.
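As a rough illustration of the easy/hard distinction, the sketch below partitions held-out dialogues by whether every meme they use was seen in training. It assumes that one unseen meme is enough to make a dialogue "hard"; the function name and the toy data are hypothetical, and the exact rule used to build the released split is not specified here.

```python
from typing import List, Optional, Set, Tuple

# A dialogue is represented only by the meme id of each utterance (None = no meme).
Dialogue = List[Optional[int]]


def split_by_seen_memes(
    dialogues: List[Dialogue], seen_memes: Set[int]
) -> Tuple[List[Dialogue], List[Dialogue]]:
    """Return (easy, hard): easy dialogues use only memes seen in training."""
    easy, hard = [], []
    for dialogue in dialogues:
        used = {m for m in dialogue if m is not None}
        if used <= seen_memes:      # every used meme was seen in training
            easy.append(dialogue)
        else:                       # at least one unseen meme (assumed rule)
            hard.append(dialogue)
    return easy, hard


# Toy example: ids 1..274 are treated as seen, larger ids as unseen.
seen = set(range(1, 275))
held_out = [[1, None, 30], [None, 280, 5], [None, 12]]
easy, hard = split_by_seen_memes(held_out, seen)
print(len(easy), len(hard))
```

Under this rule a hard dialogue can still contain seen memes, which is consistent with reporting seen and unseen targets separately on the hard test set.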

| Dataset | Type | History | Response | Size |
| --- | --- | --- | --- | --- |
| VisDial (2017) | task | image+text | text | 120K |
| MMD (2018) | task | image+text | image+text | 150K |
| AVSD (2019) | task | audio+video+text | text | 11K |
| SRS (2020) | open | text | sticker | 340K |
| MOD | open | meme+text | meme+text | 45K |

Table 3: Comparison with other multimodal dialogue datasets.

Figure 4: Overview of MOD-GPT. The proposed model consists of an Internet meme embedder, a text embedder, and a multi-layer Transformer Decoder. The model is optimized with several training tasks.

To the best of our knowledge, our dataset is novel in the sense that we explicitly guide the annotators to participate in the process of creating engaging and informative MOD. Table 3 compares existing multimodal dialogue datasets with ours; the MOD dataset focuses more on dialogue learning and emotion modeling. Considering that properly annotated MOD data is still scarce, we will continue to enrich our dataset in the future.

4 MOD-GPT

Using the human-annotated dataset, we develop a simple and effective model that aims at incorporating Internet memes into open-domain dialogues. Figure 4 presents the overall multi-task training framework of our model.

4.1 Model Overview

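In our method, response generation is modeled as a language modeling problem. Correspondingly, the multimodal response $u_K = (S_K, m_K)$, consisting of text $S_K$ and Internet meme $m_K$, can be produced in order given the dialogue history $U_{<K}$:

$$P(u_K \mid U_{<K}) = P(S_K \mid U_{<K}) \, P(m_K \mid U_{<K}, S_K)$$

Practically, we employ a multi-layer Transformer decoder architecture Vaswani et al. (2017) to build the probability of output.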

Text Input. For texts, we follow CDial-GPT Wang et al. (2020) and tokenize the input sentences with a Chinese BERT tokenizer.

Meme Input. For Internet memes, we first use pre-trained EfficientNet Tan and Le (2019) to extract visual features. Then, all meme features are fed through a fully-connected layer to be projected into the same embedding space as text embedding.

To enable our model to distinguish among the different parts of the input (text, meme, user1, and user2) and to make use of the order of the sequence, the final representation of each token is obtained by summing its feature embedding, position embedding, and segment embedding.
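The input construction can be sketched with plain Python lists: a meme's visual feature is projected into the embedding space by a fully-connected layer, and the final token representation is the element-wise sum of feature, position, and segment embeddings. The dimensions, weights, and the `SEGMENT_EMB` table below are toy values chosen for illustration, not the model's actual parameters.

```python
from typing import Dict, List

Vector = List[float]


def project_meme(feature: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Fully-connected layer mapping a visual feature into the text embedding space."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weight, bias)]


def input_representation(feature_emb: Vector, position_emb: Vector,
                         segment_emb: Vector) -> Vector:
    """Final token representation: feature + position + segment embedding."""
    return [f + p + s for f, p, s in zip(feature_emb, position_emb, segment_emb)]


# Toy segment table (hypothetical values) for the four input types named in the text.
SEGMENT_EMB: Dict[str, Vector] = {
    "text": [0.0, 0.0, 0.0],
    "meme": [0.25, 0.25, 0.25],
    "user1": [0.5, 0.0, 0.0],
    "user2": [0.0, 0.5, 0.0],
}

visual_feature = [1.0, 2.0]                       # stand-in for a CNN feature
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # 2-d feature -> 3-d embedding
bias = [0.0, 0.0, 0.5]

meme_emb = project_meme(visual_feature, weight, bias)
token_repr = input_representation(meme_emb, [0.5, 0.5, 0.5], SEGMENT_EMB["meme"])
print(meme_emb, token_repr)
```

Because text tokens and projected meme features end up in the same space, a single decoder can consume both without modality-specific branches.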

4.2 Mixed Response Generation

When generating an Internet meme incorporated response, three problems arise: (i) what text to respond with (text generation), (ii) whether to use an Internet meme in the current response (binary classification), and (iii) which Internet meme to select (meme selection).

For text response modeling, we use the conventional conditional probability formulation and optimize the negative log-likelihood loss on the training dataset:

$$\mathcal{L}_{\text{text}} = -\sum_{k} \sum_{j=1}^{n_k} \log P(w_{k,j} \mid U_{<k}, w_{k,<j})$$

where $w_{k,j}$ is the $j$-th word in the $k$-th text response, $n_k$ corresponds to the length of the sentence, and $U_{<k}$ is the dialogue history preceding the $k$-th utterance.

We address the latter two problems in a uniform way in this work. Suppose the original token vocabulary is $V$; we extend it with an extra token “[tag]”, which is inserted after the token “[eos]” in each response. During inference, the model can decide either to retrieve an Internet meme from the set $M$ to be inserted or not to use an Internet meme. By utilizing this special token, the two questions are merged. For meme usage prediction, we utilize a one-layer MLP to project the hidden state $\mathbf{h}_k$ of the “[tag]” token into a two-class distribution and optimize with the cross-entropy loss:

$$\mathcal{L}_{\text{usage}} = -\sum_{k} \log P(y_k \mid \mathbf{h}_k)$$

where for each utterance $u_k = (S_k, m_k)$, if $m_k$ = None, then $y_k = 0$; else $y_k = 1$. For predicting meme features, unlike textual tokens, which are represented as discrete labels, meme features are high-dimensional and continuous. Instead of clustering meme features into discrete labels, we adopt the meme feature regression method following Li et al. (2020). In particular, we apply another one-layer MLP to transform the hidden state $\mathbf{h}_k$ to a vector of the same dimension as the target meme feature $\mathbf{f}_k$ and optimize the model with the $L_2$ loss:

$$\mathcal{L}_{\text{meme}} = \sum_{k:\, m_k \neq \text{None}} \lVert \text{MLP}(\mathbf{h}_k) - \mathbf{f}_k \rVert_2^2$$
4.3 Learning Objective

For meme usage prediction and meme selection, both auxiliary tasks are trained together with the main text generation loss, which can be regarded as multi-task learning. The multi-task loss consists of the text response loss $\mathcal{L}_{\text{text}}$, the meme usage prediction loss $\mathcal{L}_{\text{usage}}$, and the meme selection loss $\mathcal{L}_{\text{meme}}$. The total loss function of our model is then computed as follows:

$$\mathcal{L} = \mathcal{L}_{\text{text}} + \lambda_1 \mathcal{L}_{\text{usage}} + \lambda_2 \mathcal{L}_{\text{meme}}$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters that work as scale factors.
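To make the multi-task objective concrete, the following sketch computes the three loss terms for a single response and combines them with two scale factors. It assumes the total loss is the text loss plus weighted usage and meme-selection losses, and that the regression term is a squared L2 distance; the function names and the scale-factor names `lam_usage` and `lam_meme` are introduced here for illustration.

```python
import math
from typing import List


def text_nll(token_probs: List[float]) -> float:
    """Negative log-likelihood of the reference tokens under the model."""
    return -sum(math.log(p) for p in token_probs)


def usage_ce(p_use: float, label: int) -> float:
    """Binary cross-entropy for meme usage (label 1 = a meme is used)."""
    return -math.log(p_use if label == 1 else 1.0 - p_use)


def meme_l2(predicted: List[float], target: List[float]) -> float:
    """Squared L2 distance between regressed and target meme features."""
    return sum((a - b) ** 2 for a, b in zip(predicted, target))


def total_loss(l_text: float, l_usage: float, l_meme: float,
               lam_usage: float = 1.0, lam_meme: float = 1.0) -> float:
    """Weighted sum of the three terms; both scale factors default to 1."""
    return l_text + lam_usage * l_usage + lam_meme * l_meme


# Toy numbers for one response (hypothetical model outputs).
l_text = text_nll([0.5, 0.25])
l_usage = usage_ce(0.8, 1)
l_meme = meme_l2([1.0, 2.0], [1.0, 4.0])
print(total_loss(l_text, l_usage, l_meme))
```

With both scale factors at 1, the auxiliary terms contribute on the same footing as the text loss; smaller values would down-weight the meme-related tasks.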

4.4 Pre-training Tasks

Internet Meme Feature Extraction.

Existing convolutional neural networks (CNNs), including EfficientNet, are mostly trained on real-world photos. Thus, directly applying these networks to Internet memes to extract features is not feasible. In the dataset of Gao et al. (2020), each sticker is given an emoji tag, which denotes its general emotion. Considering the relevance between Internet memes and stickers, we adopt a pre-training classification task to help the model understand memes effectively. In particular, we utilize the features output by the CNN to predict which emoji is attached to the corresponding sticker. An extra MLP layer is added, and the cross-entropy loss is used as the optimization objective.

Cross-modal Emotion Modeling.

The initial parameters of our MOD-GPT model are loaded from a model trained on a text-only Chinese corpus, which lacks cross-modal knowledge. Thus, we utilize the extra emotion labels contained in our dataset and introduce an emotion analysis task to help the model better handle Internet meme content. Specifically, given the conversation history, the system aims to predict the emotion label associated with the Internet meme in the last utterance, which can also be regarded as a classification problem. Note that we re-sample the top-100 emotion annotations to avoid training bias.

5 Experiments

5.1 Baseline Settings

As discussed earlier, the MOD task has been under-investigated so far, and there are few existing baselines for comparison. Because of their relevance to our MOD task, we choose to compare with the following models:

  • SRS (text history and meme response) Gao et al. (2020) learns the representation of each utterance using a self-attention mechanism and extracts meme representation by CNN. A deep interaction network is employed to fully model the dependency between the sticker and utterances to obtain the target meme.

  • CDial-GPT (text history and text response) Wang et al. (2020) is a 12-layer Transformer Decoder under the setting of DialoGPT for Chinese conversation generation.

  • MHERAD Saha et al. (2018) is a multimodal hierarchical encoder-decoder model that incorporates the visual features into the basic HRED model and achieves a promising performance for task-oriented dialogue in the retailing domain.

5.2 Implementation Details

Since our dataset is multi-turn, we take every sentence in the dialogue, from the second to the last, as the response to its preceding dialogue history. We implemented the MOD-GPT model with PyTorch under the HuggingFace framework Wolf et al. (2019). The Transformer configuration is identical to the base version of CDial-GPT. ADAM Kingma and Ba (2015) was used to optimize with the initial learning rate of 510. We set both $\lambda_1$ and $\lambda_2$ to 1. The context length is truncated to 500. The validation set was used for hyper-parameter tuning. All MLP networks used ReLU activation with dropout and one hidden layer. Note that for a fair comparison, all the baselines and our model adopt EfficientNet-b0 Tan and Le (2019) as the basic architecture of the meme feature extractor. All text responses are generated using the Nucleus sampling scheme Holtzman et al. (2020) with a threshold of 0.9 and a temperature of 0.7.
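For readers unfamiliar with Nucleus (top-p) sampling, the sketch below shows one decoding step with the stated threshold 0.9 and temperature 0.7. It is a generic standard-library implementation, not the code used in the experiments; applying the temperature before the top-p filter is an assumption, and the function names are illustrative.

```python
import math
import random
from typing import Dict, List, Optional


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs: List[float], top_p: float) -> Dict[int, float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits: List[float], top_p: float = 0.9, temperature: float = 0.7,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token id from the nucleus of the temperature-scaled distribution."""
    rng = rng or random.Random()
    nucleus = nucleus_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


# The low-probability tail (token 3) is cut off before sampling.
print(nucleus_filter([0.5, 0.3, 0.15, 0.05], 0.9))
```

A temperature below 1 sharpens the distribution before truncation, so fewer tokens are typically needed to reach the 0.9 mass.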

5.3 Evaluation Metrics

For text response generation, we use perplexity, which measures the language quality of the generated response. We also report BLEU-2,4 Papineni et al. (2002) and distinct n-gram Li et al. (2016) scores, which measure, respectively, the similarity between the generated responses and the ground truth via n-gram matching and the number of distinct n-grams in the generated responses. For Internet meme selection, we use R@k as the evaluation metric, where k is set to 1, 2, and 5; a model prediction is considered correct only if the true response is among the top-k entries in the ranked list of candidates.
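The two less common metrics can be written in a few lines. The sketch below assumes that the distinct n-gram score is the ratio of unique n-grams to all n-grams in the generated responses (one common convention; other normalizations exist) and that recall at k counts a hit when the target is within the first k ranked candidates. Function names and toy inputs are illustrative.

```python
from typing import List, Sequence


def distinct_n(responses: List[List[str]], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over all generated responses."""
    ngrams = [tuple(tokens[i:i + n])
              for tokens in responses
              for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def recall_at_k(ranked_lists: List[Sequence[int]], targets: List[int], k: int) -> float:
    """Fraction of examples whose target id is among the top-k ranked candidates."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


# Toy tokenized responses and meme rankings (hypothetical data).
responses = [["a", "b", "a"], ["a", "b"]]
rankings = [[3, 1, 2], [5, 4, 6]]
targets = [1, 6]
print(distinct_n(responses, 1), distinct_n(responses, 2))
print(recall_at_k(rankings, targets, 1), recall_at_k(rankings, targets, 2))
```

By construction, recall at k never decreases as k grows, which is why R@5 is always at least R@1 in the reported results.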

5.4 Quantitative Results

Evaluating the Text Response.

The performance of the baselines and MOD-GPT in textual response generation on the two test versions is summarized in Table 4. We have the following observations: 1) MOD-GPT surpasses MHERAD on all evaluation scores, demonstrating the usefulness and superiority of a unified Transformer-based structure for text generation compared with a hierarchical LSTM. 2) CDial-GPT, which has the same network architecture but only has access to the text history, achieves lower scores on the evaluation metrics. In contrast, our MOD-GPT benefits to some extent from the Internet memes in the historical context. We attribute this to the rich information contained in Internet memes, which helps dialogue systems infer answers.

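| Test set | Model | Perplexity | B-2 | B-4 | Dist-1 | Dist-2 |
| --- | --- | --- | --- | --- | --- | --- |
| easy test | MHERAD | 31.50 | 2.10 | 0.46 | 1.32 | 13.75 |
| easy test | CDial-GPT | 19.32 | 5.63 | 1.15 | 1.53 | 18.62 |
| easy test | MOD-GPT | 19.18 | 6.06 | 1.88 | 1.61 | 19.25 |
| hard test | MHERAD | 32.27 | 2.05 | 0.44 | 1.40 | 14.43 |
| hard test | CDial-GPT | 19.68 | 5.54 | 1.18 | 1.68 | 20.22 |
| hard test | MOD-GPT | 19.41 | 5.98 | 1.76 | 1.80 | 21.65 |

Table 4: Performance (%) of the models on the “text response modeling” task on the MOD test set.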

Evaluating the Internet Meme Usage.

For the meme usage prediction task, the binary classification accuracy scores on the two test versions are 89.5% and 86.2%, respectively. This shows that MOD-GPT performs well at deciding whether an Internet meme should be included on both test sets.

For the meme retrieval task, we only consider those dialogue turns ending in an Internet meme response from the system, and the model is provided with a set of candidate Internet memes in which one is relevant and the others are randomly sampled. The model has to rank the candidate Internet memes in order of their relevance given the multimodal context. We report the performance evaluation in Table 5 and obtain the following findings: 1) MOD-GPT achieves the best performance in this task. In particular, the R@5 score of MOD-GPT approaches 0.83. In contrast, SRS only calculates the similarity between the textual features of the chatting context and the visual features of memes, which results in degraded performance. 2) As expected, the unified Transformer-based model has better retrieval capability than the hierarchical LSTM-based model MHERAD. 3) Performance on the hard version of the test set, especially for the unseen memes, is worse than on the corresponding easy version, which reveals that the hard version is more challenging and that generalization to various memes remains an open problem. 4) When the candidate size increases up to the total meme set, the recall score decreases considerably, which leaves room for future improvement.
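The retrieval protocol can be illustrated as follows: build a candidate list containing the target meme plus randomly sampled distractors, then rank the candidates against a predicted meme feature. Ranking by squared L2 distance to a regressed feature is an assumption made here for illustration, as are the candidate count, the function names, and the toy features.

```python
import random
from typing import Dict, List


def build_candidates(target: int, all_memes: List[int], num_candidates: int,
                     rng: random.Random) -> List[int]:
    """One relevant meme plus randomly sampled distractors, in shuffled order."""
    distractors = rng.sample([m for m in all_memes if m != target], num_candidates - 1)
    candidates = distractors + [target]
    rng.shuffle(candidates)
    return candidates


def rank_by_l2(predicted: List[float], features: Dict[int, List[float]],
               candidates: List[int]) -> List[int]:
    """Sort candidate meme ids from closest to farthest from the predicted feature."""
    def squared_distance(meme_id: int) -> float:
        return sum((p - f) ** 2 for p, f in zip(predicted, features[meme_id]))
    return sorted(candidates, key=squared_distance)


# Toy meme features (hypothetical) and a predicted feature close to meme 2.
features = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [5.0, 5.0]}
ranking = rank_by_l2([0.9, 1.2], features, [1, 2, 3])
candidates = build_candidates(5, list(range(1, 21)), 10, random.Random(0))
print(ranking, candidates)
```

Enlarging the candidate list to the whole meme set makes the ranking problem harder, which matches the drop in recall noted in finding 4.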

| Test set | Model | R@1 | R@2 | R@5 | R@5 |
| --- | --- | --- | --- | --- | --- |
| easy test | MHERAD | 44.1 | 59.6 | 80.3 | 25.1 |
| easy test | SRS | 46.2 | 61.9 | 81.1 | - |
| easy test | MOD-GPT | 52.3 | 64.3 | 83.6 | 32.3 |
| hard test (seen) | MHERAD | 43.8 | 59.0 | 78.5 | 23.8 |
| hard test (seen) | SRS | 44.5 | 60.8 | 80.4 | - |
| hard test (seen) | MOD-GPT | 52.4 | 63.0 | 81.7 | 31.5 |
| hard test (unseen) | MHERAD | 35.3 | 43.8 | 60.1 | 16.6 |
| hard test (unseen) | SRS | 37.0 | 45.8 | 63.2 | - |
| hard test (unseen) | MOD-GPT | 45.4 | 51.2 | 70.5 | 27.0 |

Table 5: Performance (%) of the models on the “meme retrieval” task on the MOD test set. Results for target Internet memes that appear in the training set (seen) and that do not appear in the training set (unseen) are listed separately.

Figure 5: Examples of Internet meme incorporated dialogues produced by the MOD-GPT (System) and humans (User). All used memes are labeled with corresponding emotions by humans for the convenience of understanding.

Figure 6: Examples of the attention weights of the previous dialogue utterance when predicting the target meme with annotated emotion “shy” in red.

5.5 Case Study

Several interactive cases generated by MOD-GPT and humans are provided in Figure 5. These dialogue samples suggest that our proposed method is capable of expressive communication that incorporates Internet memes. Besides, according to past studies of Transformer networks Kovaleva et al. (2019), the highest layers of a Transformer model mainly encode task-specific features for predictions. Thus we rely on the mean of all the attention maps in the last layer of the MOD-GPT model to represent the token-level interrelations corresponding to each generated Internet meme: the attention score between the “[tag]” token and each historical token reflects that token’s contribution to the prediction. We observe that the MOD-GPT model always gives a higher attention weight to emotional adjectives, such as “sorry” in Figure 6.
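The attention analysis amounts to averaging the last layer's attention maps over heads and reading off the row that belongs to the “[tag]” query position. A minimal sketch on a toy two-head attention tensor follows; the nested-list layout `attention[head][query][key]`, the function name, and the numbers are assumptions for illustration.

```python
from typing import List

# attention[head][query_position][key_position], each row summing to 1.
Attention = List[List[List[float]]]


def tag_attention(last_layer_attention: Attention, tag_index: int) -> List[float]:
    """Average over heads the attention that the [tag] position pays to every token."""
    num_heads = len(last_layer_attention)
    seq_len = len(last_layer_attention[0][tag_index])
    return [
        sum(head[tag_index][j] for head in last_layer_attention) / num_heads
        for j in range(seq_len)
    ]


# Toy last-layer attention with 2 heads and 3 positions; position 2 is "[tag]".
toy_attention = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.125, 0.75, 0.125]],
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.25, 0.5, 0.25]],
]
scores = tag_attention(toy_attention, tag_index=2)
most_attended = max(range(len(scores)), key=scores.__getitem__)
print(scores, most_attended)
```

The position with the largest averaged score is the one read as contributing most to the meme prediction, as with the highlighted word in Figure 6.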

5.6 Ablation Study

There are two pre-training tasks in the proposed multimodal dialogue framework. To better understand the contribution of each task, we carried out an ablation study. As shown in Table 6, both pre-training tasks help to boost response generation quality to some degree. In particular, integrating the Internet meme feature extraction (IMFE) training task helps the model better understand Internet memes, which leads to a +5.1% R@1 improvement over the setting without it (47.2 to 52.3). Using cross-modal emotion modeling (CEM) also brings a clear improvement.

| Model | B-2 | Dist-1 | R@1 | R@5 |
| --- | --- | --- | --- | --- |
| MOD-GPT | 6.06 | 1.61 | 52.3 | 83.6 |
| -IMFE | 5.95 | 1.35 | 47.2 | 79.5 |
| -CEM | 5.82 | 1.27 | 48.4 | 80.2 |

Table 6: Performance (%) of ablation study on the MOD easy test set. Both pre-training tasks have an impact on improving performance.

6 Conclusion

As Internet memes propagate widely through social networks, an interesting research direction is incorporating Internet memes into open-domain dialogue to make conversations vivid and engaging. In this paper, we take a step in this direction by: (i) the release of the MOD dataset, which focuses on a suitable form of response according to the multimodal historical context, with abundant emotion labeling, and (ii) a demonstration of a neural conversation system that can generate Internet meme incorporated dialogues under a simple and effective framework. We believe that the ability to deal with the MOD task can serve as an important testbed for measuring progress toward multimodal open-domain dialogue intelligence.


  • H. AlAmri, V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. Batra, T. K. Marks, C. Hori, P. Anderson, S. Lee, and D. Parikh (2019) Audio visual scene-aware dialog. In

    IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019

    pp. 7558–7567. External Links: Link, Document Cited by: §2.2, Table 3.
  • F. Barbieri, M. Ballesteros, F. Ronzano, and H. Saggion (2018) Multimodal emoji prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), M. A. Walker, H. Ji, and A. Stent (Eds.), pp. 679–686. External Links: Link, Document Cited by: §1, §2.1.
  • F. Barbieri, M. Ballesteros, and H. Saggion (2017) Are emojis predictable?. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, M. Lapata, P. Blunsom, and A. Koller (Eds.), pp. 105–111. External Links: Link, Document Cited by: §1, §2.1.
  • D. M. Beskow, S. Kumar, and K. M. Carley (2020)

    The evolution of political memes: detecting and characterizing internet memes with multi-modal deep learning

    Inf. Process. Manag. 57 (2), pp. 102170. External Links: Link, Document Cited by: §1.
  • C. Chen (2020) Research on sticker cognition for elderly people using instant messaging. In Cross-Cultural Design. User Experience of Products, Services, and Intelligent Environments - 12th International Conference, CCD 2020, Held as Part of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, July 19-24, 2020, Proceedings, Part I, P. P. Rau (Ed.), Lecture Notes in Computer Science, Vol. 12192, pp. 16–27. External Links: Link, Document Cited by: §1.
  • C. Cui, W. Wang, X. Song, M. Huang, X. Xu, and L. Nie (2019) User attention-guided multimodal dialog systems. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, B. Piwowarski, M. Chevalier, É. Gaussier, Y. Maarek, J. Nie, and F. Scholer (Eds.), pp. 445–454. External Links: Link, Document Cited by: §A.2, §1.
  • A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, and D. Batra (2017) Visual dialog. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1080–1089. External Links: Link, Document Cited by: §1, §2.2, Table 3.
  • S. Gao, X. Chen, C. Liu, L. Liu, D. Zhao, and R. Yan (2020) Learning to respond with stickers: A framework of unifying multi-modality in multi-turn dialog. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, Y. Huang, I. King, T. Liu, and M. van Steen (Eds.), pp. 1138–1148. External Links: Link, Document Cited by: §2.1, Table 3, §4.4, 1st item.
  • J. Haber, T. Baumgärtner, E. Takmaz, L. Gelderloos, E. Bruni, and R. Fernández (2019) The PhotoBook dataset: building common ground through visually-grounded dialogue. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1895–1910. External Links: Link, Document Cited by: §1.
  • W. He, Z. Li, D. Lu, E. Chen, T. Xu, B. Huai, and J. Yuan (2020) Multimodal dialogue systems via capturing context-aware dependencies of semantic elements. In MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, C. W. Chen, R. Cucchiara, X. Hua, G. Qi, E. Ricci, Z. Zhang, and R. Zimmermann (Eds.), pp. 2755–2764. External Links: Link, Document Cited by: §A.2.
  • A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2020) The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §5.2.
  • I. Jesus, J. Cardoso, A. J. G. Busson, Á. L. V. Guedes, S. Colcher, and R. L. Milidiú (2019) A cnn-based tool to index emotion on anime character stickers. In IEEE International Symposium on Multimedia, ISM 2019, San Diego, CA, USA, December 9-11, 2019, pp. 319–322. External Links: Link, Document Cited by: §2.1.
  • H. Kamezawa, N. Nishida, N. Shimizu, T. Miyazaki, and H. Nakayama (2020) A visually-grounded first-person dialogue dataset with verbal and non-verbal responses. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 3299–3310. External Links: Link, Document Cited by: §2.2.
  • D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, and D. Testuggine (2020) The hateful memes challenge: detecting hate speech in multimodal memes. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §2.1.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §5.2.
  • O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky (2019) Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 4364–4373. External Links: Link, Document Cited by: §5.5.
  • A. Laddha, M. Hanoosh, D. Mukherjee, P. Patwa, and A. Narang (2020) Understanding chat messages for sticker recommendation in messaging apps. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 13156–13163. External Links: Link Cited by: §2.1.
  • H. Le, D. Sahoo, N. F. Chen, and S. C. H. Hoi (2019) Multimodal transformer networks for end-to-end video-grounded dialogue systems. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 5612–5623. External Links: Link, Document Cited by: §1.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, K. Knight, A. Nenkova, and O. Rambow (Eds.), pp. 110–119. External Links: Link, Document Cited by: §5.3.
  • Z. Li, Z. Li, J. Zhang, Y. Feng, C. Niu, and J. Zhou (2020) Bridging text and video: A universal multimodal transformer for video-audio scene-aware dialog. CoRR abs/2002.00163. External Links: Link, 2002.00163 Cited by: §2.2, §4.2.
  • L. Liao, Y. Ma, X. He, R. Hong, and T. Chua (2018) Knowledge-aware multimodal dialogue systems. In 2018 ACM Multimedia Conference on Multimedia Conference, MM 2018, Seoul, Republic of Korea, October 22-26, 2018, S. Boll, K. M. Lee, J. Luo, W. Zhu, H. Byun, C. W. Chen, R. Lienhart, and T. Mei (Eds.), pp. 801–809. External Links: Link, Document Cited by: §A.2, §1.
  • M. Miliani, G. Giorgi, I. Rama, G. Anselmi, and G. E. Lebani (2020) DANKMEMES @ EVALITA 2020: the memeing of life: memes, multimodality and politics. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online event, December 17th, 2020, V. Basile, D. Croce, M. D. Maro, and L. C. Passaro (Eds.), CEUR Workshop Proceedings, Vol. 2765. External Links: Link Cited by: §2.1.
  • T. Milo, A. Somech, and B. Youngmann (2019) SimMeme: A search engine for internet memes. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019, pp. 974–985. External Links: Link, Document Cited by: §2.1.
  • N. Mostafazadeh, C. Brockett, B. Dolan, M. Galley, J. Gao, G. P. Spithourakis, and L. Vanderwende (2017) Image-grounded conversations: multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP 2017, Taipei, Taiwan, November 27 - December 1, 2017 - Volume 1: Long Papers, G. Kondrak and T. Watanabe (Eds.), pp. 462–472. External Links: Link Cited by: §2.2.
  • L. Nie, W. Wang, R. Hong, M. Wang, and Q. Tian (2019) Multimodal dialog system: generating responses via adaptive decoders. In Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, October 21-25, 2019, L. Amsaleg, B. Huet, M. A. Larson, G. Gravier, H. Hung, C. Ngo, and W. T. Ooi (Eds.), pp. 1098–1106. External Links: Link, Document Cited by: §1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pp. 311–318. External Links: Link, Document Cited by: §5.3.
  • J. Perez-Martin, B. Bustos, and M. Saldaña (2020) Semantic search of memes on twitter. CoRR abs/2002.01462. External Links: Link, 2002.01462 Cited by: §2.1.
  • C. Posey, P. B. Lowry, T. L. Roberts, and T. S. Ellis (2010) Proposing the online community self-disclosure model: the case of working professionals in france and the U.K. who use online communities. Eur. J. Inf. Syst. 19 (2), pp. 181–195. External Links: Link, Document Cited by: §2.1.
  • A. Saha, M. M. Khapra, and K. Sankaranarayanan (2018) Towards building large scale multimodal domain-aware conversation systems. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 696–704. External Links: Link Cited by: §A.2, §2.2, Table 3, 3rd item.
  • C. Sharma, D. Bhageria, W. Scott, P. Srinivas, A. Das, T. Chakraborty, V. Pulabaigari, and B. Gambäck (2020) SemEval-2020 task 8: memotion analysis- the visuo-lingual metaphor!. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, Barcelona (online), December 12-13, 2020, A. Herbelot, X. Zhu, A. Palmer, N. Schneider, J. May, and E. Shutova (Eds.), pp. 759–773. External Links: Link Cited by: §2.1.
  • M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 6105–6114. External Links: Link Cited by: §4.1, §5.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §4.1.
  • Y. Wang, P. Ke, Y. Zheng, K. Huang, Y. Jiang, X. Zhu, and M. Huang (2020) A large-scale chinese short-text conversation dataset. In Natural Language Processing and Chinese Computing - 9th CCF International Conference, NLPCC 2020, Zhengzhou, China, October 14-18, 2020, Proceedings, Part I, X. Zhu, M. Zhang, Y. Hong, and R. He (Eds.), Lecture Notes in Computer Science, Vol. 12430, pp. 91–103. External Links: Link, Document Cited by: §3.2, §3.3, §4.1, 2nd item.
  • Y. Wang, Y. Li, X. Gui, Y. Kou, and F. Liu (2019) Culturally-embedded visual literacy: A study of impression management via emoticon, emoji, sticker, and meme on social media in china. Proc. ACM Hum. Comput. Interact. 3 (CSCW), pp. 68:1–68:24. External Links: Link, Document Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. CoRR abs/1910.03771. External Links: Link, 1910.03771 Cited by: §5.2.
  • R. Xie, Z. Liu, R. Yan, and M. Sun (2016) Neural emoji recommendation in dialogue systems. CoRR abs/1612.04609. External Links: Link, 1612.04609 Cited by: §1.

Appendix A Appendix

A.1 Ethical Considerations

The original copyright of all the conversations belongs to the source owner, and the conversations are public for academic use. The Internet meme sets are freely accessible online. The copyright of the annotations belongs to our group, and they will be freely released to the public. After consulting a legal advisor, we make the MOD dataset freely accessible online for academic use. Without permission, it may not be used for any commercial purpose or distributed to others.

Our data construction involves manual annotation. The annotated conversation corpus and Internet meme set do not contain sensitive personal information. We asked annotators to restrict their work to incorporating Internet memes into the given dialogues and not to include any personal information. The annotators received a reasonable salary for their annotation work.

A.2 Technical Difference with Other Multimodal Dialogue Models

Apart from related multimodal dialogue tasks and datasets, we also discuss the technical differences between our model and preceding multimodal dialogue models. Saha et al. (2018) introduce a hierarchical structure, which first uses a multimodal encoder to extract the image and text features and then adopts a high-level RNN to model the historical dialogue information; it also serves as a baseline in our paper. Cui et al. (2019) propose an adaptive decoder, which first determines whether the reply should take the form of text or an image and then generates the corresponding response. In Liao et al. (2018), a chat session is modeled as a reinforcement learning procedure, and a reward is formed to optimize the answer. He et al. (2020) further consider, with a self-attention block, how the order of images and text in the dialogue history influences the answers. Comparatively, we unify text generation and meme prediction into a single long-sequence procedure and solve them with a cross-modal GPT-based language model.
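To make the idea of a single long sequence more concrete, the following is a minimal sketch of how a dialogue containing both text and memes might be flattened into one token stream for a language model. It is an illustration under assumptions, not the MOD-GPT implementation: the special tokens `[speaker1]`, `[speaker2]`, `[meme]`, and `[eos]`, the `<meme_ID>` placeholder, the function name `flatten_dialogue`, and the whitespace tokenization are all hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical special tokens; the actual vocabulary of the model may differ.
SPEAKER_TOKENS = ("[speaker1]", "[speaker2]")
MEME_TOKEN = "[meme]"
EOS_TOKEN = "[eos]"

# One turn is (utterance text, meme id or None if the turn has no meme).
Turn = Tuple[str, Optional[int]]


def flatten_dialogue(turns: List[Turn]) -> List[str]:
    """Serialize alternating-speaker turns into one flat token sequence."""
    sequence: List[str] = []
    for index, (text, meme_id) in enumerate(turns):
        # Speakers are assumed to alternate turn by turn.
        sequence.append(SPEAKER_TOKENS[index % 2])
        # Whitespace splitting stands in for a real tokenizer.
        sequence.extend(text.split())
        if meme_id is not None:
            # A marker token followed by a placeholder for the meme itself.
            sequence.append(MEME_TOKEN)
            sequence.append(f"<meme_{meme_id}>")
    sequence.append(EOS_TOKEN)
    return sequence


if __name__ == "__main__":
    dialogue = [("good morning", None), ("so sleepy", 42)]
    print(flatten_dialogue(dialogue))
```

With such a layout, text tokens and meme slots live in the same sequence, so one autoregressive model can in principle handle both instead of choosing a decoder per modality.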

A.3 Emotion Analysis

To clarify the capability of our MOD-GPT model in emotion prediction, we report its top-5 classification accuracy on the MOD test set in Table 7. As shown in the table, our method predicts the emotion states of Internet meme usage effectively. This suggests that it is reasonable to exploit the conversation emotion for boosting meme incorporated open-domain dialogue modeling.

Test set     Emotion Classification Accuracy
easy test    75.8
hard test    72.3
Table 7: Performance (%) of the MOD-GPT on the “emotion analysis” task in the MOD test set.
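For readers unfamiliar with the metric, top-5 accuracy counts a prediction as correct when the gold emotion label is among the five highest-scoring classes. The sketch below shows one way to compute it; the function name `top_k_accuracy` and the toy scores are illustrative assumptions and are not taken from the MOD evaluation code.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 5,
) -> float:
    """Fraction of examples whose gold label is among the k top-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    hits = 0
    for row, gold in zip(scores, labels):
        # Class indices sorted by descending score; ties keep the lower index first.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(labels)


if __name__ == "__main__":
    # Toy example with three classes and three examples.
    toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    toy_labels = [1, 1, 0]
    print(top_k_accuracy(toy_scores, toy_labels, k=2))
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the values in Table 7.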