Towards Building an Open-Domain Dialogue System Incorporated with Internet Memes

03/08/2022
by Hua Lu, et al.
Baidu, Inc.

In recent years, Internet memes have been widely used in online chatting. Compared with text-based communication, conversations become more expressive and attractive when Internet memes are incorporated. This paper presents our solutions for the Meme incorporated Open-domain Dialogue (MOD) Challenge of DSTC10, where three tasks are involved: text response modeling, meme retrieval, and meme emotion classification. Firstly, we leverage a large-scale pre-trained dialogue model for coherent and informative response generation. Secondly, based on interaction-based text-matching, our approach can retrieve appropriate memes with good generalization ability. Thirdly, we propose to model the emotion flow (EF) in conversations and introduce an auxiliary task of emotion description prediction (EDP) to boost the performance of meme emotion classification. Experimental results on the MOD dataset demonstrate that our methods can incorporate Internet memes into dialogue systems effectively.


Introduction

As Internet memes can make dialogues more vivid and engaging, people nowadays tend to incorporate memes when chatting online Kulkarni (2017); Jiang and Vásquez (2020). Although Internet memes have become an effective means of expression, they are rarely considered in most open-domain dialogue systems. In DSTC10, the Meme incorporated Open-domain Dialogue (MOD) challenge aims to incorporate Internet memes into open-domain dialogues. It includes the following three tasks: (1) Text Response Modeling: given a multi-modal context, generate a coherent and informative text response. (2) Meme Retrieval: given a multi-modal context and a text response, retrieve an appropriate meme. (3) Meme Emotion Classification: given a multi-modal context and a text response with a meme, predict the emotion type of the Internet meme. Figure 1 shows two examples of conversations in the MOD dataset involving texts, Internet memes, and emotions.

Figure 1: Two examples from the MOD dataset. The corresponding emotion is annotated for each meme in red.

In particular, the test set of MOD is divided into an easy test version and a hard test version. The latter, which contains memes not appearing in the train set, is used to evaluate the generalization ability of the dialogue system. In this work, we introduce the following solutions for the three tasks:

  • In Task1, we leverage a powerful pre-trained open-domain dialogue model for coherent and informative text response generation.

  • In Task2, we represent memes with textual information consisting of meme titles and OCR texts (extracted from the memes). Based on interaction-based text-matching, our approach can retrieve appropriate memes with good generalization ability.

  • In Task3, we propose to model the emotion flow (EF) in conversations and introduce an auxiliary task of emotion description prediction (EDP) to enhance the ability of meme emotion recognition.

Experimental results demonstrate that our methods can effectively incorporate Internet memes into dialogue systems. Our methods achieve first place in four of the six leaderboards and competitive second-place results in the remaining two.

Methodology

Our detailed solutions to the three tasks are discussed in the following subsections.

Text Response Modeling

In open-domain conversation, users are free to talk about any topic, and the system's replies are expected to meet a high standard in many aspects, including coherence, consistency, informativeness, etc. Incorporated with Internet memes, the dialogue context can be formulated as $c = \{(u_1, m_1), (u_2, m_2), \ldots, (u_{t-1}, m_{t-1})\}$, where $u_i$ represents the $i$-th utterance text and $m_i$ refers to its associated meme. If no Internet meme is used in the $i$-th utterance, $m_i$ is denoted as None. The task of text response modeling is to generate the response $r$ (i.e., the next utterance $u_t$) given the multi-modal dialogue context $c$.
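To make the data layout concrete, a multi-modal context can be represented as a simple list of utterance/meme pairs. A minimal sketch follows; the field names are our own illustration, not the official MOD schema:

```python
# One dialogue context c: each turn pairs an utterance text with a meme
# (or None when no meme is used in that turn).
context = [
    {"utterance": "How was your day?", "meme": None},
    {"utterance": "Exhausting, I worked overtime again.", "meme": "tired_cat"},
]
```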

As suggested in the MOD baseline (Fei et al., 2021), the Internet meme can be represented with visual features extracted by EfficientNet (Tan and Le, 2019). However, in our preliminary experiments, we observed that incorporating memes, whether as visual features or as textual features, brings little benefit to text response generation. The reasons might be two-fold. Firstly, as memes are usually about emotional expression and carry little narrative information, their absence might not remarkably undermine a dialogue system's text response generation ability. Secondly, given that the MOD dataset was collected by inserting memes into existing conversations, the reliance on these memes might be relatively weak for text response generation. Therefore, we treat this task as a standard text-based response generation problem, with memes in the context set to None.

Figure 2: Illustration of network inputs and training objectives for three tasks.

In this paper, we utilize the pre-trained open-domain dialogue model PLATO-2 Bao et al. (2020) for text response generation. As illustrated in Figure 2(a), the input to the network is the concatenation of context and response. The input representation is calculated as the sum of the token, segment, and position embeddings. The network employs flexible attention mechanisms, where bi-directional attention is enabled for better contextual understanding and uni-directional attention is utilized for auto-regressive response generation. The training objective of text response generation is to minimize the following negative log-likelihood (NLL) loss:

$$\mathcal{L}_{NLL} = -\,\mathbb{E}\left[\,\sum_{t=1}^{T} \log p(r_t \mid c, r_{<t})\right] \tag{1}$$

where $T$ is the length of the target response $r$ and $r_{<t}$ denotes the previously generated response tokens.
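For concreteness, a minimal PyTorch-style sketch of this objective is given below. It is illustrative only: the actual implementation is in PaddlePaddle, and `loss_mask` is our own name for the mask that restricts the loss to response tokens.

```python
import torch
import torch.nn.functional as F

def nll_loss(logits, target_ids, loss_mask):
    """Token-level NLL over the response portion of the sequence.

    logits:     [batch, seq_len, vocab] from the decoder
    target_ids: [batch, seq_len] gold tokens (context + response)
    loss_mask:  [batch, seq_len] 1.0 for response tokens, 0.0 elsewhere
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_nll = -log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    # Average only over response tokens, matching Eq. (1).
    return (token_nll * loss_mask).sum() / loss_mask.sum()
```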

Meme Retrieval

Meme retrieval is a crucial component in meme incorporated dialogue systems. This task is to select an appropriate meme from the Internet meme set, given the multi-modal dialogue context $c$ and the text response $r$. Formally, we denote the Internet meme set as $M = \{m_1, m_2, \ldots, m_N\}$, where $m_i$ is the representation of the $i$-th meme. In this work, we represent memes with textual information, consisting of meme titles and OCR texts (extracted from the memes). Although there exist plenty of vision and multi-modal pre-training works Chen et al. (2019); Li et al. (2020b); Gan et al. (2020); Radford et al. (2021), they are less effective at meme feature extraction due to the gap in data distribution between real photos and designed memes. Experimental results also suggest that the meme title and OCR text can sufficiently represent the meaning of Internet memes.

Therefore, we treat the meme retrieval task as a text-matching problem and employ the cross-encoder architecture for relevance estimation. The network overview for meme retrieval is shown in Figure 2(b). The input includes the dialogue context $c$, the textual response $r$, and one candidate meme $m$. During training, a pair of positive and negative samples is fed into the network. The output of the [CLS] token is passed through a fully-connected layer and a following sigmoid function to obtain the relevance probability $p(y \mid c, r, m)$, where $y$ stands for the label of whether to choose the meme given the dialogue context and the corresponding textual response. The training objective is to minimize the following margin ranking loss:

$$\mathcal{L}_{MR} = \max\bigl(0,\; \gamma - p(y \mid c, r, m^{+}) + p(y \mid c, r, m^{-})\bigr) \tag{2}$$

where $\gamma$ is a pre-defined margin parameter, $m^{+}$ is a positive meme, and $m^{-}$ is a negative meme. During training, we enable dynamic random negative sampling: as model training progresses, a different set of negative samples is drawn at random in each epoch.

During inference, the Internet meme for the given dialogue context and response is selected as:

$$m^{*} = \operatorname*{argmax}_{m_i \in M}\; p(y \mid c, r, m_i) \tag{3}$$
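The scoring and ranking machinery can be sketched as follows in PyTorch. This is a sketch under stated assumptions, not the authors' implementation (which fine-tunes PLATO-2's evaluation model in PaddlePaddle): `CrossEncoderRanker` and all names are ours, and we assume a BERT-like encoder whose first output is the token-level hidden states.

```python
import torch
import torch.nn as nn

class CrossEncoderRanker(nn.Module):
    """Cross-encoder relevance model: one forward pass over the
    concatenated (context, response, meme-text) sequence."""

    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder            # any BERT-like encoder (assumption)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Take the [CLS] position for sequence-level relevance.
        hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
        return torch.sigmoid(self.head(hidden[:, 0])).squeeze(-1)

def margin_ranking_loss(p_pos, p_neg, gamma=0.2):
    """Eq. (2): hinge on the positive/negative probability gap."""
    return torch.clamp(gamma - p_pos + p_neg, min=0).mean()

def retrieve(model, candidates):
    """Eq. (3): score every candidate meme and return the best index."""
    scores = torch.cat([model(ids, mask) for ids, mask in candidates])
    return int(scores.argmax())
```

Under this layout, dynamic random negative sampling simply amounts to re-drawing the negative memes $m^{-}$ from the meme set at the start of each epoch before computing the ranking loss.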

Meme Emotion Classification

Rather than classifying the sentiment of an Internet meme in isolation, this task aims to predict the emotion type of a meme situated in the dialogue context. Considering that the emotions of two interlocutors seldom encounter abrupt changes (or to some extent the changes might be traceable), we propose to model the emotion flow (EF) in multi-turn conversations. The textual descriptions of emotions are integrated into the dialogue context. Specifically, the utterance at turn $i$ is composed of the utterance text, the meme text, and the textual emotion description, i.e., $\tilde{u}_i = (u_i, m_i, e_i)$, where $e_i$ denotes the textual emotion description. Meme emotion recognition can then be treated as a classical sequence classification task, and the training objective is to minimize the standard cross-entropy loss.
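A minimal sketch of how such an EF input sequence could be assembled is shown below; the "[SEP]" separator and the field names are our assumptions, not the paper's exact layout.

```python
def build_ef_input(turns):
    """Flatten a multi-turn context into one token sequence, appending the
    meme text and the textual emotion description to each utterance,
    i.e., each turn becomes (u_i, m_i, e_i)."""
    parts = []
    for turn in turns:
        parts.append(turn["utterance"])
        if turn.get("meme_text"):
            parts.append(turn["meme_text"])
        if turn.get("emotion_desc"):
            parts.append(turn["emotion_desc"])
    return " [SEP] ".join(parts)
```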

Additionally, we introduce an auxiliary task of emotion description prediction (EDP) to boost meme emotion recognition performance. As shown in Figure 2(c), the auxiliary task is to recover the masked tokens (i.e., textual emotion description in the response) by minimizing the masked language model (MLM) loss Devlin et al. (2018). In this way, the training objective of meme emotion classification is to minimize the following integrated loss:

$$\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{MLM} \tag{4}$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss of the classification task and $\mathcal{L}_{MLM}$ denotes the MLM loss of the emotion description prediction task.
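The integrated objective reduces to a few lines. The sketch below assumes the common convention of labeling unmasked positions with -100 so they are ignored by the MLM loss; all names are illustrative.

```python
import torch.nn.functional as F

def emotion_classification_loss(cls_logits, emotion_label,
                                mlm_logits, mlm_labels):
    """Eq. (4): classification cross-entropy plus MLM loss over the
    masked emotion-description tokens (labels are -100 elsewhere)."""
    ce = F.cross_entropy(cls_logits, emotion_label)
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    return ce + mlm
```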

Experiments

In the DSTC10 MOD challenge, an open-domain dialogue dataset incorporating Internet memes is provided. The memes used in the dataset have been annotated with titles. The MOD test set is divided into an easy test version and a hard test version; the latter, containing unseen memes, is used to evaluate the ability of dialogue systems to exploit new Internet memes. Detailed statistics of the dataset are summarized in Table 1.

Settings

The evaluation of the MOD challenge covers the following three tasks:

  • Task1: Text Response Modeling. Given a dialogue context, the model needs to produce a coherent and informative text response. The automatic evaluation metrics of this task include BLEU-2/4 Papineni et al. (2002) and DIST-1/2 Li et al. (2015).

  • Task2: Meme Retrieval. Given a multi-modal dialogue context and a text response, the model needs to retrieve an appropriate Internet meme. The evaluation metrics of this task include Recall_10@1, Recall_10@3, Recall_10@5, and MAP.

  • Task3: Meme Emotion Classification. The model needs to predict the corresponding emotion type of the used Internet meme. The evaluation metrics of this task include Accuracy@1, Accuracy@3, and Accuracy@5.

Table 1: Statistics of the MOD dataset.

Implementation Details In the experiments, we utilize the pre-trained dialogue models of PLATO-2 Bao et al. (2020) to improve the performance on all three tasks. The models have 32 transformer blocks and 32 attention heads, with up to 1.6 billion parameters. The generation model of PLATO-2 is used for Task1, and the evaluation model of PLATO-2 is employed for the fine-tuning of Task2 and Task3.

In Task1, responses are generated using beam search with a beam size of 5. The maximum sequence lengths for the context and the response are set to 256 and 128, respectively. In Task2, we set the margin parameter to 0.2 and the ratio of positive to negative training samples to 1:5. During the fine-tuning of Task2 and Task3, we use the Adam Kingma and Ba (2014) optimizer with a learning rate of 2e-5 and 4000 warmup steps. All models are fine-tuned for five epochs with a batch size of 64. The implementation is based on the PaddlePaddle framework, and the experiments are carried out on 4 NVIDIA Tesla V100 GPUs (32GB memory).
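For reference, the reported settings can be collected in one place; the values are taken from the paragraph above, while the dict layout and key names are our own.

```python
# Hyperparameters as reported; grouping and key names are illustrative.
FINETUNE_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 2e-5,
    "warmup_steps": 4000,
    "epochs": 5,
    "batch_size": 64,
    "beam_size": 5,           # Task1 decoding
    "max_context_len": 256,   # Task1
    "max_response_len": 128,  # Task1
    "margin_gamma": 0.2,      # Task2 ranking loss, Eq. (2)
    "neg_pos_ratio": 5,       # Task2: 5 negatives per positive
}
```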

Experimental Results

The experimental results on these three tasks are discussed in the following.

Text Response Modeling The evaluation results of text response modeling on two test versions are summarized in Table 2, with the best score written in bold. The final ranking for this task is based on human evaluation, where five metrics are considered: grammatical correctness, informativeness, naturalness, relevance to the dialogue history, and overall feeling based on the above four metrics. The score of each metric ranges from 1 to 5. The higher, the better. The final human score is the average of the above five metric scores. We rank second on the easy version and first on the hard version. From Table 2, it can be observed that our automatic evaluation results are relatively poor, especially on the metrics of BLEU-2/4, while the human evaluation results are relatively competitive. This phenomenon further verifies that the correlation between automatic evaluation metrics and human evaluation is weak in open-domain conversations Liu et al. (2016).

Table 2: Task1 evaluation results on two test versions, with the best score written in bold.
Table 3: Task2 evaluation results on two test versions, with the best score written in bold.
Table 4: Task3 evaluation results on two test versions, with the best score written in bold.

Meme Retrieval The evaluation results of meme retrieval on the two test versions are summarized in Table 3, with the best score written in bold. The final ranking for this task is based on the Recall_10@1 score, i.e., the fraction of cases in which the ground-truth Internet meme is ranked first among ten meme candidates. Our proposed method obtains first place on both versions and outperforms the other teams by a large margin. Possible reasons behind such improvements are discussed as follows:

  • The modality of dialogue context and memes is unified by representing Internet memes with texts, making it easier for the model to estimate the relevance. Furthermore, with the pre-trained language models, our text-matching strategy has the generalization ability to retrieve the unseen memes that are not available during training.

  • Different from the dual-encoder architecture Mazaré et al. (2018); Zang et al. (2021), which performs self-attention over the input and the candidate separately, we employ the cross-encoder architecture to yield rich interactions between the dialogue context and the meme candidate. With this interaction-based text-matching, our model can retrieve appropriate memes more effectively.

Meme Emotion Classification The evaluation results of meme emotion classification on the two test versions are summarized in Table 4, with the best score written in bold. The final ranking for this task is based on the Accuracy@1 score, i.e., the fraction of cases in which the ground-truth emotion type obtains the highest score among all emotion types. We rank first on the easy version and second on the hard version. In particular, our meme emotion classification model, combining the EF and EDP strategies, achieves a 4.0% absolute improvement in Accuracy@1 on the easy test set over the second-ranked team (62.3% vs. 58.3%). On the hard test set, the performance of all models degenerates significantly, revealing that emotion classification for unseen memes remains weak.

Figure 3: Examples of the input (upper) and our output (bottom) for each task.

Case Analysis

To further analyze the performance of our proposed methods, several examples are provided in Figure 3. As shown in the left example, our model is able to generate a coherent and informative response. The interlocutor on the left-hand side seems to be a vegetarian, and our model generates a response consistent with the persona. In the middle and right examples, our model is able to retrieve a relevant meme and accurately identify the emotion type contained in the meme. These dialogue examples suggest that our system can generate a natural and informative response incorporated with an appropriate meme.

Ablation Study

In this section, several ablation studies are carried out on the validation set to better understand the contribution of each component.

Meme Retrieval

To evaluate the generalization ability of our matching model, we select 20 memes to form an unseen validation set and remove the corresponding samples from the train set. The results on this unseen validation set of matching models trained on the original train set and on the filtered train set are summarized in Table 5. The results indicate that our model has the generalization ability to exploit unseen memes. However, the gap between the original and filtered settings (50.4% vs. 43.6% on Recall_10@1) suggests that there is still room for improvement in the model's generalization ability.

Table 5: Task2 ablation study on the unseen validation set.
Table 6: Task3 ablation study on the validation set. EF refers to the modeling of emotion flow. EDP refers to the task of emotion description prediction.

Meme Emotion Classification

To boost the performance of meme emotion classification, we propose to model the emotion flow (EF) in multi-turn conversations and introduce an auxiliary task of emotion description prediction (EDP). To analyze the effects of these two components, we carry out ablation studies, whose results are summarized in Table 6. Compared to the base model, the incorporation of EF modeling brings a significant improvement (+2.2% on Accuracy@1). The combination of EF and EDP obtains a +2.6% absolute improvement on Accuracy@1 (65.3% vs. 62.7%), verifying the effectiveness of our proposed strategies.

Related Work

In this section, we will discuss related works on multi-modal conversation and emotion recognition.

There are several works that attempt to incorporate multi-modal information into conversations. Das et al. (2017) introduce the task of VisDial, where an AI agent needs to hold a meaningful conversation with humans and answer questions about the contents of an input image. In addition to conversational question answering, there are other tasks where natural and engaging conversations are conducted based on a shared image, such as image-grounded conversations Mostafazadeh et al. (2017) and image-chat Shuster et al. (2018). More recently, the PhotoChat dataset Zang et al. (2021) was presented, which focuses on photo-sharing behavior in online messaging and aims to improve the photo-sharing experience in conversations. Unlike the above works, which concentrate on photos, the MOD dataset incorporates Internet memes into open-domain conversations to enhance communication expressiveness.

In open-domain conversation, it is crucial to recognize the emotional state accurately and generate a response appropriately Rashkin et al. (2018). To boost emotion detection in conversations, HiTrans Li et al. (2020a) utilizes BERT Devlin et al. (2018) as a low-level transformer to generate utterance representations and employs another high-level transformer to obtain context representations. In TUCORE-GCN Lee and Choi (2021), the task of emotion recognition is treated as dialogue-based relation extraction, where a dialogue graph is constructed and a graph convolutional network is employed for relation classification. In addition to text-based conversations, emotion has been widely analyzed in several areas of computer vision, such as facial expression recognition Minaee et al. (2021) and image emotion classification Yang et al. (2017). In this work, rather than classifying the sentiment of an Internet meme in isolation, we aim to predict the emotion type of the meme situated in the dialogue context.

Conclusion

In this paper, we introduce our solutions for the DSTC10 MOD challenge. Firstly, we leverage a large-scale pre-trained dialogue model for coherent and informative response generation. Secondly, based on interaction-based text-matching, our approach can retrieve appropriate memes with good generalization ability. Thirdly, we propose to model the emotion flow (EF) in conversations and introduce an auxiliary task of emotion description prediction (EDP) to boost the performance of meme emotion recognition. Comprehensive experiments have been conducted on the MOD dataset, and the results demonstrate that our methods can effectively incorporate Internet memes into dialogue systems and accurately recognize meme emotions. Our methods achieve first place in four of the six leaderboards and competitive second-place results in the remaining two.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive suggestions, and Xinxian Huang for the helpful discussions.

References

  • S. Bao, H. He, F. Wang, H. Wu, H. Wang, W. Wu, Z. Guo, Z. Liu, and X. Xu (2020) Plato-2: towards building an open-domain chatbot via curriculum learning. arXiv preprint arXiv:2006.16779. Cited by: Text Response Modeling, Settings.
  • Y. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu (2019) Uniter: learning universal image-text representations. Cited by: Meme Retrieval.
  • A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra (2017) Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 326–335. Cited by: Related Work.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Meme Emotion Classification, Related Work.
  • Z. Fei, Z. Li, J. Zhang, Y. Feng, and J. Zhou (2021) Towards expressive communication with internet memes: a new multimodal conversation dataset and benchmark. arXiv preprint arXiv:2109.01839. Cited by: Text Response Modeling.
  • Z. Gan, Y. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu (2020) Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195. Cited by: Meme Retrieval.
  • Y. Jiang and C. Vásquez (2020) Exploring local meaning-making resources: a case study of a popular chinese internet meme (biaoqingbao). Internet Pragmatics 3 (2), pp. 260–282. Cited by: Introduction.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Settings.
  • A. Kulkarni (2017) Internet meme and political discourse: a study on the impact of internet meme as a tool in communicating political satire. Journal of Content, Community & Communication Amity School of Communication 6. Cited by: Introduction.
  • B. Lee and Y. S. Choi (2021) Graph based network with contextualized representations of turns in dialogue. arXiv preprint arXiv:2109.04008. Cited by: Related Work.
  • J. Li, D. Ji, F. Li, M. Zhang, and Y. Liu (2020a) Hitrans: a transformer-based context-and speaker-sensitive model for emotion detection in conversations. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 4190–4200. Cited by: Related Work.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2015) A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055. Cited by: 1st item.
  • W. Li, C. Gao, G. Niu, X. Xiao, H. Liu, J. Liu, H. Wu, and H. Wang (2020b) Unimo: towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409. Cited by: Meme Retrieval.
  • C. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023. Cited by: Experimental Results.
  • P. Mazaré, S. Humeau, M. Raison, and A. Bordes (2018) Training millions of personalized dialogue agents. arXiv preprint arXiv:1809.01984. Cited by: 2nd item.
  • S. Minaee, M. Minaei, and A. Abdolrashidi (2021) Deep-emotion: facial expression recognition using attentional convolutional network. Sensors 21 (9), pp. 3046. Cited by: Related Work.
  • N. Mostafazadeh, C. Brockett, B. Dolan, M. Galley, J. Gao, G. P. Spithourakis, and L. Vanderwende (2017) Image-grounded conversations: multimodal context for natural question and response generation. arXiv preprint arXiv:1701.08251. Cited by: Related Work.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: 1st item.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: Meme Retrieval.
  • H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2018) Towards empathetic open-domain conversation models: a new benchmark and dataset. arXiv preprint arXiv:1811.00207. Cited by: Related Work.
  • K. Shuster, S. Humeau, A. Bordes, and J. Weston (2018) Image chat: engaging grounded conversations. arXiv preprint arXiv:1811.00945. Cited by: Related Work.
  • M. Tan and Q. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: Text Response Modeling.
  • J. Yang, D. She, and M. Sun (2017) Joint image emotion classification and distribution learning via deep convolutional neural network. In IJCAI, pp. 3266–3272. Cited by: Related Work.
  • X. Zang, L. Liu, M. Wang, Y. Song, H. Zhang, and J. Chen (2021) PhotoChat: a human-human dialogue dataset with photo sharing behavior for joint image-text modeling. arXiv preprint arXiv:2108.01453. Cited by: 2nd item, Related Work.