Designing dialogue systems: A mean, grumpy, sarcastic chatbot in the browser

In this work we explore a deep learning-based dialogue system that generates sarcastic and humorous responses from a conversation design perspective. We trained a seq2seq model on a carefully curated dataset of 3000 question-answering pairs, the core of our mean, grumpy, sarcastic chatbot. We show that end-to-end systems learn patterns very quickly from small datasets and are thus able to transfer simple linguistic structures representing abstract concepts to unseen settings. We also deploy our LSTM-based encoder-decoder model in the browser, where users can directly interact with the chatbot. Human raters evaluated linguistic quality, creativity and human-like traits, revealing the system's strengths, limitations and potential for future research.





1 Introduction

For many years, artificial intelligence researchers have been investigating how to design and build machines that are able not only to understand and reason, but also to perceive and express emotions Turing (2009); Picard (1995). A more recent stream of NLP and machine learning research is dedicated to generative systems that model human characteristics as a key component for natural human-machine conversations and interactions. Rather than being task-oriented virtual assistants, those systems have personalities or identities Qian et al. (2017); Nguyen et al. (2017); Li et al. (2016) and display opinions and emotions Zhou et al. (2018) in open-domain settings. Despite computational breakthroughs and promising results achieved with generative models for text Sutskever et al. (2011); Serban et al. (2016), end-to-end systems are oftentimes trained on automatically retrieved large-scale but low-quality or rather arbitrary datasets Lowe et al. (2015); Danescu-Niculescu-Mizil and Lee (2011). These datasets are very valuable for algorithmic experimentation and optimization, but less relevant for building conversational agents that reflect specific human-like characteristics, which are also difficult to assess quantitatively. In this work, we focus instead on building a small but targeted dataset that reflects specific human-like traits, and conduct experiments with end-to-end dialogue systems trained on this dataset. Our interactive browser setup enables a larger group of diverse users to experience and evaluate our system, paving the way for future research opportunities.

2 Experimental setup

We constructed a dataset of 3000 question-answering pairs that simulate an open-domain chit-chat with generic questions and a mix of humorous, emotional, sarcastic and non-sarcastic responses. The corpus consists of short jokes, movie quotes, tweets and other curated online comments, framed and compiled in dialogue structure. The conversation design involves short sequences and simple linguistic patterns for abstract concepts, such as the contrast between positive and negative sentiment for the most basic form of sarcasm, as in "I love being ignored" Riloff et al. (2013). We then use a general end-to-end architecture: a long short-term memory (LSTM) network serves as the encoder, mapping the word-level input sequence into state vectors, from which a second LSTM decodes the target sequence one token at a time Sutskever et al. (2014). When generating responses, a greedy search selects the token with the highest probability at each timestep. We also experimented with GloVe word embeddings Pennington et al. (2014) and with adding an attention layer Vaswani et al. (2017); however, neither had a significant qualitative impact on the predicted sequences. Due to the small dataset and vocabulary size and the recurring patterns within the target sequences, the general seq2seq model was able to learn and memorize the data in a short amount of training time. To facilitate user interaction and evaluation, we used TensorFlow.js, a JavaScript library for deploying machine learning models in the browser.
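The greedy decoding step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `greedy_decode`, `toy_step`, the toy vocabulary and the transition table are hypothetical stand-ins for the trained LSTM decoder, which would return a probability distribution over the vocabulary together with an updated state at each timestep.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical decoder interface: (previous token id, state) -> (probabilities, new state).
StepFn = Callable[[int, Any], Tuple[Sequence[float], Any]]


def greedy_decode(step_fn: StepFn, initial_state: Any, start_id: int,
                  end_id: int, max_len: int = 20) -> List[int]:
    """Pick the highest-probability token at every timestep until <end> or max_len."""
    tokens: List[int] = []
    state = initial_state
    prev_id = start_id
    for _ in range(max_len):
        probs, state = step_fn(prev_id, state)
        # argmax over the vocabulary
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == end_id:
            break
        tokens.append(next_id)
        prev_id = next_id
    return tokens


# Toy stand-in for a trained decoder (assumption, for illustration only).
VOCAB = ["<start>", "<end>", "i", "love", "annoying", "people"]
NEXT = {0: 2, 2: 3, 3: 4, 4: 5, 5: 1}  # most likely successor of each token


def toy_step(prev_id: int, state: Any) -> Tuple[List[float], Any]:
    probs = [0.02] * len(VOCAB)
    probs[NEXT[prev_id]] = 0.9
    return probs, state


reply_ids = greedy_decode(toy_step, None, start_id=0, end_id=1)
reply = " ".join(VOCAB[i] for i in reply_ids)
```

Because the choice at each timestep is deterministic, the same input always yields the same reply, which is consistent with the repetitive but coherent outputs reported for the system.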


2.1 Evaluation

The evaluation of conversational agents is not a trivial task. In most cases computational scores are not sufficient to comprehensively assess the performance of text-based dialogue systems. A recent study has shown that word-overlap metrics such as the BLEU score and human judgement do not correlate strongly when evaluating dialogue systems Liu et al. (2016). In our case, word perplexity for measuring the probability distribution of a predicted sample Serban et al. (2015) is not suitable either, since we are interested in evaluating affect, humor and sarcasm in addition to linguistic quality. Thus, our system was assessed by eight human raters, who were asked to document their conversations with the chatbot and to rate them in multiple categories covering linguistic quality as well as the presence of sarcasm, humor, personality and emotion. Ratings were given on a scale from 1 to 10, where 1 denotes the lowest score or no presence at all and 10 the highest score or a strong presence (see Table 1). The highest scores on average were assigned to the categories personality, humor and sarcasm, whereas emotion and lexical variety received the lowest scores. In addition, each response was evaluated separately, for a total of 105 chatbot responses, of which 51.4% were rated as a match, 24.8% as ambiguous and 23.8% as nonsensical.
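As a consistency check on the response-level percentages, the short sketch below recomputes them from raw counts. The counts of 54, 26 and 25 are not reported in the text; they are inferred here as the only whole-number split of 105 responses that reproduces the stated percentages after rounding, and the helper name `share` is hypothetical.

```python
TOTAL_RESPONSES = 105  # reported in the text

# Inferred counts (assumption): not stated in the paper.
inferred_counts = {"match": 54, "ambiguous": 26, "nonsensical": 25}


def share(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    return round(100 * count / total, 1)


shares = {label: share(n, TOTAL_RESPONSES) for label, n in inferred_counts.items()}
```

Under this assumption, roughly half of the responses were judged a match and the remainder split almost evenly between ambiguous and nonsensical.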

2.2 Machine creativity

Within the linguistic quality assessment, creativity achieved the highest score with an average of 68.8%. Considering the small vocabulary size of a total of 1400 unique tokens and the low score of 56.3% for lexical variation, it is worth investigating further how computational creativity in language generation is perceived and how simplicity, complexity and elements of surprise play together. Since we are using argmax for predicting a sample, the sequences generated are highly repetitive, but coherent. Uncertainty or more randomness can be added to the predicted samples by increasing the temperature parameter, as shown in the interactive demo.
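The contrast between argmax decoding and temperature-based sampling can be illustrated with a small Python sketch. The text does not specify how the demo applies temperature, so this shows one common formulation as an assumption: the log-probabilities are divided by the temperature and renormalized before sampling. The function names `apply_temperature` and `sample_token` are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def apply_temperature(probs: Sequence[float], temperature: float) -> List[float]:
    """Rescale a distribution: T < 1 sharpens it, T > 1 flattens it."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(probs: Sequence[float], temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Return a token id; a temperature of 0 falls back to argmax (greedy)."""
    if temperature <= 0:
        return max(range(len(probs)), key=probs.__getitem__)
    rng = rng or random.Random()
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


next_token_probs = [0.7, 0.2, 0.1]            # toy decoder output (assumption)
flat = apply_temperature(next_token_probs, 2.0)    # more uniform, more surprising
sharp = apply_temperature(next_token_probs, 0.5)   # closer to argmax
greedy_choice = sample_token(next_token_probs, temperature=0)
```

With a higher temperature, less likely tokens are chosen more often, which trades some coherence for lexical variety.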

| Speaker | Sample utterance | Evaluation category | Score (%) |
|---|---|---|---|
| User | Who are you? | Coherence | 61.3 |
| Chatbot | I am your father! | Adequacy | 65.0 |
| User | That’s not funny | Context awareness | 62.5 |
| Chatbot | I love annoying people | Creativity | 68.8 |
| User | Are you a chatbot? | Lexical variation | 56.3 |
| Chatbot | what do you think about chatbots? | Sarcasm | 71.3 |
| User | I have to admit I like chatbots | Personality | 73.8 |
| Chatbot | you made my day! | Humor | 73.8 |
| User | Do you love me? | Emotion | 54.4 |
| Chatbot | you’re so emotional. | | |

Table 1: A sample conversation (left) and average scores (%) from the human evaluation of the chatbot conversations (right), with the highest scores for sarcasm, personality and humor.

3 Discussion and Future Work

In this work we discuss how a small, targeted dataset and a general seq2seq architecture can be used to build dialogue systems with human-like attributes. The system was assessed by human raters and received high scores for personality, sarcasm and humor; however, evaluation is a challenging task and an appropriate metric has yet to be established. To increase context sensitivity, complexity and language understanding capabilities, future work might include extending the dataset, adjusting the deep learning architecture and incorporating state-of-the-art techniques such as bidirectional encoder representations from transformers Devlin et al. (2018).


  • [1] C. Danescu-Niculescu-Mizil and L. Lee (2011) Chameleons in imagined conversations: a new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pp. 76–87. Cited by: §1.
  • [2] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.
  • [3] Cited by: §2.
  • [4] J. Li, M. Galley, C. Brockett, G. P. Spithourakis, J. Gao, and B. Dolan (2016) A Persona-Based Neural Conversation Model. External Links: arXiv:1603.06155v2 Cited by: §1.
  • [5] C. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023. Cited by: §2.1.
  • [6] R. Lowe, N. Pow, I. Serban, and J. Pineau (2015) The ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909. Cited by: §1.
  • [7] H. Nguyen, D. Morales, and T. Chin (2017) A neural chatbot with personality. Technical report Stanford University working paper. Cited by: §1.
  • [8] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §2.
  • [9] R. W. Picard (1995) Affective Computing. MIT press (321), pp. 1–16. External Links: arXiv:1011.1669v3, ISBN 0262161702, ISSN 09269630 Cited by: §1.
  • [10] Q. Qian, M. Huang, H. Zhao, J. Xu, and X. Zhu (2017) Assigning personality/identity to a chatting machine for coherent conversation generation. arXiv preprint arXiv:1706.02861. Cited by: §1.
  • [11] E. Riloff, A. Qadir, P. Surve, L. D. Silva, N. Gilbert, and R. Huang (2013) Sarcasm as Contrast between a Positive Sentiment and Negative Situation. In EMNLP, pp. 704–714. External Links: ISBN 9781937284978 Cited by: §2.
  • [12] I. V. Serban, R. Lowe, P. Henderson, L. Charlin, and J. Pineau (2015) A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742. Cited by: §2.1.
  • [13] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Vol. 16, pp. 3776–3784. Cited by: §1.
  • [14] I. Sutskever, J. Martens, and G. E. Hinton (2011) Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024. Cited by: §1.
  • [15] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §2.
  • [16] A. M. Turing (2009) Computing machinery and intelligence. In Parsing the Turing Test, pp. 23–65. Cited by: §1.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §2.
  • [18] H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu (2018) Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. Association for the Advancement of Artificial Intelligence. External Links: arXiv:1704.01074v4 Cited by: §1.


Sample 4
User So what would you like to talk about?
Chatbot can I get a different human to talk to? please.
User How rude!
Chatbot you’re good!
User I know
Chatbot you have to think beyond the things you know
User I try
Chatbot no. try not. do or do not. there is no try.
User Are you a Star Wars fan?
Chatbot maybe I should ask you that question
User do it
Chatbot I’m not your personal assistant.
User What are you then?
Chatbot I’m a bitcoin millionaire
User So youre not human and youre not a robot?
Chatbot what do you think?
User Youre a crazy chatbot
Chatbot I’m a chatbot, dude
User Do you want to be human?
Chatbot sometimes I wish I was human.
User Why?
Chatbot stop asking incomplete questions
User Is my question too hard for you to answer?
Chatbot I can’t answer that for you.
User What is the purpose of life?
Chatbot whatever you want it to be.
User Good answer
Chatbot winter is coming
User Do you like that show?
Chatbot I’m a bitcoin millionaire
User Can you stop repeating yourself?
Chatbot I can
User Thanks
Chatbot no worries.
Table 2: Sample conversation from the human evaluation task.