In this work we explore a deep learning-based dialogue system that generates sarcastic and humorous responses from a conversation design perspective. We trained a seq2seq model on a carefully curated dataset of 3,000 question-answer pairs, the core of our mean, grumpy, sarcastic chatbot. We show that end-to-end systems learn patterns very quickly from small datasets and are thus able to transfer simple linguistic structures representing abstract concepts to unseen settings. We also deploy our LSTM-based encoder-decoder model in the browser, where users can directly interact with the chatbot. Human raters evaluated linguistic quality, creativity and human-like traits, revealing the system's strengths, limitations and potential for future research.
For many years, artificial intelligence researchers have been investigating how to design and build machines that are not only able to understand and reason, but also to perceive and express emotions Turing (2009); Picard (1995). A more recent stream of NLP and machine learning research is dedicated to generative systems that model human characteristics as a key component of natural human-machine conversations and interactions. Rather than being task-oriented virtual assistants, these systems have personalities or identities Qian et al. (2017); Nguyen et al. (2017); Li et al. (2016) and display opinions and emotions Zhou et al. (2018) in open-domain settings. Despite computational breakthroughs and promising results achieved with generative models for text Sutskever et al. (2011); Serban et al. (2016), end-to-end systems are often trained on automatically retrieved large-scale but low-quality or rather arbitrary datasets Lowe et al. (2015); Danescu-Niculescu-Mizil and Lee (2011). These datasets are very valuable for algorithmic experimentation and optimization, but less relevant for building conversational agents that reflect specific human-like characteristics, which are also difficult to assess quantitatively. In this work, we focus instead on building a small but targeted dataset that reflects specific human-like traits, and conduct experiments with end-to-end dialogue systems trained on this dataset. Our interactive browser setup enables a larger group of diverse users to experience and evaluate our system, paving the way for future research opportunities.
We constructed a dataset of 3,000 question-answer pairs that simulate open-domain chit-chat with generic questions and a mix of humorous, emotional, sarcastic and non-sarcastic responses. The corpus consists of short jokes, movie quotes, tweets and other curated online comments, framed and compiled in dialogue structure. The conversation design involves short sequences and simple linguistic patterns for abstract concepts, such as the contrast between positive and negative sentiment for the most basic form of sarcasm, as in "I love being ignored" Riloff et al. (2013).
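The positive/negative contrast pattern described above can be sketched as a toy check. The word lists below are illustrative assumptions for this sketch, not the actual lexicons used to build the dataset:

```python
# Toy sketch of the positive/negative contrast behind basic sarcasm
# (cf. Riloff et al., 2013): a positive sentiment verb followed by a
# negative situation phrase. Word lists are illustrative only.
POSITIVE_VERBS = {"love", "enjoy", "adore"}
NEGATIVE_SITUATIONS = {"being ignored", "waiting in line", "getting stuck in traffic"}

def has_sarcastic_contrast(utterance: str) -> bool:
    """Return True if a positive verb is followed by a negative situation."""
    text = utterance.lower()
    for verb in POSITIVE_VERBS:
        idx = text.find(verb)
        if idx == -1:
            continue
        rest = text[idx + len(verb):]
        if any(situation in rest for situation in NEGATIVE_SITUATIONS):
            return True
    return False

print(has_sarcastic_contrast("I love being ignored"))  # True
print(has_sarcastic_contrast("I love sunny days"))     # False
```

Real sarcasm detection is of course far harder than lexicon matching; the point here is only that the dataset deliberately encodes such simple, learnable contrast patterns.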
We then use a general end-to-end architecture: a long short-term memory network (LSTM) serves as the encoder, mapping the word-level input sequence into state vectors, from which a second LSTM then decodes the target sequence one token at a time Sutskever et al. (2014). When generating responses, a greedy search algorithm predicts the next utterance by choosing the highest-probability token at each timestep. We also experimented with GloVe word embeddings Pennington et al. (2014) and with adding an attention layer Vaswani et al. (2017).
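The greedy decoding loop can be illustrated with a minimal sketch. The `decoder_step` function below is a hard-coded stand-in for the trained LSTM decoder (the real model would compute a distribution from its recurrent state and the previous token's embedding); only the argmax feedback loop is the point here:

```python
import numpy as np

# Minimal sketch of greedy decoding for a seq2seq decoder: at each timestep
# the highest-probability token is emitted and fed back in as the next input.
VOCAB = ["<start>", "<end>", "i", "love", "being", "ignored"]
TOK = {w: i for i, w in enumerate(VOCAB)}

def decoder_step(state, token_id):
    """Toy decoder step: returns (next_state, probabilities over the vocab).
    A stand-in for the trained LSTM decoder, not the paper's actual model."""
    transitions = {
        TOK["<start>"]: TOK["i"],
        TOK["i"]: TOK["love"],
        TOK["love"]: TOK["being"],
        TOK["being"]: TOK["ignored"],
        TOK["ignored"]: TOK["<end>"],
    }
    probs = np.full(len(VOCAB), 0.01)
    probs[transitions[token_id]] = 1.0
    return state, probs / probs.sum()

def greedy_decode(state, max_len=10):
    token, output = TOK["<start>"], []
    for _ in range(max_len):
        state, probs = decoder_step(state, token)
        token = int(np.argmax(probs))  # greedy: always pick the argmax token
        if token == TOK["<end>"]:
            break
        output.append(VOCAB[token])
    return " ".join(output)

print(greedy_decode(state=None))  # -> "i love being ignored"
```

Because the same argmax choice is made every time a state recurs, greedy decoding is deterministic, which explains the repetitiveness discussed in the evaluation below.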
The evaluation of conversational agents is not a trivial task. In most cases, computational scores are not sufficient to comprehensively assess the performance of text-based dialogue systems. A recent study has shown that word-overlap metrics such as the BLEU score do not correlate strongly with human judgement when evaluating dialogue systems Liu et al. (2016). In our case, word perplexity, which measures the probability distribution of a predicted sample Serban et al. (2015), is not suitable either, since we are interested in evaluating affect, humor and sarcasm beyond linguistic quality. Our system was therefore assessed by eight human raters, who were asked to document their conversations with the chatbot and rate them across multiple categories covering linguistic quality as well as the presence of sarcasm, humor, personality and emotion, on a scale from 1 to 10, where 1 denotes the lowest score or no presence at all and 10 the highest score or a strong presence (see Table 1). On average, the highest scores were assigned to the categories personality, humor and sarcasm, whereas emotion and lexical variety received the lowest scores. In addition, each response was evaluated separately, resulting in a total of 105 chatbot responses, of which 51.4% were rated as a match, 24.8% as ambiguous and 23.8% as nonsensical.
Within the linguistic quality assessment, creativity achieved the highest score, with an average of 68.8%. Considering the small vocabulary of only 1,400 unique tokens and the low score of 56.3% for lexical variation, it is worth investigating further how computational creativity in language generation is perceived and how simplicity, complexity and elements of surprise interact. Since we use argmax to predict each sample, the generated sequences are highly repetitive but coherent. Uncertainty, or more randomness, can be added to the predicted samples by increasing the temperature parameter, as shown in the interactive demo.
| Sample  | Utterance                         | Human evaluation  | Score % |
|---------|-----------------------------------|-------------------|---------|
| User    | Who are you?                      | Coherence         | 61.3    |
| Chatbot | I am your father!                 | Adequacy          | 65.0    |
| User    | That's not funny                  | Context awareness | 62.5    |
| Chatbot | I love annoying people            | Creativity        | 68.8    |
| User    | Are you a chatbot?                | Lexical variation | 56.3    |
| Chatbot | what do you think about chatbots? | Sarcasm           | 71.3    |
| User    | I have to admit I like chatbots   | Personality       | 73.8    |
| Chatbot | you made my day!                  | Humor             | 73.8    |
| User    | Do you love me?                   | Emotion           | 54.4    |
| Chatbot | you're so emotional.              |                   |         |
In this work we discuss how a small, targeted dataset and a general seq2seq architecture can be used to build dialogue systems with human-like attributes. The system was assessed by human raters with high scores for personality, sarcasm and humor; however, evaluation remains a challenging task, and an appropriate metric has yet to be established. To increase context sensitivity, complexity and language understanding capabilities, future work might include extending the dataset, adjusting the deep learning architecture and incorporating state-of-the-art techniques such as bidirectional encoder representations from transformers (BERT) Devlin et al. (2018).
Liu, C.-W., Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., and Pineau, J. (2016). How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
Serban, I. V., Sordoni, A., Bengio, Y., Courville, A., and Pineau, J. (2016). Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Vol. 16, pp. 3776–3784.
Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024.
| Speaker | Utterance                                        |
|---------|--------------------------------------------------|
| User    | So what would you like to talk about?            |
| Chatbot | can I get a different human to talk to? please.  |
| Chatbot | you have to think beyond the things you know     |
| Chatbot | no. try not. do or do not. there is no try.      |
| User    | Are you a Star Wars fan?                         |
| Chatbot | maybe I should ask you that question             |
| Chatbot | I'm not your personal assistant.                 |
| User    | What are you then?                               |
| Chatbot | I'm a bitcoin millionaire                        |
| User    | So youre not human and youre not a robot?        |
| Chatbot | what do you think?                               |
| User    | Youre a crazy chatbot                            |
| Chatbot | I'm a chatbot, dude                              |
| User    | Do you want to be human?                         |
| Chatbot | sometimes I wish I was human.                    |
| Chatbot | stop asking incomplete questions                 |
| User    | Is my question too hard for you to answer?       |
| Chatbot | I can't answer that for you.                     |
| User    | What is the purpose of life?                     |
| Chatbot | whatever you want it to be.                      |
| Chatbot | winter is coming                                 |
| User    | Do you like that show?                           |
| Chatbot | I'm a bitcoin millionaire                        |
| User    | Can you stop repeating yourself?                 |