End-to-end dialogue systems, based on neural architectures like bidirectional LSTMs or Memory Networks Sukhbaatar et al. (2015) and trained directly on dialogue logs by gradient descent, have shown promising performance in multiple contexts Wen et al. (2016); Serban et al. (2016); Bordes et al. (2016). One of their main advantages is that they can rely on large sources of existing dialogues to learn to cover various domains without requiring any expert knowledge. The flip side, however, is that they exhibit limited engagement, especially in chit-chat settings: they lack consistency and do not leverage proactive engagement strategies the way (even partially) scripted chatbots do.
Zhang et al. (2018) introduced the Persona-chat dataset to cope with this issue. The dataset consists of dialogues between pairs of agents, each with a text profile, or persona, attached to it. As shown in their paper, conditioning an end-to-end system on a given persona improves the engagement of a dialogue agent. This paves the way to truly end-to-end personalized chatbots, because personas, being short texts, could easily be edited by most users. However, Persona-chat was created using an artificial data collection mechanism based on Mechanical Turk. As a result, neither the dialogues nor the personas are fully representative of real user-bot interactions, and the coverage of the dataset remains limited, with slightly more than 1k different personas.
In this paper, we build a very large-scale persona-based dialogue dataset using conversations previously extracted from Reddit (https://www.reddit.com/r/datasets/comments/3bxlg7/). With simple heuristics, we create a corpus of over 5 million personas spanning more than 700 million conversations. We train persona-based end-to-end dialogue models on this dataset. These models outperform their counterparts that do not have access to personas, confirming the results of Zhang et al. (2018). In addition, the coverage of our dataset appears very good, since pre-training on it also leads to state-of-the-art results on the Persona-chat dataset.
2 Related work
With the rise of end-to-end dialogue systems, personalized trained systems have started to appear. Li et al. (2016) proposed to learn latent variables representing each speaker’s bias or personality in a dialogue model. Other classic strategies include extracting explicit variables from structured knowledge bases or other symbolic sources, as in Ghazvininejad et al. (2017); Joshi et al. (2017); Young et al. (2017). Still, in the context of personal chatbots, it might be more desirable to condition on data that can be generated and interpreted by the users themselves, such as text, rather than to rely on knowledge base facts that might not exist for everyone or for a great variety of situations. Persona-chat Zhang et al. (2018) recently introduced a dataset of conversations revolving around human habits and preferences. In their experiments, they showed that conditioning on a text description of each speaker’s habits, their persona, improved dialogue modeling.
In this paper, we use a pre-existing Reddit data dump as a data source. Reddit is a massive online message board. Dodge et al. (2015) used it to assess the chit-chat qualities of generic dialogue models. Yang et al. (2018) used response prediction on Reddit as an auxiliary task to improve prediction performance on natural language inference problems.
3 Building a dataset of millions of persona-based dialogues
Our goal is to learn to predict responses based on a persona for a large variety of personas. To that end, we build a dataset of examples of the following form using data from Reddit:
Persona: [“I like sport”, “I work a lot”]
Context: “I love running.”
Response: “Me too! But only on weekends.”
The persona is a set of sentences representing the personality of the responding agent, the context is the utterance that it responds to, and the response is the answer to be predicted.
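The structure of these examples can be sketched as a small record type; the field names below are illustrative only, not part of the original pipeline.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PersonaExample:
    """One training example: a persona, a context utterance, and the response to predict."""
    persona: List[str]  # sentences representing the personality of the responding agent
    context: str        # the utterance being responded to
    response: str       # the answer to be predicted

# The example from the text, as a record.
example = PersonaExample(
    persona=["I like sport", "I work a lot"],
    context="I love running.",
    response="Me too! But only on weekends.",
)
```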
3.1 Preprocessing
As in Dodge et al. (2015), we use a pre-existing dump of Reddit that consists of 1.7 billion comments. We tokenize sentences by padding all special characters with a space and splitting on whitespace characters. We create a dictionary containing the 250k most frequent tokens. We truncate comments that are longer than 100 tokens.
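The preprocessing described above can be sketched as follows; this is a minimal illustration of the stated steps (space-padding special characters, whitespace splitting, truncation at 100 tokens, a 250k-token dictionary), not the original pipeline.

```python
import re
from collections import Counter

MAX_LEN = 100         # comments longer than this are truncated
VOCAB_SIZE = 250_000  # keep the most frequent tokens

def tokenize(text, max_len=MAX_LEN):
    # Pad every special (non-alphanumeric, non-space) character with spaces,
    # then split on whitespace and truncate.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    return text.split()[:max_len]

def build_dictionary(comments, vocab_size=VOCAB_SIZE):
    # Keep the `vocab_size` most frequent tokens across all comments.
    counts = Counter(tok for c in comments for tok in tokenize(c))
    return [tok for tok, _ in counts.most_common(vocab_size)]
```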
3.2 Persona extraction
We construct the persona of a user by gathering all the comments they wrote, splitting them into sentences, and selecting the sentences that satisfy the following rules: (i) the sentence must contain between 4 and 20 words or punctuation marks, (ii) it must contain either the word I or my, (iii) it must contain at least one verb, and (iv) it must contain at least one noun, pronoun or adjective.
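These four rules can be sketched as a filter. In the paper, part-of-speech tags come from Spacy; here we simply assume coarse tags (e.g. VERB, NOUN, PRON, ADJ) are supplied alongside the tokens.

```python
def satisfies_persona_rules(tokens, pos_tags):
    """Check the four persona extraction rules on one tokenized sentence.

    `pos_tags` holds one coarse part-of-speech tag per token (in the paper
    these come from Spacy; here they are assumed given).
    """
    # (i) between 4 and 20 words or punctuation marks
    if not 4 <= len(tokens) <= 20:
        return False
    # (ii) contains the word "I" or "my"
    if not any(t.lower() in ("i", "my") for t in tokens):
        return False
    # (iii) at least one verb
    if "VERB" not in pos_tags:
        return False
    # (iv) at least one noun, pronoun or adjective
    if not any(tag in ("NOUN", "PROPN", "PRON", "ADJ") for tag in pos_tags):
        return False
    return True
```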
To handle the quantity of data involved, we limit the size of a persona to N sentences for each user. We compare four different setups for persona creation. In the rules setup, we select up to N random sentences that satisfy the rules above. In the rules + classifier setup, we filter with the rules, then score the resulting sentences using a bag-of-words classifier that is trained to discriminate Persona-chat persona sentences from random comments. We manually tune a threshold on the score in order to select sentences. If there are more than N eligible persona sentences for a given user, we keep the highest-scored ones. In the random from user setup, we randomly select sentences uttered by the user while keeping the sentence length requirement above (we ignore the other rules). The random from dataset baseline uses random sentences from the dataset, which do not necessarily come from the same user. This last setup serves as a control to verify that any gain in prediction accuracy is due to the user-specific information contained in personas.
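The two rule-based setups can be sketched as below; `score_fn` stands in for the bag-of-words classifier score (hypothetical here), `threshold` for the manually tuned cutoff, and `n` for the maximum persona size.

```python
import random

def select_persona(sentences, setup, score_fn=None, threshold=0.5, n=100):
    """Pick up to `n` persona sentences from a user's rule-passing sentences.

    `score_fn` is a stand-in for the bag-of-words classifier score
    (higher = more persona-like); the real classifier is trained against
    Persona-chat persona sentences.
    """
    if setup == "rules":
        # Up to n random sentences among those satisfying the rules.
        return random.sample(sentences, min(n, len(sentences)))
    if setup == "rules+classifier":
        # Keep sentences above the threshold, highest-scored first.
        eligible = [s for s in sentences if score_fn(s) >= threshold]
        eligible.sort(key=score_fn, reverse=True)
        return eligible[:n]
    raise ValueError(f"unknown setup: {setup}")

# Demo with sentence length as a toy score.
demo = select_persona(["aa", "b", "cccc"], "rules+classifier",
                      score_fn=len, threshold=2, n=1)
```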
In the example at the beginning of this section, the response is clearly consistent with the persona. There may not always be such an obvious relationship between the two: the discussion topic may not be covered by the persona, a single user may write contradictory statements, and due to errors in the extraction process, some persona sentences may not represent a general trait of the user (e.g. I am feeling happy today).
3.3 Dataset creation
We take each pair of successive comments in a thread to form the context and response of an example. The persona corresponding to the response is extracted using one of the methods of Section 3.2. We split the dataset randomly into training, validation and test sets. The validation and test sets contain 50k examples each. We extract personas using training data only, so that test set responses cannot be contained explicitly in the persona.
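A minimal sketch of this pairing step, assuming a thread is given as an ordered list of (user, comment) pairs and that personas have already been extracted into a lookup table:

```python
def thread_to_examples(thread, persona_lookup):
    """Turn a thread [(user, comment), ...] of successive comments into
    (persona, context, response) examples.

    `persona_lookup` maps a user to their extracted persona sentences
    (an empty list when no persona was selected for that user).
    """
    examples = []
    for (_, context), (responder, response) in zip(thread, thread[1:]):
        persona = persona_lookup.get(responder, [])
        examples.append((persona, context, response))
    return examples
```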
In total, we select personas covering 4.6m users in the rule-based setups and 7.2m users in the random setups. This is a sizable fraction of the 13.2m total users of the dataset; depending on the persona selection setup, between 97% and 99.4% of the training set examples are linked to a persona.
4 End-to-end dialogue models
We model dialogue by next utterance retrieval Lowe et al. (2016), where a response is picked among a set of candidates rather than generated.
4.1 Architecture
To predict a response, we encode the context and the persona separately, then combine them using a 1-hop memory network with a residual connection, using the context as query and the set of persona sentences as memory. We also encode all candidate responses and compute the dot product between each candidate representation and the joint representation of the context and the persona. The predicted response is the candidate that maximizes the dot product.
We train by passing all the dot products through a softmax and maximizing the log-likelihood of the correct responses. We use mini-batches of training examples in which, for each example, the responses of all the other examples of the same batch serve as negative responses.
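A minimal numpy sketch of these two pieces: a 1-hop memory read over the persona sentence vectors with a residual connection, and the in-batch-negatives softmax loss over dot products. The softmax attention over the memory and the vector shapes are assumptions; the real model operates on learned encoder outputs.

```python
import numpy as np

def memory_attend(context_vec, persona_vecs):
    """1-hop memory network with a residual connection.

    The context vector (shape (d,)) is the query; the persona sentence
    vectors (shape (m, d)) form the memory. The attended memory read is
    added back to the context (residual connection)."""
    attn = np.exp(persona_vecs @ context_vec)
    attn /= attn.sum()                      # softmax attention weights
    return context_vec + attn @ persona_vecs

def batch_loss(joint_reps, response_reps):
    """In-batch negatives: for each example i, response i is correct and
    all other responses of the batch are negatives (shapes (B, d))."""
    logits = joint_reps @ response_reps.T         # (B, B) dot products
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # NLL of correct responses

# Demos on tiny hand-set values.
out = memory_attend(np.ones(3), np.eye(3))
loss = batch_loss(np.eye(2), np.eye(2))
```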
4.2 Context and response encoders
Both context and response encoders share the same architecture and word embeddings but have different weights in the subsequent layers. We train three different encoder architectures.
The bag-of-words encoder applies two linear projections separated by a non-linearity to the word embeddings. We then sum the resulting representations across all positions in the sentence and divide the result by √n, where n is the length of the sequence.
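A numpy sketch of such a bag-of-words encoder; the choice of tanh as the non-linearity is an assumption.

```python
import numpy as np

def bow_encode(embeddings, W1, b1, W2, b2):
    """Bag-of-words sentence encoder: two linear projections separated by
    a non-linearity, summed over positions and scaled by 1/sqrt(n).

    `embeddings` has shape (n, d) for a sentence of n tokens."""
    h = np.tanh(embeddings @ W1 + b1)  # first projection + non-linearity (tanh assumed)
    h = h @ W2 + b2                    # second projection
    n = embeddings.shape[0]
    return h.sum(axis=0) / np.sqrt(n)  # sum over positions, scale by sqrt(length)

# Demo with identity projections on a 4-token sentence of all-ones embeddings.
vec = bow_encode(np.ones((4, 2)), np.eye(2), np.zeros(2), np.eye(2), np.zeros(2))
```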
The LSTM encoder applies a 2-layer bidirectional LSTM. We use the last hidden state as the encoded sentence.
The Transformer is a variation of an End-to-end Memory Network Sukhbaatar et al. (2015) introduced by Vaswani et al. (2017). Based solely on attention mechanisms, it exhibited state-of-the-art performance on next utterance retrieval tasks in dialogues Yang et al. (2018). Here we use only its encoding module. We subsequently average the resulting representation across all positions in the sentence, yielding a fixed-size representation.
4.3 Persona encoder
The persona encoder encodes each persona sentence separately. It relies on the same word embeddings as the context encoder and applies a linear layer on top of them. We then sum the representations across the sentence.
We deliberately choose a simpler architecture than for the other encoders for performance reasons, as the number of personas encoded for each batch is an order of magnitude greater than the number of training examples. Most personas are short sentences; we therefore expect a bag-of-words representation to encode them well.
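The persona encoder thus reduces to a single linear layer and a sum; a minimal sketch:

```python
import numpy as np

def encode_persona_sentence(word_embeddings, W, b):
    """Encode one persona sentence: a single linear layer applied to the
    (shared) word embeddings, then summed across the sentence.

    `word_embeddings` has shape (n_tokens, d)."""
    return (word_embeddings @ W + b).sum(axis=0)

# Demo: identity projection on a 3-token sentence of all-ones embeddings.
v = encode_persona_sentence(np.ones((3, 2)), np.eye(2), np.zeros(2))
```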
Table 2: Examples of how the persona affects the predicted answer.

| Context (Persona) | Predicted Answer |
| --- | --- |
| Where do you come from? (I was born in London.) | I’m from London, studying in Scotland. |
| Where do you come from? (I was born in New York.) | I’m from New York. |
| What do you do? (I am a doctor.) | I am a sleep and respiratory therapist. |
| What do you do? (I am an engineer.) | I am a software developer. |
5 Experiments
We train models on the persona-based dialogue dataset described in Section 3.3 and evaluate their accuracy both on the original task and when transferring to Persona-chat.
5.1 Experimental details
We optimize network parameters using Adamax on mini-batches of size 512. We initialize embeddings with FastText word vectors and fine-tune them during training.
LSTMs use a hidden size of 150; we concatenate the last hidden states for both directions and both layers, resulting in a final representation of size 600. Transformer architectures on Reddit use 4 layers with a hidden size of 300 and 6 attention heads, resulting in a final representation of size 300. We use Spacy for part-of-speech tagging in order to verify the persona extraction rules. We distribute training by splitting each batch across 8 GPUs; we stop training after 1 full epoch, which takes about 3 days.
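For reference, the hyperparameters stated in this section can be grouped into a single configuration (an illustrative grouping, not the authors' actual config format):

```python
# Hyperparameters as stated in the experimental details (illustrative grouping).
CONFIG = {
    "optimizer": "Adamax",
    "batch_size": 512,
    "embedding_init": "FastText",  # embeddings are fine-tuned during training
    "lstm": {"hidden_size": 150, "layers": 2, "bidirectional": True},  # -> 600-dim output
    "transformer": {"layers": 4, "hidden_size": 300, "attention_heads": 6},  # -> 300-dim output
    "num_gpus": 8,
    "epochs": 1,
}
```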
We use the revised version of the Persona-chat dataset, in which the personas have been rephrased, making the task harder. Since the dataset contains only a few thousand samples, we had to reduce the architecture to avoid overfitting for the models trained purely on Persona-chat: 2 layers, 2 attention heads, a dropout of 0.2 and word embeddings of 300 units yield the highest accuracy on the validation set.
As a basic baseline, we use an information retrieval (IR) system that ranks candidate responses according to a TF-IDF weighted exact-match similarity with the context alone.
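A minimal pure-Python version of such a TF-IDF exact-match ranker (illustrative only; the actual baseline implementation is not described further):

```python
import math
from collections import Counter

def tfidf_rank(context, candidates):
    """Rank candidate responses by TF-IDF weighted exact-match overlap
    with the context: only tokens appearing verbatim in the context
    contribute, weighted by term frequency times inverse document
    frequency over the candidate set."""
    docs = [c.lower().split() for c in candidates]
    ctx_tokens = set(context.lower().split())
    n = len(docs)
    df = Counter(tok for d in docs for tok in set(d))       # document frequencies
    idf = {t: math.log(n / df[t]) for t in df}
    def score(doc):
        tf = Counter(doc)
        return sum(tf[t] * idf.get(t, 0.0) for t in ctx_tokens)
    return sorted(candidates, key=lambda c: score(c.lower().split()), reverse=True)

# Demo: only the first candidate shares a token with the context.
ranked = tfidf_rank("love running", ["running is great", "i hate rain", "nice weather today"])
```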
Impact of personas
We report the accuracy of the different architectures on the Reddit task in Table 1. Conditioning on personas improves prediction performance regardless of the encoder architecture. Table 2 gives some examples of how the persona affects the predicted answer.
Table 3: Accuracy for different persona extraction setups.

| Max. persona size | Extraction setup | Accuracy (%) |
| --- | --- | --- |
| 20 | rules + classifier | 70.7 |
| 100 | rules + classifier | 72.5 |
| 100 | random from user | 73.8 |
| 100 | random from dataset | 66.9 |
Influence of the persona extraction
In Table 3, we report results for several persona extraction setups. The rules setup improves the results somewhat; however, adding the persona classifier actually degrades them. A possible interpretation is that the persona classifier is trained only on the revised Persona-chat personas, and that this selection might be too narrow and lack diversity. Increasing the maximum persona size also improves prediction performance.
| Model | Persona-chat (hits@1) | Reddit |
| --- | --- | --- |
| Zhang et al. (2018) | 35.4 | – |
We compare the performance of transformer models trained on Reddit and on Persona-chat on both datasets and report results in Table 4. This architecture provides a strong improvement over the results of Zhang et al. (2018), jumping from 35.4% hits@1 to 42.1%. Pretraining the model on Reddit and then fine-tuning it on Persona-chat pushes this score to 60.7%, largely improving the state of the art. As expected, fine-tuning on Persona-chat reduces performance on Reddit. However, testing the model trained on Reddit directly on Persona-chat, without fine-tuning, yields a very low result. This could be a consequence of a discrepancy between the styles of personas in the two datasets.
6 Conclusion
This paper shows how to create a very large dataset for persona-based dialogue. We show that training models to align answers both with the persona of their author and with the context improves prediction performance. The trained models show promising coverage, as exhibited by the state-of-the-art transfer results on the Persona-chat dataset. As pretraining leads to a considerable improvement in performance, future work could fine-tune this model for various dialogue systems. Future work may also explore more advanced strategies for selecting a limited number of persona sentences per user while maximizing prediction performance.
- Bordes et al. (2016) Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.
- Dodge et al. (2015) Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. 2015. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931.
- Ghazvininejad et al. (2017) Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2017. A knowledge-grounded neural conversation model. CoRR, abs/1702.01932.
- Joshi et al. (2017) Chaitanya K Joshi, Fei Mi, and Boi Faltings. 2017. Personalization in goal-oriented dialog. arXiv preprint arXiv:1706.07503.
- Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. CoRR, abs/1603.06155.
- Lowe et al. (2016) Ryan Lowe, Iulian V Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. On the evaluation of dialogue systems with next utterance classification. arXiv preprint arXiv:1605.05414.
- Serban et al. (2016) Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, pages 3776–3784.
- Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Proceedings of NIPS.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.
- Wen et al. (2016) Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.
- Yang et al. (2018) Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. CoRR, abs/1804.07754.
- Young et al. (2017) Tom Young, Erik Cambria, Iti Chaturvedi, Minlie Huang, Hao Zhou, and Subham Biswas. 2017. Augmenting end-to-end dialog systems with commonsense knowledge. arXiv preprint arXiv:1709.05453.
- Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? CoRR, abs/1801.07243.