is one of the most challenging problems in artificial intelligence. In the past few years, neural generative models have achieved surprising performance on dialogue generation thanks to sequence-to-sequence (Seq2seq) models [20, 26, 25, 17]. Unfortunately, basic Seq2seq models apply a single language model to all users without considering user personality. A few enhanced Seq2seq models have been proposed to customize a unique language model for each user. For example, when user descriptions are available, such persona information can be explicitly inserted into the response generation process [9, 27]. Alternatively, when the conversations collected from each user are sufficient, user information can be implicitly represented as a user embedding to boost decoding in Seq2seq models. However, neither of these requirements is usually satisfied in practice. Collecting non-synthetic persona descriptions requires considerable human effort. Moreover, it is often the case that at most hundreds of conversation sentences can be obtained from each user, which is insufficient to learn a good user representation.
The recently proposed model agnostic meta-learning (MAML) has pointed out a promising direction for building personalized conversational generative models that are applicable in real situations. MAML finds a set of initial parameters for all tasks, then fine-tunes the model with each task’s training samples to obtain the corresponding task-specific model. For our problem, we can first train a public language model on all users, and then treat fine-tuning the model parameters for each person as a task, so that the idea of MAML can be used to adapt the model to a given user with very little conversation data from that user [10, 14, 11]. The MAML-based methods achieve much better performance than all the Seq2seq-based methods in the few-shot setting.
Despite its apparent success in few-shot learning scenarios, MAML still has limitations. A conversation system builds a function that maps input utterances to output utterances, where the function is determined by its model structure and parameters. Personalizing a conversation system means customizing such a function for each user. MAML can customize a function for each user by fine-tuning. However, to optimize the function, MAML can only search for the optimal parameters for each user while keeping the model structure fixed, which restrains the model capacity and can reach only a local optimum. Indeed, we observe that MAML tends to produce overly similar language models when applied to personalized conversation systems. Since all the language models are fine-tuned from the same initial parameters with a few samples, the parameters change only slightly during adaptation. We will illustrate this point in the experiment section.
To address the above issues, we propose the Customized Model Agnostic Meta-Learning algorithm (CMAML), which customizes language models with different structures under the MAML framework. For each user, the language model consists of three parts: a shared structure that learns the public language generation ability and the common characteristics among users; a private structure that models the unique characteristics of that user; and a gate that balances the information from the shared and private parts and generates the final outputs. The shared part and the gating part have the same structure with similar parameters for all users, while the private part starts from the same network and then differentiates into different structures according to the personal data of each user. Users with similar characteristics share part of their private structures; their training samples are reused to update the parameters in the overlapping structures. In this way, our algorithm optimizes the language model by searching over both the model structure and the parameters for each user to reach the global optimum. We will release the code upon acceptance.
In summary, our contributions are as follows:
We are the first to customize unique conversational language models for users using only a few personal training samples. Compared with MAML, the proposed CMAML can distinguish different users at the model structure level.
We propose a structure pruning algorithm that can adjust the network structure for better fitting the training data. We use this strategy to customize unique language models for users.
We conduct an empirical study on the impact factors on the MAML-based models, which may inspire the researchers in related fields.
Personalized Conversation Systems
Personalization is a hot topic in building conversation systems in the NLP field, as it makes a conversation system more human-like. If the persona data is sufficient, it is relatively easy to customize a unique language model for every user. One way is to use the persona descriptions of users to build language models. persona (persona) propose to calculate the attention of the current query over all the profile sentences, then use this attention to rewrite the generated replies. lixiang (lixiang) solve the problem in a proactive chatting manner: they use a keyword in the profile as the “search query” to retrieve a proper reply that contains the keyword. Another way is to use persona data to train a user embedding and incorporate this embedding into the generation-based model to enhance the personality. lijiwei (lijiwei) propose a Speaker model that concatenates the user embedding with the input word in the decoder and trains its parameters in the same way as word embeddings. Exploring (Exploring) extend the Speaker model by further incorporating the chatting history in the decoder. mo2018personalizing (mo2018personalizing) propose to first calculate the consistency score between a candidate reply and the user embedding, then use this score as a reward to encourage personalization in a reinforcement learning framework. However, in practice, sufficient training data or persona descriptions are not always available. Hence, in this paper, we discuss how to build personalized conversation systems in the few-shot setting.
In the few-shot setting, the most popular method is to build personalized conversation systems using the model agnostic meta-learning (MAML) framework. MAML first finds general initial parameters that capture the mutual features among all the tasks, then adapts the parameters to each task via fine-tuning with a few training samples. paml (paml) propose to regard each user as a task and endow the language models with personalization by fine-tuning on the personal data. Some works [14, 11] target multi-domain task-oriented systems rather than personalized conversation systems, and they regard each domain as a task. All of the above methods simply adapt MAML to their scenario, and the personalization relies entirely on fine-tuning, which is not enough to capture the unique characteristics of different tasks.
Meta-learning has received increasing attention recently. Its biggest attraction is its ability to adapt quickly to a new task using only a few training samples. There are three categories of meta-learning-based methods: metric-based [7, 23, 19, 18], model-based, and optimization-based [4, 15, 1]. The first two are used in classification tasks, while the third is model-agnostic. Therefore, we employ a typical optimization-based method, model agnostic meta-learning (MAML).
Since we have already discussed how to apply MAML to personalized conversation systems, we now introduce some MAML-based works for other tasks that also aim to model the uniqueness of each task. MAML provides a shared network with the same parameters as the initial model for all the tasks. To encourage task-specific adaptation, attentive (attentive) propose to use an attention mechanism over the features extracted from the meta-parameters. This method picks out user preferences from the features shared among all the users but fails to find specific features that are not included in the shared features. An alternative way of capturing the specific characteristics of each task is to conduct some operation on the shared network. Sun_2019_CVPR (Sun_2019_CVPR) propose to apply not only fine-tuning but also shifting and scaling on the shared network for each task. This method also does not work in the few-shot setting, and the shifting and scaling can only be applied when the shared network structure consists only of feed-forward layers.
In this section, we first provide an overview of the CMAML algorithm, as well as the basic language model structure of users. We then introduce the shared structure, private structure, gating structure, and the training methods of those three parts.
To boost personalization and customize a unique language model for each user, we propose the Customized Model Agnostic Meta-Learning (CMAML) algorithm. CMAML not only distinguishes different language models at the model structure level, but also trains all the models in a more sample-efficient way.
with LSTM cells for users’ language models. To personalize the generated replies, we keep the same encoder structure as the vanilla Seq2seq but change the basic generation cells in the decoder. The basic generation unit in the decoder consists of three cells with different functions: a long short-term memory (LSTM) cell, a multilayer perceptron (MLP) cell, and a gating cell. The LSTM cell aims to learn the basic language generation ability, so it is shared among all the users. The encoder is also shared among all the users, so we call the LSTM cell together with the encoder the “Shared Structure”. The MLP cell aims at modeling the unique characteristics of each user and is not influenced by the shared features. To encourage the personalization of different users, the MLP part starts from the same structure and then evolves into different structures during training to better fit the unique characteristics of each user. Each user has a unique MLP, so we call this part the “Private Structure”. The gating cell fuses the information from the LSTM and MLP cells. It takes their outputs and determines whether to use the word chosen by the basic language model or by the user’s specific wording habits, so we call it the “Gating Structure”.
The training of CMAML consists of three steps: pre-training, private network pruning, and partial ensemble training. The first step finds proper initial parameters for the whole model. The second step applies only to the private part: the private part of each user differentiates into a distinct structure to better fit the user’s characteristics. The third step jointly trains all three structures together.
Essentially, CMAML is an extension of MAML. CMAML follows the basic setting in MAML: each user is regarded as a task, and the training data of each user is split into a support set and a validation set. The difference between MAML and our proposed CMAML is that MAML uses the same structure with the same initial parameters to build language models for users, while CMAML can produce unique language models with different private structures for different users.
In the following part, we present the shared, private, and gating structures and their training strategies separately.
The shared structure is the same as a vanilla Seq2seq model with LSTM, where the parameters are shared among all users. At each time step during generation, the decoder feeds the input word and the last hidden state to the LSTM cell, and then outputs a distribution over the vocabulary.
We train the shared structure at the first and third training steps of CMAML, and its training strategy is the same as MAML’s. Meta-learning finds a set of common initial parameters for all tasks, which can quickly adapt to each task using a few training samples. In training, two main procedures are performed iteratively: meta-training and meta-testing. Meta-training fine-tunes the initial parameters into user-specific parameters with the support sets of several tasks; meta-testing updates the initial parameters with the loss calculated by the user-specific parameters on the validation sets of the same set of tasks. In testing, the learned initial parameters are fine-tuned on the data of a new task.
Specifically, at the very beginning, the initial parameters are randomly initialized. In the meta-training, we first sample a set of users, and then adapt the initial parameters on their support sets to get the user-specific parameters as,
In the meta-testing, we test the user-specific parameters on the validation sets and update the initial parameters with the obtained loss as,
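As a rough illustration of the meta-training and meta-testing procedures described above, the following sketch runs first-order MAML on a toy one-parameter regression problem. The toy tasks, the names `grad_mse`, `adapt`, and `meta_step`, the step sizes `alpha` and `beta`, and the first-order approximation of the meta-gradient are all assumptions made for illustration; they are not the language models or the exact update rules of this paper.

```python
# Toy first-order MAML: each "user" is a task y = a * x with its own slope a.
# The model is f_theta(x) = theta * x with a single shared parameter theta.

def grad_mse(theta, data):
    """Gradient of the mean squared error of theta * x against y."""
    return sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)

def adapt(theta, support, alpha):
    """Meta-training: one fine-tuning step on a user's support set."""
    return theta - alpha * grad_mse(theta, support)

def meta_step(theta, tasks, alpha, beta):
    """Meta-testing: update theta with the validation losses of adapted models.

    First-order approximation (an assumption of this sketch): the gradient is
    evaluated at the adapted parameters and applied to the initial parameters.
    """
    meta_grad = 0.0
    for support, validation in tasks:
        theta_u = adapt(theta, support, alpha)
        meta_grad += grad_mse(theta_u, validation)
    return theta - beta * meta_grad

# Two hypothetical users with slopes 1 and 3: (support set, validation set).
user_a = ([(1.0, 1.0), (2.0, 2.0)], [(1.0, 1.0), (2.0, 2.0)])
user_b = ([(1.0, 3.0), (2.0, 6.0)], [(1.0, 3.0), (2.0, 6.0)])

theta = 0.0
for _ in range(100):
    theta = meta_step(theta, [user_a, user_b], alpha=0.05, beta=0.05)

# Testing: fine-tune the learned initialization on one user's data.
theta_b = adapt(theta, user_b[0], alpha=0.05)
```

In this toy case the learned initialization settles between the two users' slopes, and a single adaptation step moves it only a short distance toward the target user, which mirrors the small parameter change discussed for MAML.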
The private structure is a part of the decoder. It is an MLP cell that takes word as input, then outputs a distribution over the vocabulary. Private structure participates in all three training steps of CMAML.
In this step, the private structure has not yet differentiated, so all the users share the same private structure. The training method is the same as that of the shared structure.
Private Network Pruning
After the pre-training, CMAML adjusts the structure of the private network for each user to fit their unique characteristics better. For each user, we keep only the part of the network that is useful for generating personalized utterances and abandon the part irrelevant to the user’s interest by pruning as follows.
First, we re-train the whole language model on the samples of each user with L1 regularization. L1 regularization makes the network sparser, so only the parameters that are beneficial for generating personalized sentences are kept. Note that we apply sparsification only to the private part.
Second, we apply a top-down strategy to prune the private network for each user. Since the private structure is a multilayer perceptron, pruning the structure amounts to selecting the edges between layers. Edge selection starts from the second layer from the top, since we need to preserve all the edges connected to the top (first) layer in order to predict the word distribution over the whole vocabulary. For each layer below the top one, we keep only the edges between that layer and the next lower layer whose weight exceeds a certain threshold. Then, in the lower layer, only the nodes connected to the selected edges remain available for the edge selection of the layers beneath.
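The edge selection just described can be sketched as follows. This is a minimal illustration under stated assumptions: the MLP is stored as a list of nested lists ordered from the top layer downward, `weights[k][i][j]` is the edge between node `i` of layer `k` and node `j` of the layer below it, and the comparison uses the absolute value of the weight. The function name `prune_top_down` and this data layout are hypothetical and are not taken from the authors' code.

```python
def prune_top_down(weights, threshold):
    """Return one boolean mask per weight matrix; True marks a kept edge."""
    # All edges connected to the top layer are preserved.
    masks = [[[True] * len(row) for row in weights[0]]]
    # Every node of the second layer therefore stays available.
    available = [True] * len(weights[0][0])
    for layer in weights[1:]:
        # Keep an edge only if its upper node is still available and its
        # magnitude exceeds the threshold.
        mask = [
            [available[i] and abs(w) > threshold for w in row]
            for i, row in enumerate(layer)
        ]
        masks.append(mask)
        # A lower node stays available only if some kept edge reaches it.
        available = [
            any(mask[i][j] for i in range(len(layer)))
            for j in range(len(layer[0]))
        ]
    return masks

# Hypothetical 4-layer network with 2, 2, 3, and 1 nodes from top to bottom.
example_weights = [
    [[0.5, 0.01], [0.02, 0.3]],
    [[0.2, 0.01, 0.0], [0.05, 0.03, 0.0]],
    [[0.9], [0.9], [0.9]],
]
example_masks = prune_top_down(example_weights, threshold=0.075)
```

Note that the two large weights attached to the second and third nodes of the third layer are dropped even though they exceed the threshold, because no kept edge reaches those nodes from above.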
Partial Joint Training
After the first two steps, every user has a unique private structure, and we jointly train all users’ structures in this step. In traditional MAML, every user’s model initialization is fine-tuned by all the users. Instead, we allow users to share a part of their private structures with others and call the shared part the overlap structure. The overlap structure is trained with the training samples of all the users involved in this part. Therefore, a user only cooperates with the users having common characteristics.
Concretely, we enable partial structure sharing among the users in the following way: we keep all the edges of the whole private structure, and any edge can be accessed by indexing. Each user fetches and updates the parameters of the edges related to itself. If an edge is fetched and updated by more than one user, that edge belongs to the overlap structure of the users involved.
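The indexing scheme above can be pictured with a small sketch. It assumes a global table of edge parameters keyed by hypothetical edge identifiers, a set of kept edges per user, and plain gradient steps with a made-up learning rate; `overlap_edges` and `apply_user_updates` are illustrative names, not part of the original method description.

```python
from collections import Counter

def overlap_edges(user_edges):
    """Edges that are fetched by more than one user."""
    counts = Counter(e for edges in user_edges.values() for e in set(edges))
    return {e for e, c in counts.items() if c > 1}

def apply_user_updates(params, user_edges, user_grads, lr):
    """Each user updates only the edges that belong to its own structure."""
    updated = dict(params)
    for user, edges in user_edges.items():
        for e in edges:
            updated[e] -= lr * user_grads[user].get(e, 0.0)
    return updated

# Hypothetical private structure with three edges and two users.
params = {"e1": 1.0, "e2": 1.0, "e3": 1.0}
user_edges = {"u1": {"e1", "e2"}, "u2": {"e2", "e3"}}
user_grads = {"u1": {"e1": 1.0, "e2": 1.0}, "u2": {"e2": 2.0, "e3": 1.0}}

shared = overlap_edges(user_edges)
new_params = apply_user_updates(params, user_edges, user_grads, lr=0.1)
```

Here the edge `e2` receives updates from both users, so it is trained on more data than `e1` or `e3`, which each see the samples of a single user.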
In the meta-training, we adapt as,
In the meta-testing, we update as,
We introduce the gating structure to balance the information from the shared structure and private structure and output the final distribution over the whole vocabulary as,
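The exact gating equation is not reproduced in this text, so the following is only one plausible form of such a fusion, shown for intuition: a scalar sigmoid gate that forms a convex combination of the shared and private word distributions. The function name `gated_distribution`, the scalar `gate_logit`, and the convex-combination form itself are assumptions and may differ from the formulation actually used in CMAML.

```python
import math

def gated_distribution(p_shared, p_private, gate_logit):
    """Mix two vocabulary distributions with a sigmoid gate (assumed form)."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    # A gate near 1 favors the shared model, a gate near 0 the private one.
    return [gate * s + (1.0 - gate) * p for s, p in zip(p_shared, p_private)]

# Hypothetical two-word vocabulary; a logit of 0 weights both parts equally.
mixed = gated_distribution([0.8, 0.2], [0.2, 0.8], gate_logit=0.0)
```

Because both inputs are probability distributions and the gate lies in (0, 1), the mixed output is again a valid distribution over the vocabulary.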
We train the gating structure at the first and third training steps of CMAML, together with the shared structure.
The overall procedure of CMAML is illustrated in Algorithm 2. During the testing, a new user first finds his/her private structure via pruning algorithm, then trains all three parts together with his/her samples to obtain a unique language model.
Compared with MAML, CMAML is superior in two aspects. First, CMAML customizes unique language models for different users, while MAML produces language models with minor divergence. In MAML, all personal features are grafted onto the general features learned from the shared initial parameters, so some unique features, such as distinctive wording habits, cannot be memorized in the language model through fine-tuning alone. In contrast, CMAML has a private structure that can extract special features from personal samples only. Second, the data is utilized more fully for training each language model in CMAML. To adapt to each user, MAML uses the parameter initialization trained with all the users’ data. In CMAML, only users with the same characteristics share an overlap structure and learn from the training data of related users.
Training of CMAML consists of three steps. In both pre-training (the first step) and partial ensemble training (the third step), the whole model (including shared, private, and gating parts) is trained together. The pruning (the second step) finds a proper private structure for each user via adaptation on the user’s personal samples, and it does not apply the gradient descent on any parameters. Hence, all the parameters in the third step inherit the values from the pruned result of the first step.
As shown in Figure 2, for the shared and gating structure, gradients are updated in the same way as MAML: they collect the loss from all users, calculate the total gradient and distribute the gradient to update all users’ parameters. For the private structure, a user’s samples will only be applied to the structures related to itself, and update the parameters of related structures.
The loss function for all the language models is the negative log-likelihood of generating the response given the input query as,
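For concreteness, a negative log-likelihood of this kind can be computed from the probabilities that the decoder assigns to each reference token of the response. The sketch below assumes those per-token probabilities are already available as a plain list; `sequence_nll` is a hypothetical helper name, and summing over tokens (rather than averaging) is an assumption about the normalization.

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of a response given its per-token probabilities.

    token_probs[t] is the probability the model assigns to the t-th reference
    token, conditioned on the query and the previous reference tokens.
    """
    return -sum(math.log(p) for p in token_probs)

# Hypothetical two-token response with probabilities 0.5 and 0.25.
loss = sequence_nll([0.5, 0.25])
```

The loss is zero only when every reference token receives probability one, and it grows as the model spreads probability onto other words.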
|Method||Perplexity||C Score||BLEU||Distinct-1||Diff Score||Score|
We evaluate our approach on Persona-chat, a personalized conversation dataset. We follow the same meta-set split as in PAML to divide the dataset into 3 parts for training, validation, and evaluation, with 1137, 99 and 100 users respectively. Each user has 8.1 unique conversation samples and 121.1 utterances on average.
and use a 4-layer MLP for the private structure. The dimensions of the word embedding, hidden state, and MLP output are all set to 300. In CMAML, we pre-train the model for 10 epochs and re-train each language model for 5 steps to prune the private network. The L1 weight in the re-training stage is 0.001, and the threshold is 0.075. We follow the hyperparameter settings of the optimizer in PAML. The validation set is used for early stopping based on perplexity.
We compare our models with Seq2seq-based models, fine-tuning models, and meta-learning models. All the methods use the same dimension of word embedding and hidden state for a fair comparison.
Seq2seq. Standard Seq2seq model with attention.
Seq2seq-F. Seq2seq model with fine-tuning on the user’s personal corpora.
Speaker. Seq2seq-based model with user embeddings incorporated in the LSTM cell.
Speaker-F. Speaker model with fine-tuning on the user’s personalized corpora.
PAML. A meta-learning-based model that simply applies MAML to the conversation system scenario.
ATAML. ATAML extends MAML by picking out task-specific parameters from the shared model. We treat the encoder and decoder of the Transformer as shared parameters and task-specific parameters, respectively.
CMAML. Our CMAML variant in which the private structure is grafted onto the shared structure. It takes the hidden state from the LSTM as the input of the MLP.
CMAML. Our full CMAML model, in which the private structure is located beside the shared structure.
We performed a human evaluation, but the agreement score among annotators did not meet the requirement for valid annotation. Hence, we measure the replies produced by each method with automatic metrics in terms of personality, quality, and diversity.
C-Score. C-Score measures personality. It uses a pre-trained Natural Language Inference (NLI) model to evaluate the consistency of the conversation with the persona description.
BLEU & Perplexity. BLEU and Perplexity measure quality. BLEU measures the word overlap between the reference and the generated sentence. Perplexity is computed from the negative log-likelihood that the model assigns to a sentence; lower values are better.
Distinct. Distinct-1 measures diversity. It calculates the ratio of distinct 1-grams in the generated sentences.
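Two of the metrics above are simple enough to sketch directly. The code below assumes whitespace tokenization, computes Distinct-1 over a whole set of generated replies (corpus level), and uses the common definition of perplexity as the exponential of the average per-token negative log-likelihood. The names `distinct_1` and `perplexity` are illustrative, and the evaluation scripts used for the reported numbers may differ in tokenization and aggregation.

```python
import math

def distinct_1(sentences):
    """Ratio of distinct unigrams to all unigrams in the generated replies."""
    tokens = [tok for sentence in sentences for tok in sentence.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def perplexity(token_probs):
    """Exponential of the average negative log-likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs)
    return math.exp(nll / len(token_probs))

# Hypothetical generated replies and per-token probabilities.
d1 = distinct_1(["i like cats", "i like dogs"])
ppl = perplexity([0.5, 0.25])
```

In the example, four of the six generated tokens are distinct, so repeated generic phrases pull Distinct-1 down.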
|Perplexity||C Score||BLEU||Perplexity||C Score||BLEU|
|Method||Similar Users||Dissimilar Users|
|Perplexity||C Score||BLEU||Perplexity||C Score||BLEU|
First, the non-meta-learning methods are baselines that do not work well in the few-shot setting. The basic Seq2seq model performs poorly on C score, BLEU, and Distinct-1 since it ignores the user personality. Seq2seq-F improves the C score and BLEU of Seq2seq, showing that fine-tuning changes the language models for each user. It should be noted that the Speaker model performs worse than the Seq2seq model on perplexity, C score, and BLEU. A possible cause is that Speaker needs a large quantity of data to train the user embeddings and is not well trained in the few-shot setting.
Second, meta-learning methods perform well in the few-shot setting, and our model is the best among them. We can see that MAML finds a better parameter initialization to adapt to different users. ATAML achieves a better C score and Distinct-1 than PAML, but its perplexity and BLEU are poor. This is because ATAML takes task-specific features into account but simply separates these features from the shared model. As a result, ATAML can capture more user personality but cannot learn well the general features extracted by the shared model. Our two models, CMAML and CMAML, perform best among all the meta-learning methods in terms of personality, quality, and diversity. CMAML improves the perplexity, C score, BLEU and Distinct-1 compared to PAML, indicating the importance of dividing the model into the shared structure and private structure. It also improves the perplexity, BLEU, and Distinct-1 compared to ATAML, showing that using an isolated unit is a better choice for capturing private features. CMAML performs better on perplexity and Distinct-1 than CMAML, because the private structure can be seen as an additional unit of LSTM with as input, CMAML uses and as input, so it can capture more private features than CMAML.
Different from fine-tuning-based methods and MAML-based methods, CMAML changes both the parameters and the structure for each user. As a result, it has a higher structural diversity and a larger model change after adaptation.
We define the model difference between two models as the Euclidean distance between their parameters, normalized by the number of parameters in the model.
To prove that CMAML increases the structural diversity of different users, we define the average model difference among all the user pairs as the Diff Score of each method. It evaluates the structural diversity of different users. We can see that fine-tuning methods and PAML have low structure diversity because they only change parameters for different users. Although ATAML splits task-specific features, its shared structure is fixed when adapting to each user, so its Diff Score is also low. Our model has different private structures for different users, resulting in high structural diversity.
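The model difference and the Diff Score can be sketched as follows, taking the verbal definitions above at face value. The sketch assumes that every model is flattened into a parameter list of the same length (for pruned private structures, one option is to treat removed edges as zeros) and that the Euclidean distance is divided by the parameter count after the square root. The names `model_difference` and `diff_score` are illustrative, and the precise normalization used for the reported numbers is not confirmed by this text.

```python
import math
from itertools import combinations

def model_difference(params_a, params_b):
    """Euclidean distance between two parameter vectors, per parameter."""
    if len(params_a) != len(params_b):
        raise ValueError("models must have the same number of parameters")
    squared = sum((a - b) ** 2 for a, b in zip(params_a, params_b))
    return math.sqrt(squared) / len(params_a)

def diff_score(models):
    """Average model difference over all pairs of user models."""
    pairs = list(combinations(models, 2))
    return sum(model_difference(a, b) for a, b in pairs) / len(pairs)

# Hypothetical flattened parameters of three users' models.
pair_difference = model_difference([0.0, 0.0], [3.0, 4.0])
score = diff_score([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
```

Two users whose models are identical contribute zero to the average, so a method that barely moves away from a shared initialization ends up with a low score.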
To show that CMAML changes the language model more than other methods do, we define the average model difference before and after fine-tuning among all the users as the Score of each method. We can observe that the fine-tuning in PAML does not lead to a larger model change than the fine-tuning in the Seq2seq model, and it achieves a relatively low Score among all the methods. This indicates that the fine-tuning in MAML makes only slight changes from the initial parameters, so MAML tends to produce similar models for all tasks. In contrast, our model has the highest Score, indicating that the private part increases the model differences among different users.
Since MAML is an efficient method for few-shot learning and is trained on multiple different tasks, we are concerned with “how few” the training data should be and “how different” the tasks should be. So in this section, we explore two impact factors of the competing methods: the quantity of each user’s training data and user similarity.
To explore the effect of data quantity, we evaluate all the methods in a 10-shot and 15-shot setting where the quantity of each user’s conversations is less than 10 and 15 respectively. Table 2 shows the results.
For non-meta-learning methods, the performance improves as the quantity of training data increases. Meta-learning methods perform better in few-shot settings. The BLEU of CMAML improves as the data increases from 10-shot to 15-shot, but the C score does not change much. We can see that CMAML acquires basic language generation ability when the training data is insufficient. The BLEU of CMAML on the full dataset in Table 1 is the same as that in the 15-shot setting, but the C score of CMAML improves compared to the 15-shot results. We can see that when the training data is insufficient, CMAML focuses on learning the basic language generation ability, while when the training data is sufficient, CMAML starts to learn the user personality. This reflects that the shared structure and the private structure play different roles in the training stage.
To explore the effect of user similarity, we construct two datasets based on Persona-chat, which consist of similar users and dissimilar users respectively. We divide the conversations of one user into multiple “users” to get similar users, and this dataset uses 10% of the data of Persona-chat. For a fair comparison, we randomly sample 10% of the dialogues of each user in Persona-chat to get a dataset with dissimilar users. We evaluate the competing methods on the two datasets. Table 3 shows the results.
The performance of all the methods is close in the similar-users setting, which suggests that meta-learning methods are not suitable for similar tasks. In the dissimilar-users setting, our model performs best on C score and BLEU. We conclude that user similarity influences the performance of our model. It is noteworthy that, compared to the dissimilar-users setting, the BLEU in the similar-users setting is high but the C score is low. A possible cause is that the model is easy to train but struggles to extract personality features when the users are similar.
In this paper, we address the problem of personalized conversation systems. We propose CMAML, which is able to customize unique conversational models for different users. CMAML introduces a private network for each user’s language model, whose structure will evolve during the training to better fit the characteristics of this user. The private structure will only be trained on the corpora of the corresponding user and his/her similar users. In this way, CMAML not only keeps the advantages of MAML, but also improves the model’s ability to capture the unique characteristics of each user. The experiment results show that CMAML achieves the best performance in terms of personalization, quality, and diversity. We also measure the model differences among all the users, and the results prove that our method can produce quite dissimilar language models for different users.
- (2016) Learning to learn by gradient descent by gradient descent. In NeurIPS, pp. 3981–3989.
- (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- (2018) SMASH: one-shot model architecture search through hypernetworks. In ICLR.
- (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135.
- (1971) Measuring nominal scale agreement among many raters. Psychological Bulletin 76 (5), pp. 378.
- (2018) Attentive task-agnostic meta-learning for few-shot text classification.
- Siamese neural networks for one-shot image recognition. In ICML deep learning workshop.
- (2016) A persona-based neural conversation model. In ACL, pp. 994–1003.
- (2016) StalemateBreaker: A proactive content-introducing approach to automatic human-computer conversation. In IJCAI, pp. 2845–2851.
- (2019) Personalizing dialogue agents via meta-learning. In ACL, pp. 5454–5459.
- Meta-learning for low-resource natural language generation in task-oriented dialogue systems. In IJCAI, pp. 3151–3157.
- (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318.
- Glove: global vectors for word representation. In EMNLP, pp. 1532–1543.
- (2019) Domain adaptive dialog generation via meta learning. In ACL, pp. 2639–2649.
- (2016) Optimization as a model for few-shot learning.
- Meta-learning with memory-augmented neural networks. In International conference on machine learning, pp. 1842–1850.
- (2016) A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.
- (2017) Prototypical networks for few-shot learning. In NeurIPS, pp. 4077–4087.
- (2018) Learning to compare: relation network for few-shot learning. In CVPR, pp. 1199–1208.
- (2014) Sequence to sequence learning with neural networks. In NeurIPS, pp. 3104–3112.
- (2017) Attention is all you need. In NeurIPS, pp. 5998–6008.
- (2016) Diverse beam search: decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
- (2016) Matching networks for one shot learning. In NeurIPS, pp. 3630–3638.
- (2019) Function space particle optimization for bayesian neural networks. In ICLR.
- Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. arXiv preprint arXiv:1508.01755.
- (2015) Semantically conditioned lstm-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.
- (2018) Personalizing dialogue agents: i have a dog, do you have pets too?. In ACL, pp. 2204–2213.
- (2019) Fast context adaptation via meta-learning. In ICML, pp. 7693–7702.