1 Introduction
Automated reply suggestions, or smart replies (SR), are increasingly common in many popular applications such as Gmail (2016), Skype (2017), Outlook (2018), LinkedIn (2017), and Facebook Messenger.
Given a message, the problem that SR solves is to suggest short and relevant responses that a person may select with a click, avoiding any typing. For example, for a message such as Want to meet up for lunch?, an SR system may suggest the three responses {Sure; No problem!; Ok}. While these are all relevant suggestions, they are semantically equivalent. In this paper, we consider how to diversify the suggestions, e.g., with {Sure; Sorry I can't; What time?}, without losing any relevance. Our hypothesis is that encompassing greater semantic variability and intrinsic diversity will lead to higher click-rates for suggestions.
Smart reply has been modeled as a sequence-to-sequence (S2S) process Li et al. (2016); Kannan et al. (2016); Vinyals and Le (2015), inspired by the success of S2S in machine translation. It has also been modeled as an Information Retrieval (IR) task Henderson et al. (2017), where replies are selected from a fixed list of responses using two parallel Matching networks to encode messages and replies in a common representation. Our production system uses such a Matching architecture.
There are several practical factors in favor of the Matching-IR approach. Production systems typically maintain a curated response set (to retain control over the feature and to prevent inappropriate responses), so they rarely require a generative model. Moreover, inference is efficient in the matching architecture since vectors for the fixed response set can be precomputed and hashed for fast lookup. Qualitatively, S2S also tends to generate generic, and sometimes incorrect, responses due to label and exposure bias. Solutions for S2S during training Wiseman and Rush (2016) and inference Li et al. (2016) have high overhead. Matching architectures, on the other hand, can incorporate a global normalization factor during training to mitigate this issue Sountsov and Sarawagi (2016). In practice, we found that the Matching model retrieves responses which are semantically very similar in lexical content and underlying intent, as shown in Table 1. This behavior is not surprising and even expected, since we optimize the model as a point estimate on golden message-reply (m-r) pairs. In fact, it illustrates the effectiveness of encoding similar intents in the common feature space. While this leads to individual responses being highly relevant, the model needs to diversify the responses to improve the overall relevance of the set by covering a wider variety of intents. We hypothesize that diversity would improve the click-rates in our production system; this is the main focus of this paper. We provide two baseline approaches, lexical clustering and maximal marginal relevance (MMR), for diversification in the Matching model.
Since we typically do not have multiple responses in one-on-one conversational data (and thus cannot train for multiple intents), we consider a generative Latent Variable Model (LVM) to learn the hidden intents from individual m-r pairs. Our key hypothesis is that intents can be encoded through a latent variable, which can then be utilized to generate diverse responses.
To this end, we propose the Matching-CVAE (MCVAE) architecture, which introduces a generative LVM on the Matching-IR model using the neural variational autoencoder (VAE) framework Kingma and Welling (2014). MCVAE is trained to generate the vector representation of the response conditioned on the input message and a stochastic latent variable. During inference, we sample responses for a message and use voting to rank the candidates. To reduce latency, we propose a constrained sampling strategy for MCVAE which makes variational inference feasible for production systems. We show that the Matching architecture maintains the relevance advantages and inference efficiency required for a production system, while the CVAE allows diversification of responses.
We first describe our current production model and diversification approaches. Next, we present our key contribution: Matching-CVAE. Finally, we report results from offline and online experiments, including production system performance.
2 Matching Model
Our training data consists of message-reply (m-r) pairs from one-on-one IM conversations^1. A parallel stack of embedding and bidirectional LSTM layers encodes the raw text of m and r, concatenating the last hidden states of the backward and forward recurrences as Φ_M(m) and Φ_R(r) (Figure 1). The encodings are trained to map to a common feature representation using the symmetric loss: a probabilistic measure of similarity as a normalized dot product in equation 1. We maximize this probability during training. (^1 Multi-user conversations were difficult to align reliably, given highly restricted access to preserve our users' privacy.)
Note that the denominator in the symmetric loss is different from a softmax (where the marginalization is usually over one set of terms) used to approximate P(m_i, r_i). Instead, it sums over each message w.r.t. all responses and vice-versa. This normalization (analogous to a Jaccard index) in both directions enforces stronger constraints for a dialog pair (Li et al. (2016) made a similar argument with a Mutual Information penalty during inference). Thus, it is more appropriate for a conversational model where the goal is conversation compatibility rather than content similarity. The symmetric loss improved the relevance of our model; we omit those results here to focus on diversity.

P(m_i, r_i) = e^{Φ_M(m_i)·Φ_R(r_i)} / ( Σ_j e^{Φ_M(m_i)·Φ_R(r_j)} + Σ_j e^{Φ_M(m_j)·Φ_R(r_i)} − e^{Φ_M(m_i)·Φ_R(r_i)} )   (1)

where Φ_M and Φ_R denote the message and reply encoders.
Ψ(m, r_i) = Φ_M(m)·Φ_R(r_i) + α · LM(r_i)   (2)
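The two-sided normalization of the symmetric loss can be sketched in a few lines of numpy over a training minibatch (a toy illustration with made-up variable names, not the production implementation):

```python
import numpy as np

def symmetric_loss(M, R):
    """Symmetric loss over a minibatch of message/reply encodings (eq. 1).

    M, R: (batch, dim) arrays of message and reply vectors.
    Unlike a softmax, which normalizes over one side only, the denominator
    for the gold pair (m_i, r_i) sums the scores of m_i against all replies
    AND of r_i against all messages, counting the gold pair once.
    """
    logits = M @ R.T                      # pairwise dot products
    exp = np.exp(logits - logits.max())   # stabilized exponentials
    gold = np.diag(exp)                   # scores of the gold pairs
    denom = exp.sum(axis=1) + exp.sum(axis=0) - gold
    return -np.log(gold / denom).mean()   # minimize negative log-probability
```

Aligned message/reply batches should score a lower loss than misaligned ones, which is the property the training objective exploits.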
During inference, we precompute the response vectors Φ_R(r_i) for a fixed response set R. We encode an input message as Φ_M(m) and find the nearest responses using the score Ψ(m, r_i) in equation 2, composed of the dot product of Φ_M(m) and Φ_R(r_i) and a language-model penalty LM(r_i) (we train an LSTM language model on the training data). The penalty is intended to suppress very specific responses, similar to Henderson et al. (2017). The parameter α is tuned separately on an evaluation set. We de-duplicate the candidates and select the top three as suggested replies. The training and inference graph is shown in Figure 1.
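The scoring step amounts to one matrix-vector product over the precomputed response vectors plus the weighted LM penalty. A minimal sketch (function and parameter names are illustrative, not the production API):

```python
import numpy as np

def suggest(msg_vec, response_vecs, lm_scores, alpha=0.5, top=3):
    """Rank the fixed response set for one encoded message (eq. 2).

    response_vecs: (N, dim) precomputed encodings of the response set.
    lm_scores: (N,) language-model scores; alpha (tuned on an evaluation
    set) weights this penalty to suppress overly specific responses.
    Returns the indices and scores of the top-ranked responses.
    """
    scores = response_vecs @ msg_vec + alpha * lm_scores
    order = np.argsort(-scores)           # descending by combined score
    return order[:top], scores[order[:top]]
```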
2.1 Response Diversification
The matching model by itself retrieves very similar responses, as shown in Table 1. Clearly, the responses need to be de-duplicated to improve the quality of suggestions. We present two baseline approaches to increase diversity.
Lexical Clustering (LC): Table 1 motivates the use of simple lexical rules for de-duplication. We cluster responses which differ only in punctuation (Thanks!, Thanks.), contractions (cannot: can't, okay: ok), synonyms (yeah, yes, ya), etc. We further refine the clusters by joining responses with a one-word edit distance between them (Thank you so much., Thank you very much.), except for negations. During inference, we de-duplicate candidates belonging to the same cluster.
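A simplified sketch of this clustering, with a toy normalization table standing in for the full rule set (the synonym list and the edit-distance refinement with its negation check are only hinted at here):

```python
import re

# Illustrative normalization table; the production rules are richer.
SYNONYMS = {"yeah": "yes", "ya": "yes", "okay": "ok", "cannot": "can't"}

def cluster_key(response: str) -> str:
    """Map a response to its lexical-cluster key."""
    text = response.lower()
    text = re.sub(r"[^\w\s']", "", text)           # drop punctuation
    tokens = [SYNONYMS.get(t, t) for t in text.split()]
    return " ".join(tokens)

def dedupe(candidates):
    """Keep the first candidate from each lexical cluster, in rank order."""
    seen, kept = set(), []
    for c in candidates:
        key = cluster_key(c)
        if key not in seen:
            seen.add(key)
            kept.append(c)
    return kept
```

For example, `dedupe(["Thanks!", "Thanks.", "Yeah", "yes", "Ok"])` keeps one member per cluster in rank order.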
Maximal Marginal Relevance (MMR): As a way to increase diversity in IR, Carbonell and Goldstein (1998) introduced the MMR criterion, which penalizes query-document similarity with inter-document similarity to rank candidates by marginal relevance.
In the context of SR, we apply the MMR principle as follows. First, we select the candidates r_i (with scores Ψ(m, r_i) and response vectors Φ_R(r_i)) using equation 2. Next, we compute the novelty (or marginal relevance) N(r_i) of each response with respect to the other candidates using equation 3. Finally, we re-rank the candidates using the MMR score computed in equation 4. Our MMR implementation is an approximation of the original (which is iterative); nevertheless, it performs the ranking in a single forward pass and thus is very efficient in terms of latency.
N(r_i) = 1 − max_{j ≠ i} Φ_R(r_i)·Φ_R(r_j)   (3)

MMR(r_i) = Ψ(m, r_i) + λ · N(r_i)   (4)
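The one-pass approximation can be sketched as follows, assuming the candidate vectors are roughly normalized so that dot products act as similarities (the exact novelty and weighting terms here are illustrative reconstructions of equations 3-4):

```python
import numpy as np

def mmr_rerank(scores, resp_vecs, lam=0.5):
    """One-pass MMR approximation (eqs. 3-4).

    scores: (k,) matching scores from eq. 2.
    resp_vecs: (k, dim) candidate response vectors.
    Novelty of each candidate is one minus its maximum similarity to the
    other candidates; lam weighs novelty against relevance. Unlike the
    original iterative MMR, everything is computed in a single pass.
    """
    sims = resp_vecs @ resp_vecs.T
    np.fill_diagonal(sims, -np.inf)            # exclude self-similarity
    novelty = 1.0 - sims.max(axis=1)           # eq. 3
    mmr = scores + lam * novelty               # eq. 4
    return np.argsort(-mmr)                    # new ranking, descending
```

With a large enough `lam`, a distinct candidate can outrank a near-duplicate of the top response even if its matching score is lower.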
Table 3 shows that LC and MMR are quite effective at reducing duplicates. We have also explored other clustering approaches using embeddings from unsupervised models, but they were not as effective as LC or MMR.
3 Matching-CVAE (MCVAE)
Neither MMR nor LC solves the core issue with diversification, i.e., learning to suggest diverse responses from individual m-r pairs. Privacy restrictions prevent any access to the underlying training data for explicit annotation and modeling of intents. Instead, we model the hidden intents in individual m-r pairs using a latent variable model (LVM) in MCVAE.
In MCVAE we generate a response vector conditioned on the message vector and a stochastic latent vector. The generated response vector is then used to select the corresponding raw response text.
MCVAE relies on two hypotheses. First, the encoded vectors are accurate distributional indexes for raw text. Second, the latent variable encodes intents (i.e. a manifold assumption that similar intents have the same latent structure). Thus, samples from different latent vectors can be used to generate and select diverse responses within the MatchingIR framework.
We start with a base Matching model which encodes an m-r pair as Φ_M(m) and Φ_R(r). We assume a stochastic vector z which encodes a latent intent, such that the response encoding is generated conditioned on Φ_M(m) and z. The purpose of learning the LVM is to maximize the probability of the observations by marginalizing over z. This is typically infeasible in a high-dimensional space. Instead, the variational framework seeks to learn a posterior q(z | m, r) and a generating function g to directly approximate the marginals. In the neural variational framework Kingma and Welling (2014) and its conditional variant CVAE Sohn et al. (2015), the functionals q and g are approximated using non-linear neural layers^4, and trained using Stochastic Gradient Variational Bayes (SGVB). (^4 Also referred to as inference/recognition and reconstruction networks, as together they resemble an autoencoder.)
We use two feed-forward layers for the posterior q and the decoder g, as shown in equations 5 and 7. Here, [·;·] denotes the concatenation of two vectors. To sample from q, we use the reparameterization trick of Kingma and Welling (2014). First, we encode the input vectors as [μ; log σ²], interpreted as the mean and variance of the posterior (equation 5). Next, we transform to a multivariate Gaussian form by sampling ε ~ N(0, I) and applying the linear transformation in equation 6. We reconstruct the response vector as r̂ with the decoder g (equation 7). Figure 2 shows the complete MCVAE architecture. The network is trained with the evidence lower bound objective (ELBO), obtained by conditioning the standard VAE loss on the message (equation 8). The first term can be computed in closed form as it is the KL divergence between two Normal distributions. The second term denotes the reconstruction loss for the response vector, which we compute using the symmetric loss from equation 1 on the training minibatch. As is standard in SGVB, we use only one sample per item during training.

[μ; log σ²] = f_q([Φ_M(m); Φ_R(r)])   (5)

z = μ + σ ⊙ ε,   ε ~ N(0, I)   (6)

r̂ = g([Φ_M(m); z])   (7)

L = −KL( q(z | m, r) ∥ p(z) ) + E_{q(z|m,r)}[ log P(Φ_R(r) | Φ_M(m), z) ]   (8)

r* = argmax_{r_i ∈ R} Φ_R(r_i) · r̂   (9)
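The posterior, reparameterization, and decoder steps of equations 5-7, along with the closed-form KL term of equation 8, can be sketched with toy feed-forward layers (random untrained weights and toy dimensions; the real model uses trained layers on 600-dimensional encodings):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, Z = 8, 4  # toy sizes for illustration only

# Illustrative feed-forward parameters for the posterior q(z|m,r)
# and the decoder g(m,z); in the real model these are trained layers.
W_q = rng.normal(scale=0.1, size=(2 * DIM, 2 * Z))
W_d = rng.normal(scale=0.1, size=(DIM + Z, DIM))

def posterior(m, r):
    """Eq. 5: encode [Phi(m); Phi(r)] as mean and log-variance of q."""
    h = np.concatenate([m, r]) @ W_q
    return h[:Z], h[Z:]                       # mu, log_var

def sample_z(mu, log_var):
    """Eq. 6: reparameterization z = mu + sigma * eps, eps ~ N(0, I)."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=Z)

def decode(m, z):
    """Eq. 7: reconstruct the response vector from [Phi(m); z]."""
    return np.tanh(np.concatenate([m, z]) @ W_d)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q || N(0, I)), the first term of the ELBO (eq. 8)."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
```

The reconstruction term of the ELBO would reuse the symmetric loss of equation 1 over the minibatch, with one latent sample per item.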
3.1 Inference in CVAE
During inference, we precompute the response vectors and scores as before. However, instead of matching the message vector with the response vectors, we find the nearest neighbors of the generated response vector r̂, scored with Φ_R(r_i)·r̂. Next, we use a sampling and voting strategy to rank the response candidates.
Sampling Responses: To generate r̂, we first sample the latent vector z, concatenate it with the message encoding, and generate r̂ with the decoder from equation 7. The sampling process is shown in Figure 2 (right).
Voting Responses: The predicted response for a given input and a latent sample is given by equation 9. In each sample, one candidate response (the argmax) gets the winning vote. We generate a large number of such samples and use the total votes accumulated by the responses as a proxy to estimate their likelihood. Finally, we use the voting score to rank the candidates in MCVAE.
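The sample-and-vote procedure can be sketched as follows, with the sampler and decoder passed in as placeholders (the function arguments are illustrative, not the production interface):

```python
import numpy as np

def rank_by_votes(m, response_vecs, n_samples, sample_fn, decode_fn):
    """Sample-and-vote inference (eq. 9).

    sample_fn() draws a latent vector z; decode_fn(m, z) generates a
    response vector r_hat. Each sample casts one vote for its nearest
    response (the argmax of the dot products); total votes approximate
    the likelihood of each response and determine the ranking.
    """
    votes = np.zeros(len(response_vecs), dtype=int)
    for _ in range(n_samples):
        r_hat = decode_fn(m, sample_fn())
        votes[np.argmax(response_vecs @ r_hat)] += 1   # winning vote
    return np.argsort(-votes), votes
```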
3.2 Constrained sampling in CVAE
To deploy MCVAE in production, we needed to solve two issues. First, generating a large number of samples significantly increased the latency compared to Matching. Second, reducing the number of samples leads to higher variance, where MCVAE can sometimes select diverse but irrelevant responses (compared to Matching, which selects relevant but duplicate responses). We propose a constrained sampling strategy which solves both problems by allowing a better trade-off between diversity and relevance at a reduced cost.
We note that the latency bottleneck is essentially the large dot product with the precomputed response vectors (our response set size is ~30k) in equation 9. Here, the number of multiplications for s samples is s × 30k × 600 (with an encoding dimension of 600). However, during the sampling process, only a few relevant candidates actually get a vote. Thus, we can reduce this cost by preselecting the top k candidates using the Matching score (eq. 2) and then pruning the response vectors to the selected candidates. This constrains the dot product in each sampling step to only k vectors, and reduces the number of multiplications for s samples to s × k × 600, where k ≪ 30k.
By pruning the response set, we are able to fit all the sampling vectors within a single matrix, and apply the entire sampling and voting step as matrix operations in one forward pass through the network. This leads to an extremely efficient graph and allows us to deploy the model in production.
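A sketch of constrained sampling, where the vote in each sample only touches the k preselected vectors; here the samples are decoded with an explicit stack for clarity, whereas the production graph batches the whole step as matrix operations in one forward pass (names are illustrative):

```python
import numpy as np

def constrained_sample_and_vote(m, response_vecs, match_scores, k,
                                Z_samples, decode_fn):
    """Constrained sampling (Sec. 3.2): prune to the top-k matches, then vote.

    Z_samples: (s, z_dim) latent samples drawn up front.
    decode_fn(m, z) generates a response vector. The dot product per
    sample now touches k vectors instead of the full ~30k response set.
    Returns the pruned candidate indices ranked by votes.
    """
    top_k = np.argsort(-match_scores)[:k]                   # preselect via eq. 2
    pruned = response_vecs[top_k]                           # (k, dim) pruned set
    R_hat = np.stack([decode_fn(m, z) for z in Z_samples])  # (s, dim) samples
    winners = np.argmax(R_hat @ pruned.T, axis=1)           # votes over k only
    votes = np.bincount(winners, minlength=k)
    return top_k[np.argsort(-votes)]
```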
Sampling with MMR: As seen in Table 1, the candidates selected using the Matching score can have very low diversity to begin with, which can reduce the effectiveness of MCVAE. To diversify the initial candidates, we can use our MMR ranking approach as follows. We first select the top candidates using Matching and compute the MMR scores from equation 4. Next, we use the MMR scores to select the top k diverse responses for use in constrained sampling in MCVAE.
All the inference components (Matching, MMR, and constrained sampling), when applied together, require just one forward pass through the network. Thus, we can not only trade off diversity and relevance but also control the latency at the same time. Constrained sampling was critical for deploying to production systems.
4 Experiments and Results
Our current production model in Skype is a parallel Matching stack (Figure 1) with an embedding layer and 2 BiLSTM layers for both messages and replies. The token vocabulary is ~100k (tokens with a minimum frequency of 50 in the training set), and the response set size is ~30k. It selects the top 15 candidates and de-duplicates them using lexical clustering to suggest three responses. The entire system is implemented on the Cognitive Toolkit CNTK, which provides efficient training and runtime libraries particularly suited to RNN-based architectures.
We analyze the MCVAE model in comparison to this production model (since the production model has gone through numerous rounds of parameter tuning and flights, we consider it a strong baseline). The production model is also used as the control for online A/B testing, so it is natural to use the same model for offline analysis. To train MCVAE, we use the base Matching model, freeze its parameters, and then train the CVAE layers on top. We apply a dropout rate of 0.2 after the initial embedding layer (for both Matching and MCVAE) and use the Adadelta learner for training. We use the loss on a held-out validation set for model selection.
Training data: We sample ~100 million m-r pairs from one-on-one IM conversations. We filter out multi-user and multi-turn conversations since they were difficult to align reliably. We set aside 10% of the data to compute validation losses for model selection. The data is completely eyes-off, i.e., neither the training nor the validation set is accessible for eyes-on analysis.
Response set: To generate the response set, we filter replies from the m-r pairs with spam, offensive-content, and English-vocabulary filters, and clean them of personally identifiable information. Next, we select the top 100k responses based on frequency and then the top 30k based on LM scores. We precompute the LM scores, lexical clusters, and encodings for the response set and embed them inside the inference graphs, as shown in Figures 1 and 2.
Evaluation metrics and set: The model predicts three responses per message, for which we compute two metrics: Defects (a response is deemed incorrect) and Duplicates (at least 2 out of 3 responses are semantically similar). We use crowd-sourced human judgments with at least 5 judges per sample. Judges are asked to provide a binary Yes/No answer on defects and duplicates. A judge consensus (inter-annotator agreement) of 4 and above is counted in the metrics, with 3 deemed no-consensus (around 5% of samples). Since the training/validation sets are not accessible for analysis, we created an evaluation set of 2000 messages using crowd-sourcing for reporting our metrics.
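The consensus rule can be made concrete with a small helper (illustrative only; the actual judging pipeline is external crowd tooling):

```python
def consensus_label(judgments, agree=4):
    """Aggregate binary judge labels (at least 5 judges per sample).

    Returns True/False when `agree` or more judges concur, else None
    (no consensus, ~5% of samples, excluded from the metric).
    """
    yes = sum(judgments)
    if yes >= agree:
        return True
    if len(judgments) - yes >= agree:
        return False
    return None

def metric_rate(all_judgments):
    """Fraction of consensus-labeled samples judged positive (e.g. defective)."""
    labels = [consensus_label(j) for j in all_judgments]
    decided = [l for l in labels if l is not None]
    return sum(decided) / len(decided)
```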
MCVAE parameters: We consider three parameters for ablation studies in MCVAE: the size of the latent vector z, the number of samples s, and the response-pruning size k for constrained sampling. The results are shown in Table 2. The MCVAE numbers (row 2 onwards) are relative to the base Matching model in row 1. First, row 2 shows that a latent vector size of 256 provides a suitable balance between defects and duplicates, but in general the size of the latent variable is not a significant factor in performance. Next, in row 3, we see that the response-pruning size k is an effective control to trade off defects and duplicates. Thus, constrained sampling not only reduces the latency but also provides the quality control required in a production system. In row 4, we see that more samples lead to better metrics, but the improvements are marginal beyond 300 samples. In all cases, MCVAE significantly reduces duplicates (by as much as 40%) without any major increase in defects. We select the model with hyperparameters [z=256, k=15, s=300] for further analysis.
Diversification with LC: The first two rows of Table 3 analyze the impact of LC-based de-duplication. LC can significantly reduce the duplicates in the base Matching model. However, MCVAE (even without LC) reduces the duplicate rates by almost 50%, as shown in column 4 of row 1. Using LC as a post-processing step after MCVAE gives further boosts in diversity (row 2).
Diversification with MMR: Table 3 also reports the impact of MMR re-ranking. For Matching+MMR, duplicates reduce significantly as we increase the MMR parameter λ, but at the cost of increased defects. With MMR+MCVAE, further diversification can be achieved, typically at a lower defect rate. This shows the advantage of MCVAE, which conditions the responses on the message and hence has stronger control on relevance than MMR.
Comparison with other architectures: We have considered two other architectures for our SR system. The first is a standard S2S model with attention Bahdanau et al. (2014), with parameters for the embedding and LSTMs equivalent to our base model, and inference using beam-search decoding with width 15. The second is a feed-forward (instead of LSTM) Matching encoder architecture, equivalent to the one in Henderson et al. (2017). All models use LC for de-duplication after 15 candidate responses are selected. Table 4 validates our architectural preference for Matching/BiLSTM, which has superior performance in terms of defects.
Inference latency: Architecture choices were also driven by latency requirements in our production system. The results are summarized in Table 5 for the different architectures. S2S and unconstrained sampling in MCVAE were unsuitable for production due to their high latencies. With constrained sampling (including MMR), the latency increases only marginally compared to the base model, allowing us to put the model in production.
Online experiments: Offline metrics were used principally for selecting the best candidate models for online A/B experiments. We selected the MCVAE model with parameters [z=256, k=15, s=300] from Table 2. Using our existing production model as the control, and a treatment group consisting of 10% of our IM client users (with the same population properties as the control), we conducted an online A/B test for two weeks. Table 6 shows that the click-rate for MCVAE increased by ~5% overall compared to the Matching model.
Gains were driven by increases in the 2nd (10.3%) and 3rd (6.7%) suggested-reply positions, with virtually no impact in the 1st position. This correlates with our offline analysis, since MCVAE typically differs from the base model at these two positions. Intuitively, the three positions point to the head, torso, and tail intents of responses (validated by the absolute click-rates for each position, not shown in the table). Gains at these positions show that MCVAE extracts diverse responses without sacrificing the relevance of these tail intents.
Driven by these gains, we have switched our production system in Skype to use MCVAE for 100% of users.
5 Related work
Several researchers have used CVAEs Sohn et al. (2015) for generating text Miao et al. (2016); Guu et al. (2018); Bowman et al. (2016), modeling conversations Park et al. (2018), diversifying responses in dialogues Zhao et al. (2017); Shen et al. (2017) and improving translations Schulz et al. (2018). These papers use S2S architectures which we found impractical for production. We demonstrate similar objectives without having to rely on any sequential generative process, in an IR setting.
VAE has also been used in IR Chaidaroon and Fang (2017) to generate hash maps for semantically similar documents, and in top-N recommendation systems Chen and de Rijke (2018). In contrast, we demonstrate semantic diversity of intents in a conversational IR model with MCVAE.
Novelty and diversity are well-studied problems in IR Yue and Joachims (2008); Clarke et al. (2008), where it is assumed that document topics are available (and not latent) during training. The diversification effect shown in Chen and Karger (2006) relies on relevance (click) data, and thus is not directly applicable in our system. MMR Carbonell and Goldstein (1998) is a relevant prior work which we use as a baseline.
6 Conclusions
We formulate the IR-based conversational model as a generative LVM, optimized with the CVAE framework. MCVAE learns to diversify responses from single m-r pairs without any supervision. Online results show that diversity increases the click-rates in our system. Using an efficient constrained sampling approach, we have successfully shipped MCVAE to production.
An increase in click-rates over millions of users is incredibly hard to achieve. We have also experimented with the MCVAE model trained to suggest replies to emails in the Outlook Web App (which has significantly different characteristics than IM) and seen similar gains. The results across domains suggest strong generalization properties of the MCVAE model and validate our hypothesis that increased diversity leads to higher click-rates by encompassing greater semantic variability of intents.
Perhaps the most important quality of MCVAE is that the response vector can be flexibly conditioned on the input and is thus a transduction process. In contrast, in the Matching-IR model, response vectors are precomputed and independent of the input. MCVAE thus opens up new avenues to further improve the quality of responses through personalization and stylization. This is the subject of future work.
Acknowledgments
We gratefully acknowledge the contributions of Lei Cui, Shashank Jain, Pankaj Gulhane and Naman Mody in different parts of the production system on which this work builds. We also thank Chris Quirk and Kieran McDonald for their insightful feedback during the initial development of this work. Finally, we thank our partner teams (Skype, infrastructure, and online experimentation) for their support.
References
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.
 Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL.
 Carbonell and Goldstein (1998) Jaime Carbonell and Jade Goldstein. 1998. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. In SIGIR.
 Chaidaroon and Fang (2017) Suthee Chaidaroon and Yi Fang. 2017. Variational Deep Semantic Hashing for Text Documents. In SIGIR.
 Chen and Karger (2006) Harr Chen and David R. Karger. 2006. Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents. In SIGIR.
 Chen and de Rijke (2018) Yifan Chen and Maarten de Rijke. 2018. A Collective Variational Autoencoder for Top-N Recommendation with Side Information. CoRR, abs/1807.05730.
 Clarke et al. (2008) Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and Diversity in Information Retrieval Evaluation. In SIGIR.
 CNTK. The Microsoft Cognitive Toolkit. https://www.microsoft.com/en-us/cognitive-toolkit/.
 Guu et al. (2018) Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating Sentences by Editing Prototypes. Transactions of the Association of Computational Linguistics, 6:437–450.
 Henderson et al. (2017) Matthew Henderson, Rami Al-Rfou', Brian Strope, Yun-Hsuan Sung, László Lukács, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient Natural Language Response Suggestion for Smart Reply. CoRR, abs/1705.00652.
 Kannan et al. (2016) Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Gregory S. Corrado, László Lukács, Marina Ganea, Peter Young, and Vivek Ramavajjala. 2016. Smart Reply: Automated Response Suggestion for Email. In KDD.
 Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. AutoEncoding Variational Bayes. ICLR.
 Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B. Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In HLT-NAACL.
 Miao et al. (2016) Yishu Miao, Lei Yu, and Phil Blunsom. 2016. Neural Variational Inference for Text Processing. In ICML.
 Microsoft (2018) Microsoft. 2018. Designed to be fast: The Outlook on the web user experience gets simpler and more powerful.
 Park et al. (2018) Yookoon Park, Jaemin Cho, and Gunhee Kim. 2018. A Hierarchical Latent Structure for Variational Conversation Modeling. In NAACL.
 Pasternack and Chakravarthi (2017) Jeff Pasternack and Nimesh Chakravarthi. 2017. Building Smart Replies for Member Messages.

 Schulz et al. (2018) Philip Schulz, Wilker Aziz, and Trevor Cohn. 2018. A Stochastic Decoder for Neural Machine Translation. In ACL.
 Shen et al. (2017) Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, and Guoping Long. 2017. A Conditional Variational Framework for Dialog Generation. In ACL.
 SkypeTeam (2017) SkypeTeam. 2017. Introducing Cortana in Skype.
 Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning Structured Output Representation using Deep Conditional Generative Models. In NIPS.
 Sountsov and Sarawagi (2016) Pavel Sountsov and Sunita Sarawagi. 2016. Length bias in Encoder Decoder Models and a Case for Global Conditioning. In EMNLP.

 Vinyals and Le (2015) Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model. In ICML Deep Learning Workshop.
 Wiseman and Rush (2016) Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-Sequence Learning as Beam-Search Optimization. In EMNLP.
 Yue and Joachims (2008) Yisong Yue and Thorsten Joachims. 2008. Predicting Diverse Subsets Using Structural SVMs. In ICML.
 Zhao et al. (2017) Tiancheng Zhao, Ran Zhao, and Maxine Eskénazi. 2017. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. In ACL.