A Conditional Generative Matching Model for Multi-lingual Reply Suggestion

09/15/2021 ∙ by Budhaditya Deb, et al. ∙ Microsoft

We study the problem of building a multilingual automated reply suggestion (RS) model that serves many languages simultaneously. Multilingual models are often challenged by limited model capacity and severe data distribution skew across languages. While prior works largely focus on monolingual models, we propose Conditional Generative Matching models (CGM), optimized within a Variational Autoencoder framework, to address the challenges arising from multi-lingual RS. CGM does so with expressive message conditional priors, mixture densities to enhance multi-lingual data representation, latent alignment for language discrimination, and effective variational optimization techniques for training multi-lingual RS. These enhancements yield performance that exceeds competitive baselines in relevance (ROUGE score) by more than 10% on average, and by 16% for low resource languages. CGM also shows remarkable improvements in diversity (80%), illustrating its expressiveness in representing multi-lingual data.






1 Introduction

Automated reply suggestion (RS) helps users quickly process Email and chats, in popular applications like Gmail, Outlook, Microsoft Teams, and Facebook Messenger, by selecting a relevant reply generated by the system, without having to type in the response. Most existing RS systems are English mono-lingual models Kannan et al. (2016); Henderson et al. (2017); Deb et al. (2019); Shang et al. (2015). We study the problem of creating multilingual RS models serving many languages simultaneously. Compared to mono-lingual models, a universal multilingual model offers several interesting research questions and practical advantages.

Universal models can save compute resources and maintenance overhead for commercial systems supporting many regions. In addition, they can benefit languages with insufficient data through information sharing from high resource languages, and thus enhance the experience for users, especially in regions with low resource languages. We investigate whether a single multilingual RS model can replace multiple mono-lingual models with better performance, while overcoming the challenges in model capacity, data skew, and training complexity.

Trivially extending existing mono-lingual RS models to the multilingual setting (e.g. by jointly training with pre-trained multi-lingual encoders) tends to be sub-optimal, as multilingual models suffer from the capacity dilution issue Lample and Conneau (2019), where joint training improves performance on low resource languages while hurting the high resource ones. This arises not only from the severe data imbalance and distribution skew across languages, but also from insufficient capacity and a lack of inductive biases in models to represent the multi-modal distribution of languages. We postulate that deep generative latent variable models with variational auto-encoders (VAE) Kingma and Welling (2014) are better suited to model the complex distribution of multi-lingual data, and are more data efficient for low resource languages.

(a) Generative model for MCVAE

(b) Generative model for CGM

(c) Generative model for CGM-M
Figure 1: RS generative models in the continuous space. Text M-R pairs (in discrete space) are encoded into a common continuous space (), where the encoders outputting are considered extraneous to the generative model. The generative process is in the continuous space, with generated conditioned on the input and a Gaussian prior . The figures show three variations of this generative process. In prior work MCVAE, is sampled independently, while in CGM, it is conditional on . CGM-M extends the message conditional prior with a Gaussian Mixture prior and a categorical prior .

To this end, we propose the Conditional Generative Matching Model (CGM), a VAE based retrieval architecture for RS to solve the above challenges. CGM enhances multilingual representation through: 1) expressive message conditional priors, 2) multi-component mixture density to represent different modalities of languages, and 3) alignment of latent components for language discrimination. In addition CGM incorporates training optimizations in the form of 1) loss regularizer, 2) learnable weights for loss components, 3) multi-sample loss estimation with variance scaling, and 4) focal loss, all of which lead to balanced representation and smooth convergence, a key challenge for variational training in multilingual settings.

We conducted extensive ablation studies and comparisons with two competitive baselines to show the impact of the above optimizations. Universal CGM models improve the relevance of RS (up to 13% excluding English) with even higher gains coming for low resource languages (16%), and when using CGM in a monolingual setting (19%). CGM also dramatically increases the diversity of suggested replies by 80% which is more illustrative of the improved representational capability of CGM in the multi-lingual landscape. CGM achieves this with relatively small increase in model sizes compared to the large pre-trained transformer stacks on which it is built, showing the modeling efficiencies that can be achieved through efficient training of latent variable models in a multi-lingual setting.

2 Background and Preliminaries

While RS has been modeled as a sequence to sequence model Kannan et al. (2016), it more commonly appears as an information retrieval (IR) system that ranks responses from a fixed set Henderson et al. (2017, 2019); Ying et al. (2021); Swanson et al. (2019); Zhou et al. (2016, 2018), due to better control over quality and relevance in practical systems. We briefly describe two retrieval architectures from prior literature which serve as the baselines for our multilingual RS model.

Matching model Henderson et al. (2017); Ying et al. (2021) consists of two parallel encoders that encode message and reply (M-R) pairs into a common encoding space, and is trained to maximize a normalized dot product between the M-R encodings. During prediction, the model finds the nearest neighbors of the message encoding among precomputed encodings of a fixed response set. A language model bias is typically added to promote more common responses. The matching architecture is summarized as:


Matching Conditional VAE (MCVAE) Deb et al. (2019) induces a deep generative latent variable model on the matching architecture, where a candidate response encoding is generated conditioned on the message encoding and a latent prior. The generated encoding is used to match an actual response vector from the fixed response set. The generative model of MCVAE is shown in Figure 1(a). In MCVAE, the encoders are pretrained using the matching formulation and kept frozen during training. For prediction, MCVAE samples response vectors from the generator, followed by scoring (eq 2) and a voting technique to rank replies over a fixed response set. MCVAE is trained in the variational framework by minimizing the negative evidence lower-bound (ELBO) in equation 3, with a Gaussian posterior (with parameterized mean and covariance) and the reconstruction loss defined by Eq. (1).


We extend the Matching and MCVAE models to a multi-lingual setting by using pretrained multi-lingual BERT (MBERT) Devlin et al. (2019) as the text encoders, similar to Ying et al. (2021), and by jointly training the models for all languages.
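The retrieval step of the Matching model described above can be illustrated with a minimal sketch. It assumes the message and the candidate responses have already been encoded as vectors; the function names, the toy vectors, and the way the language model bias is mixed in (an additive term scaled by a hypothetical `alpha`) are illustrative assumptions, not the formulation used in the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def matching_scores(message_vec, response_vecs, lm_bias=None, alpha=0.0):
    """Normalized dot product between one message and each candidate response.

    lm_bias (optional) holds one language-model score per response; adding it
    with weight alpha is an assumed way of promoting common responses.
    """
    m = normalize(message_vec)
    scores = []
    for i, r in enumerate(response_vecs):
        dot = sum(a * b for a, b in zip(m, normalize(r)))
        bias = alpha * lm_bias[i] if lm_bias is not None else 0.0
        scores.append(dot + bias)
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring responses."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy 3-dimensional encodings (hypothetical values).
message = [0.9, 0.1, 0.0]
responses = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
ranked = top_k(matching_scores(message, responses), k=2)
```

In a real system the response encodings would be precomputed once for the whole response set, so only the message needs to be encoded at prediction time.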

3 CGM: A Conditional Generative Matching Model for Reply Suggestion

Our initial analysis with universal models (jointly training models with all languages) reveals that the universal MCVAE performs better than Matching. However, simply training models jointly is not sufficient to achieve high-performing models. First, the highly imbalanced nature of multi-lingual data leads to over- or under-fitting across languages, resulting in performance worse than separately trained mono-lingual models. Second, training the multi-lingual MCVAE proved difficult due to its reliance on a pretrained Matching model: it is not clear how to find a suitable Matching model checkpoint for initializing the MCVAE. Finally, since the text encoders for MCVAE are frozen during training, there is limited cross-lingual transfer and improvement for low resource languages. Unfreezing the layers led to divergence of the model.

To address the limitations of MCVAE, we propose an enhanced Conditional Generative Matching (CGM) model, for the retrieval based RS with inductive biases for the multi-lingual data and effective training techniques for creating high quality universal models.

3.1 Message Conditional Prior

The implied generative process in MCVAE (Fig. 1(a)), is , where the latent prior is sampled independent of the message encoding . However, in RS since is always observed, ideally one would like to sample from to capture message-dependent information as well as rich multi-modality of the input space, particularly for multi-lingual data. In addition, although MCVAE works well empirically in the mono-lingual setting Deb et al. (2019), the samples from in general are not the same as , unless is uniform across the space of . This is a restrictive assumption, which motivates us to consider a prior conditioned on the input for the generative model, by decomposing


as shown in Figure 1(b). The conditional prior is posed to encode message dependent information, which can facilitate matching a more relevant and diverse set of responses. We define the message-conditional prior , where the prior parameters are learnt from data during training and used for prediction, to maximally capture the multiple modalities of intents and the intrinsically complex distribution of multi-lingual data.

3.2 Prior with Mixture Density (CGM-M)

We postulate that a more expressive conditional prior, such as a mixture density, can better capture the multi-lingual data than the single prior density used above. That is, the different components of a mixture density can represent different languages and allow independent representation across languages. To this end we extend the message conditional prior with a Gaussian Mixture model (GMM) as,


where , are the message dependent means and diagonal covariances for the th component of the GMM, and are the message dependent prior mixing coefficients. We hypothesize that components would correspond to different intents and languages, thus providing additional inductive bias for multi-lingual data. We refer to the mixture variant as CGM-M (Figure 1(c)).
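As a rough illustration of sampling from a message-conditional Gaussian mixture prior, the sketch below first draws a component from the mixing coefficients and then draws a latent vector from that component's diagonal Gaussian using the reparameterization form (mean plus scaled standard noise). In the model these parameters would be produced by networks conditioned on the message; here `mix_logits`, `means`, and `log_vars` are hypothetical fixed numbers, and all names are assumptions made for the example.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_gmm_prior(mix_logits, means, log_vars, rng):
    """Draw (component index, latent vector, mixing coefficients).

    means[k] and log_vars[k] are the mean and log-variance vectors of the
    k-th diagonal Gaussian component.
    """
    pi = softmax(mix_logits)
    k = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    # Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, 1).
    z = [mu + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
         for mu, lv in zip(means[k], log_vars[k])]
    return k, z, pi

# Hypothetical parameters for a 3-component mixture over a 2-d latent space.
rng = random.Random(0)
mix_logits = [1.0, 0.0, -1.0]
means = [[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]]
log_vars = [[0.0, 0.0], [-1.0, -1.0], [-1.0, -1.0]]
k, z, pi = sample_gmm_prior(mix_logits, means, log_vars, rng)
```

With a single component this reduces to sampling from one message-conditional Gaussian, which corresponds to the non-mixture prior of Sec. 3.1.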

3.3 Aligning Latent Space to Language

To further reinforce the notion that the CGM-M latent components encode language specific information from M-R pairs, we pose an additional constraint that the language of the message be inferred from the prior mixture coefficient. This is instantiated by building a simple classifier network, trained with a classification loss, to map the prior mixture coefficient onto the language of the message. We also tested mapping 1) the means and variances, and 2) samples of the GMM, and found that mapping the mixture coefficient leads to the best results. The classifier is learned jointly with the rest of the components.

3.4 Variational Training Architecture

The CGM models are formulated as a VAE in the continuous space of . CGM includes two multi-lingual text encoders , to convert the raw text of M-R into the common encoding space (encoders may be considered extraneous to the VAE but are learnt jointly with VAE layers), and a VAE with prior, posterior, and generation networks .

CGM-M extends CGM with category specific Gaussian components. In addition, it includes a categorical prior and posterior, and a language classifier to discriminate between languages. We use the standard reparameterization trick for the Gaussian variables and the Gumbel-Softmax trick Jang et al. (2017) with hard sampling for the categorical variable. CGM-M (CGM is the special case with a single mixture component) is summarized as follows.


Above, we expand the dimensions of projection vectors to where is the dimension of the forward projections and is the number of categories in the mixture. After the category is selected (using Gumbel Softmax), we use the category index to select part of the expanded projections, as the component of the means and variances . Each denotes a two-layer feed-forward network (except which has 3 layers) with tanh activation and denotes vector concatenation.
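The category selection and slicing step described above can be sketched as follows. This is a plain-Python illustration under stated assumptions: the function names, the temperature value, and the flat layout of the expanded projection (component k occupying positions k*d to (k+1)*d) are hypothetical, and a real implementation would use an autodiff framework so that gradients flow through the soft probabilities (straight-through estimation).

```python
import math
import random

def gumbel_softmax_hard(logits, tau, rng):
    """Return (one_hot, soft) for one Gumbel-Softmax draw.

    soft is the relaxed sample; one_hot is its arg-max ("hard" sampling).
    """
    # Gumbel(0, 1) noise: -log(-log(u)) with u in (0, 1).
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    y = [(l + g) / tau for l, g in zip(logits, noise)]
    mx = max(y)
    exps = [math.exp(v - mx) for v in y]
    total = sum(exps)
    soft = [e / total for e in exps]
    k = max(range(len(soft)), key=lambda i: soft[i])
    one_hot = [1 if i == k else 0 for i in range(len(soft))]
    return one_hot, soft

def select_component(expanded, k, d):
    """Pick the k-th block of size d from a projection of size d * K."""
    return expanded[k * d:(k + 1) * d]

rng = random.Random(0)
one_hot, soft = gumbel_softmax_hard([2.0, 0.5, -1.0], tau=0.5, rng=rng)
k = one_hot.index(1)
expanded_means = [0.0, 0.1, 1.0, 1.1, 2.0, 2.1]  # K = 3 components, d = 2
component_mean = select_component(expanded_means, k, d=2)
```

The same slicing would be applied to the expanded variance projection, so each category owns its own mean and variance vectors.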

Note that the posteriors are conditioned on both and . This theoretically provides a richer representation of the M-R pairs and during inference allows us to score the combination of message and the selected response vectors. However, during training, it can lead to leakage through the network where the model simply ignores the message and uses the response vector for generation. We mitigate the leakage by applying a low-dimensional projection of response vector before feeding into the variational network.

Following standard stochastic gradient variational Bayes (SGVB) training, we minimize the negative ELBO to train the network. CGM-M adds the classifier loss to enforce alignment between latent vectors and language types. The training objectives for each model are given as follows,


where the reconstruction log-loss, is given by Eq. (1). For CGM, the KL divergence between the two multivariate Gaussian densities can be computed in closed form. However, for CGM-M, the KL divergence between two Gaussian mixtures does not admit a closed form. We estimate it with a variational approximation method described in Hershey and Olsen (2007). (Another approach with Monte-Carlo sampling requires a large number of samples and was not as effective.)
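The two KL terms mentioned above can be sketched in a few lines. The first function is the standard closed-form KL divergence between diagonal Gaussians. The second follows the variational approximation of Hershey and Olsen (2007) for two mixtures f and g, which combines pairwise component KLs as sum_a pi_a log( sum_a' pi_a' exp(-KL(f_a || f_a')) / sum_b omega_b exp(-KL(f_a || g_b)) ). The data layout (each component as a `(means, variances)` pair) and the toy numbers are assumptions for the example; how the paper wires this into its loss is not reproduced here.

```python
import math

def kl_diag_gauss(mu0, var0, mu1, var1):
    """Closed-form KL( N(mu0, diag var0) || N(mu1, diag var1) )."""
    return 0.5 * sum(
        math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0
        for m0, v0, m1, v1 in zip(mu0, var0, mu1, var1)
    )

def kl_gmm_variational(pi, f, omega, g):
    """Variational approximation of KL(f || g) for Gaussian mixtures.

    pi / omega are the mixing weights of f / g; f and g are lists of
    (means, variances) tuples. Very large component KLs can underflow
    exp(); a log-sum-exp formulation would be safer in practice.
    """
    total = 0.0
    for a, (mu_a, var_a) in enumerate(f):
        num = sum(pi[a2] * math.exp(-kl_diag_gauss(mu_a, var_a, *f[a2]))
                  for a2 in range(len(f)))
        den = sum(omega[b] * math.exp(-kl_diag_gauss(mu_a, var_a, *g[b]))
                  for b in range(len(g)))
        total += pi[a] * math.log(num / den)
    return total

# Toy posterior (f) and prior (g) mixtures over a 2-d latent space.
f = [([0.0, 0.0], [1.0, 1.0]), ([2.0, 2.0], [0.5, 0.5])]
g = [([0.5, 0.0], [1.0, 1.0]), ([2.0, 1.0], [1.0, 1.0])]
approx_kl = kl_gmm_variational([0.6, 0.4], f, [0.5, 0.5], g)
```

Two sanity checks follow directly from the formula: the approximation is zero when both mixtures are identical, and with one component on each side it reduces to the closed-form Gaussian KL.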


3.5 Training Optimizations

Training deep generative models with SGVB is known to be notoriously tricky Bowman et al. (2016); Fu et al. (2019). Our multilingual setting and the joint training of text encoders with VAE layers make it even more challenging. We employed several optimizations to improve the convergence of the models.

1) Matching loss regularization: In CGM, the encoders for are learnt jointly with the VAE layers in order to maximize richness of shared latent representation across languages. Thus is a moving target for the VAE generator outputting and causes the training to diverge without additional constraints. In MCVAE, this was mitigated by initializing and freezing the text encoders from a trained Matching model, but can be counter-productive in the multilingual scenario. To enable joint training of text encoders and the VAE layers, and mitigate the issue of a moving target for reconstruction, we introduce a regularization in the form of a matching score between and ,


which constrains the response vector to have a representation close to the message vector. This provides an independent anchor for the reconstruction and allows the end-to-end training of the model utilizing the full parameter space of the encoders for enhanced representation.

2) Multi-sample variance scaling: In SGVB, using a single sample of usually results in high variance in the ELBO estimate. One remedy is to estimate the ELBO with multiple samples, in either the non-weighted or the importance weighted Burda et al. (2016) version. However, these led to only minor improvements.

In multi-sample training we take the expectation of the ELBO over the samples. We found that if instead we first take the expectation of the samples before computing the ELBO loss, we can reduce the variance and stabilize the training. Since follows an equivalent distribution , we can estimate ELBO with multiple samples drawn from the scaled distribution and compute the expectation as follows. The adjustment provides significant improvements in training convergence and metrics.
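The statistical fact behind this adjustment can be checked numerically: the mean of n independent draws from a Gaussian with standard deviation sigma is itself Gaussian with standard deviation sigma / sqrt(n). The sketch below compares the empirical variance of such sample means with draws taken directly from the scaled distribution. It is only an illustration of this equivalence for a scalar Gaussian; the variable names and values are assumptions, and it does not reproduce the paper's ELBO estimator.

```python
import math
import random

def mean_of_samples(mu, sigma, n, rng):
    """Average of n independent draws from N(mu, sigma^2)."""
    return sum(rng.gauss(mu, sigma) for _ in range(n)) / n

def scaled_sample(mu, sigma, n, rng):
    """One draw from the equivalent scaled distribution N(mu, sigma^2 / n)."""
    return rng.gauss(mu, sigma / math.sqrt(n))

def variance(xs):
    """Population variance of a list of floats."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
mu, sigma, n = 1.0, 2.0, 16
trials = 20000
var_of_means = variance([mean_of_samples(mu, sigma, n, rng) for _ in range(trials)])
var_of_scaled = variance([scaled_sample(mu, sigma, n, rng) for _ in range(trials)])
expected = sigma ** 2 / n  # both empirical variances should be close to this
```

Drawing from the scaled distribution gives the same distribution as the averaged sample at the cost of a single draw, which is one way to read why the variance of the resulting loss estimate shrinks.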


3) Weighting loss components with Homoscedastic Uncertainty (HSU): The final loss formulations for both CGM and CGM-M have several components. For finer control of training, we introduce learnable weights for each of the components. Weighting different components of the ELBO loss has been shown to improve performance Higgins et al. (2017) in SGVB, and thus such a weighting process is recommended even without additional components.

Following Cipolla et al. (2018), we view the loss formulation as a multi-task learning objective with a different homoscedastic uncertainty (HSU) for each task. Assuming the components factorize to Gaussian (continuous) and discrete (cross-entropy) likelihoods, the loss with HSU can be viewed as:


Equating the uncertainties with the weights in our loss equation, this can be seen as learning the relative weights for each component where and provides a smooth, regularized and differentiable interpretation of weights. We introduce the weights as parameters in the model and learn them jointly with rest of the network.
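A minimal sketch of uncertainty-based weighting, in the commonly used form from the multi-task learning literature cited above: a Gaussian-likelihood component L with uncertainty sigma contributes L / (2 sigma^2) + log sigma, and a cross-entropy component contributes approximately L / sigma^2 + log sigma. The sketch parameterizes each uncertainty by s = log sigma^2 for numerical stability. Which of the paper's loss components are treated as Gaussian or discrete is not recoverable from the text, so the grouping and names below are assumptions.

```python
import math

def hsu_weighted_loss(gaussian_losses, discrete_losses,
                      log_vars_gauss, log_vars_disc):
    """Combine loss components using homoscedastic-uncertainty weights.

    Each s = log(sigma^2) would be a learnable scalar in a real model.
    """
    total = 0.0
    for loss, s in zip(gaussian_losses, log_vars_gauss):
        # L / (2 sigma^2) + log sigma, with log sigma = s / 2.
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    for loss, s in zip(discrete_losses, log_vars_disc):
        # L / sigma^2 + log sigma.
        total += math.exp(-s) * loss + 0.5 * s
    return total

# With s = 0 (sigma = 1) the weights are 1/2 and 1.
combined = hsu_weighted_loss([2.0], [3.0], [0.0], [0.0])
```

The log sigma terms act as the regularizer: without them the optimizer could drive every weight to zero by inflating the uncertainties. For a fixed Gaussian component L, the contribution is minimized at s = log L, so noisier (larger) loss components are automatically down-weighted.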

4) Handling data skew with Focal Loss (FL): Multilingual training can have different convergence rates across languages, akin to behaviors observed in multi-modal training Wang et al. (2020b). Carefully configured sampling ratios for different languages can alleviate this problem but require a costly hyper-parameter search. Instead we employ a popular technique for handling skewed data distributions: the focal loss (FL) Lin et al. (2020).


The FL (with ) is applied on the reconstruction log-probability component of ELBO, such that strongly reconstructed vectors are given lower weights than the weakly reconstructed ones, which balances the convergence across languages.
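For reference, the focal loss in its usual form scales the log-loss of a probability p by (1 - p)^gamma. The sketch below applies it to a reconstruction probability; the value gamma = 2.0 is only a commonly used default chosen for illustration, not the value used in the paper, and the function name is an assumption.

```python
import math

def focal_loss(p, gamma):
    """Focal loss for a probability p in (0, 1]: -(1 - p)^gamma * log(p)."""
    return -((1.0 - p) ** gamma) * math.log(p)

gamma = 2.0  # assumed value for illustration only
easy = focal_loss(0.9, gamma)   # strongly reconstructed example
hard = focal_loss(0.2, gamma)   # weakly reconstructed example
plain_easy = -math.log(0.9)
plain_hard = -math.log(0.2)
# Fraction of the plain log-loss that survives the focal weighting.
easy_ratio = easy / plain_easy
hard_ratio = hard / plain_hard
```

With gamma = 0 the expression reduces to the ordinary log-loss; larger gamma suppresses well-reconstructed examples more strongly, so examples from languages that have already converged contribute less to the gradient.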

3.6 Prediction and Ranking Responses

During prediction, we rank and select responses from a fixed response set . Since the models generate response vectors in the continuous space, the prediction process needs to convert the samples into ranking in the discrete space of responses. The process is described as follows.


For each message we generate 1000 samples of latent conditional priors from and from the categorical prior for CGM-M. Next, we generate samples of the response vectors using the generator network, . We compute the scores for the generated samples with respect to the fixed response set in eq. 23, where the KL divergence is directly computed on the samples under a Normal or GMM distribution for the prior and posterior. To reduce the scoring overhead over 40k responses with 1000 samples, we pre-select top ( provides sufficiently diverse candidates) using the matching score (eq. 2). Finally, the mean reciprocal ranks (MRR) over all the samples (eq. 24) are used to select the top 3 as our predicted responses.
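The final voting step can be sketched as follows, assuming each generated sample has already been scored against the pre-selected candidate responses. For every sample the candidates are ranked by score, and each candidate accumulates the reciprocal of its rank; the candidates with the highest mean reciprocal rank are returned. The function name and the toy score matrix are hypothetical, and the scoring function itself is not reproduced here.

```python
def rank_by_mrr(sample_scores, top_n=3):
    """Rank candidates by mean reciprocal rank across samples.

    sample_scores[s][c] is the score of candidate c under sample s
    (higher is better). Returns (top candidate indices, MRR per candidate).
    """
    n_cand = len(sample_scores[0])
    mrr = [0.0] * n_cand
    for scores in sample_scores:
        order = sorted(range(n_cand), key=lambda i: scores[i], reverse=True)
        for rank, idx in enumerate(order, start=1):
            mrr[idx] += 1.0 / rank
    mrr = [v / len(sample_scores) for v in mrr]
    top = sorted(range(n_cand), key=lambda i: mrr[i], reverse=True)[:top_n]
    return top, mrr

# Three samples scored against four candidate responses (toy numbers).
sample_scores = [
    [0.9, 0.1, 0.5, 0.2],
    [0.8, 0.3, 0.9, 0.1],
    [0.7, 0.2, 0.6, 0.4],
]
top3, mrr = rank_by_mrr(sample_scores, top_n=3)
```

Because the vote aggregates ranks rather than raw scores, a candidate that is consistently near the top across samples is preferred over one that scores very highly under only a few samples.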

Model     | Latent Factors | Cond. Prior | Mix. Density | Language alignment | Multilingual training opts
Matching  | -              | -           | -            | -                  | -
MCVAE     | ✓              | -           | -            | -                  | -
CGM       | ✓              | ✓           | -            | -                  | ✓
Table 1: Comparison of components of Matching, MCVAE (Sec 2), CGM, and CGM-M (Sec 3)

Figure 2: Main results. With the Matching monolingual models as the baseline, the figures show the % changes in metrics for model variants (see Sec 4 for model description and Sec 4.1 for discussion). For each model variant, we show the metrics across three language groups (All, w/o-EN, and bottom 10 low resource languages). (Left) Relevance (Right) Diversity.

4 Experiments

Multi-lingual data: We use the MRS (Multi-lingual Reply Suggestions) data set Zhang et al. (2021) for our experiments. MRS consists of message-reply (M-R) pairs from Reddit conversations Baumgartner et al. (2020), separated into different languages using the FastText detector (Joulin et al., 2016). We select the top 15 languages for experimentation (data volume was insufficient for others) with an 80% split for training (2nd column in Table 4) and the rest for validation and test. We create response sets from the most frequent responses (>20 frequency) in the M-R pairs. For low resource languages, we augment this natural set with machine translated responses from EN, resulting in k responses for each language.

Metrics: We use ROUGE Lin (2004) for scoring the relevance of the 3 predicted responses against the reference response. We also compute the self-ROUGE Celikyilmaz et al. (2020) within the 3 responses as a measure of diversity. For both, we report the average of the ROUGE-F1 for 1/2/3-grams across the three responses.
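To make the two metrics concrete, the sketch below computes an n-gram ROUGE F1 with clipped counts, averages it over n = 1, 2, 3 and over the predicted responses for relevance, and averages it over pairs of predictions for self-ROUGE (lower means more diverse). Whitespace tokenization, lower-casing, and the exact order of averaging are assumptions of this example; the paper's evaluation scripts may differ.

```python
from collections import Counter
from itertools import permutations

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n):
    """ROUGE-N F1 between two strings, using clipped n-gram overlap."""
    c = ngrams(candidate.lower().split(), n)
    r = ngrams(reference.lower().split(), n)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def relevance(predictions, reference, orders=(1, 2, 3)):
    """Mean ROUGE F1 of the predictions against the reference reply."""
    vals = [rouge_n_f1(p, reference, n) for p in predictions for n in orders]
    return sum(vals) / len(vals)

def self_rouge(predictions, orders=(1, 2, 3)):
    """Mean ROUGE F1 between distinct predictions (lower = more diverse)."""
    vals = [rouge_n_f1(a, b, n)
            for a, b in permutations(predictions, 2) for n in orders]
    return sum(vals) / len(vals)

preds = ["thank you so much", "sounds good", "see you soon"]
rel = relevance(preds, "thank you")
div = self_rouge(preds)
```

A set of near-duplicate suggestions such as "thanks", "thanks!" and "thanks a lot" would score well on relevance for a thankful reference but poorly on self-ROUGE, which is why both metrics are reported.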

Train parameters: We use the multi-lingual version of the pretrained BERT model (MBERT) Devlin et al. (2019) as our text encoders, for which we use Hugging Face's transformers library Wolf et al. (2020). We freeze the embedding layer of the MBERT encoders, which reduces training overhead and preserves the cross-lingual representation without impacting performance Lee et al. (2019); Peters et al. (2019). We use a dimension size of 512 for the VAE layers. For CGM-M we set the number of categories to .

We train with the Adam optimizer (peak rate: , exp. decay: after warm up of 1000 steps), batch size of 256, and m-r pairs truncated to length 64 and 32 respectively. We add language tokens (e.g. EN, PT) before m-r pairs as additional language identifier. All the model sizes are relatively similar (1.3GB to 1.5GB) since most parameters are in the two MBERT encoders with 12 transformer layers (each around 700MB).

Multilingual training: We uniformly sample languages such that models have equal exposure to each language during training. This leads to good performance across all languages except EN. Alternatively, sampling proportionate to data volumes had good performance for EN but led to severe under-fitting for most languages other than EN, as EN dominates the training with orders of magnitude more data. The ideal sampling is somewhere in between, but requires extensive search to optimize. On single NVIDIA V100 GPUs, models converge within 1-2 epochs over the entire data (i.e., 1-2 epochs for EN and multiple epochs for others). Joint training amortizes the training costs, and can be used even when targeting monolingual models, by saving per-language checkpoints.
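The two sampling schemes discussed above, and the space "in between", can be illustrated with a single exponent on the data volume: probabilities proportional to size^alpha give uniform sampling at alpha = 0 and volume-proportional sampling at alpha = 1. This exponent-based interpolation is a common device in multilingual training and is shown here as an assumption for illustration; the experiments in this paper use uniform sampling. Apart from EN (about 49M pairs, as reported in Sec 4.3), the language codes and sizes are hypothetical.

```python
def sampling_probs(sizes, alpha):
    """Per-language sampling probabilities proportional to size ** alpha."""
    weights = {lang: float(n) ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# EN volume from the text; the other two sizes are made-up placeholders.
sizes = {"EN": 49_000_000, "XX": 1_000_000, "YY": 500_000}

uniform = sampling_probs(sizes, alpha=0.0)       # equal exposure per language
proportional = sampling_probs(sizes, alpha=1.0)  # EN dominates
in_between = sampling_probs(sizes, alpha=0.5)    # one possible compromise
```

Under proportional sampling EN receives roughly 97% of the draws in this toy setup, which matches the under-fitting of the other languages described above; tuning alpha is the kind of search that uniform sampling combined with the focal loss is meant to avoid.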

Model variants: We analyze 4 models: Matching, MCVAE, CGM and CGM-M (Table 1). For each we consider 3 multilingual model variants. [Mono]: individually trained monolingual models on each language. [Uni]: jointly trained universal model with a single checkpoint for evaluation. [Mono*]: jointly trained model with per-language checkpoints (saved when the validation metrics peak for each language) for evaluation. Since models peak at different points for each language, Mono* is expected to perform better than its Uni counterpart with a single checkpoint.
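The bookkeeping behind the Mono* variant can be sketched as a small tracker that remembers, for each language, the training step at which its validation metric was highest. The class name, the metric values, and the idea of returning the languages whose checkpoint should be re-saved are assumptions for illustration; actual checkpoint saving is left out.

```python
class PerLanguageCheckpoints:
    """Track the best validation metric and training step per language."""

    def __init__(self):
        self.best = {}  # language -> (metric, step)

    def update(self, step, val_metrics):
        """Record new bests; return languages whose checkpoint should be saved."""
        improved = []
        for lang, metric in val_metrics.items():
            if lang not in self.best or metric > self.best[lang][0]:
                self.best[lang] = (metric, step)
                improved.append(lang)
        return improved

tracker = PerLanguageCheckpoints()
first = tracker.update(1000, {"EN": 0.20, "PT": 0.15})
second = tracker.update(2000, {"EN": 0.22, "PT": 0.14})  # PT peaked earlier
```

After the second evaluation the tracker keeps step 2000 for EN and step 1000 for PT, which is the situation where a per-language checkpoint can outperform a single universal one.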

4.1 Main Results

Figure 2 shows the relevance and diversity metrics for different model variants. With Matching-Mono models (trained individually per language) as the baseline, we plot the % changes in metrics for the other model variants. Models are trained on all languages, with relevance metrics shown in 3 language groups: 1) All, 2) All w/o EN, and 3) Bottom 10 low resource languages, to highlight the differences from data volumes in languages. (Here we present quantitative results. For qualitative analysis, multi-lingual text predictions are provided in the appendix.)

Relevance (Figure 2-Left): Compared to the individually trained monolingual Matching model, the universally trained Matching-Uni regresses on all three language groups, while MCVAE-Uni improves for the latter two groups (w/o EN and bottom 10 languages). CGM-Mono improves the metrics across all three language groups. Thus, even without joint training, CGM by itself is better than the baselines and raises the bar which the universal models need to match or overcome.

The CGM and CGM-M universal models improve on all the language groups, although for CGM-Uni there is a regression in the All-languages group compared to CGM-Mono (more discussion later). However, CGM-M-Uni, with around a 5% increase, is actually slightly better than CGM-Mono, showing that we can replace the monolingual models with a single universal model. Next, the Mono* models (universally trained but with the best per-language checkpoints saved) can achieve even bigger gains, and CGM-M-Mono* surpasses other models in every language group.

Within language groups, we observe increases of up to 16% without EN and up to 19% for the bottom 10 languages. EN, with two orders of magnitude more data, remains severely under-fitted in all the jointly trained models, due to which the metric improvements in the All-languages group remain low.

Diversity (Figure 2-Right): The CGM performance is most striking for the diversity metrics, where we see 80% improvements. That the diversity improvements exceed the relevance gains illustrates that the deep generative modeling enhancements in CGM lead to a richer representation of multilingual data, with improved discrimination and disentanglement of language and latent intents in M-R pairs. CGM-M achieves high diversity on top of the best relevance metrics, showing the enhanced representation through mixture models.

Figure 3: Ablation studies for different training optimizations (Sec 3.5) with results discussed in Sec 4.2.
Figure 4: Relevance metrics across 15 languages. (Model description in Sec 4 and discussion in Sec 4.3)

4.2 Ablation Studies

We conducted extensive ablation studies with the different model variants and training optimizations, and summarize the results in Figure 3. For ablations we report the metrics for the language group without EN, as the significantly higher data volume in EN can confound the results.

Baselines: We use the Matching-Uni model (line 1) as the baseline. MCVAE (line 2) improves both relevance (4.8%) and diversity (27%), which shows the potential of deep generative models.

Training optimizations with CGM: The basic CGM-Uni model (line 3) and CGM-M (line 7) show modest relevance gains compared to MCVAE. We attribute the modest gains to the complexities of end-to-end training of CGM. Through the training optimizations of variance scaling, and FL and HSU (lines 4, 5), CGM can comfortably surpass MCVAE in relevance (12.8%) and double the diversity (59%). CGM-M shows a similar increase (13.87%) with variance scaling (line 8), and FL and HSU (line 9), outperforming the best achieved with CGM. The biggest improvements come from multi-sample variance scaling (lines 4, 8), with additional improvements from FL and HSU (lines 5, 9). Overall, the optimizations lead to more stable training and faster convergence across languages. They also alleviate the need for manual tuning for skewed data and loss component weights, making the training process virtually hyper-parameter free.

Language Mapping in CGM-M: One key reason for improved performance with CGM-M is the potential inductive bias for languages through the mixture components, which can be further boosted by explicit mapping of latent vectors to languages. Language mapping improves the relevance to 14.6% (line 10) over the baseline. We also see a slight boost in diversity showing the improved modeling of the multi-lingual distribution using this approach.

Posterior conditioned on both message and response: The joint conditioning of the posterior on both vectors (we had excluded one of them from the posterior of the other configurations to show this effect) gives the best relevance for both CGM and CGM-M (lines 6, 11), with CGM-M exceeding all other variants. More interesting is the substantial improvement in diversity (80%), which illustrates that it encourages a richer representation in the prior, perhaps by better disentangling latent intents and language characteristics. We note here that, in CGM-M, using the full dimension (768) led to a high level of leakage through the posterior (the multiple components of the mixture further aid the leakage). We use a low dimensional projection of size 16 in CGM-M to mitigate the issue.

4.3 Analysis across Languages Groups

Next, we discuss the performance breakdown of models across individual languages. Figure 4 expands the Relevance metrics from Figure 2 for all languages. As before, we use Matching-Mono as the baseline, and list the % changes over this baseline for each model and language.

We see that all jointly trained variants (Uni and Mono*) suffer from severe under-fitting for EN. In fact, if we simply remove EN from the metrics, the CGM variants vastly improve upon the monolingual versions. With almost two orders of magnitude more data in EN (49M), it remains challenging to have good performance simultaneously for EN and other languages without additional tricks. In general the improvements are smaller for the top 5 high-resource languages, which can be attributed to a lesser impact from information sharing and lower exposure of these languages due to uniform sampling. Such issues have been reported in prior literature as capacity dilution (Johnson et al., 2017; Conneau et al., 2020; Wang et al., 2020a), where there is always a trade-off between low and high resource languages. CGM, while not completely eliminating it, largely mitigates the issue.

The impact of CGM with joint training is more pronounced for the bottom 10 language group. For example we see 15.49% improvement for CGM-M compared to only 3.67% for MCVAE-Uni. Finally, we see improvements of 15.76% for CGM-Mono* and 18.86% for CGM-M Mono* models, illustrating that even if we target mono-lingual models, CGM can take advantage of shared learning through joint training while saving compute.

The improvements for low resource languages show that CGM is more data efficient due to its model enhancements, while the prevention of regressions for high resource languages shows more balanced learning through the training optimizations. The fact that these relevance improvements come in addition to 80% improvements in diversity shows the remarkable effectiveness of CGM in representing the multi-modal landscape of multi-lingual RS.

5 Related Work

VAEs have been used in retrieval based Q&A Yu et al. (2020), document matching Chaidaroon and Fang (2017), and recommendations Chen and de Rijke (2018). CGM for RS is most closely related to MCVAE Deb et al. (2019) but differs in its expressive conditional priors, multi-component mixture density priors, language alignment, and training optimizations, which make it effective in a multi-lingual setting.

For multi-task scenarios, VAEs can offer significant modeling efficiencies Cao and Yogatama (2020); Rao et al. (2019) with additional improvements through mixture model priors, e.g. in Dilokthanakul et al. (2017); Yang et al. (2019) for unsupervised clustering, in Lee et al. (2021) for unsupervised meta-learning, and in Shi et al. (2019) as a multi-modal variational mixture-of-experts.

VAEs can also improve multilingual representation for low resource languages, e.g. in models like BERT Li et al. (2020), in Wei and Deng (2017) for document classification, in Chorowski et al. (2019) for disentangling phonemes for speech synthesis, and in Zhang et al. (2016); Eikema and Aziz (2019) for neural machine translation. VAEs can improve diversity in language generation and retrieval tasks Zhao et al. (2017); Tran et al. (2017); Shen et al. (2017); Deb et al. (2019) through better modeling efficiencies. Such results motivated us to apply VAEs for multilingual RS.

We may also consider alternatives to VAEs, such as training auxiliary tasks with adapters Houlsby et al. (2019), adversarial learning (Chen et al., 2018, 2019; Huang et al., 2019), and mixing pre-training and fine-tuning (Phang et al., 2020), to improve modeling in the multilingual setting. This is a subject of future work. We also plan to experiment with higher capacity multilingual encoders such as XLM-R Lample and Conneau (2019) and InfoXLM Chi et al. (2021) to further improve the performance. However, the choice of the base encoder is orthogonal to the improvements (especially on diversification) shown in this paper.

As noted in prior work, multilingual training can have capacity dilution issues (Johnson et al., 2017; Conneau et al., 2020; Wang et al., 2020a). Overall, multilingual models are closing the gap with their monolingual counterparts for a wide range of tasks Ying et al. (2021); Ranasinghe and Zampieri (2020); Yang et al. (2020) and, as shown in this paper, can even surpass them. Careful sampling strategies and techniques such as the Translation Language Model (TLM) can alleviate the "curse of multilinguality" (Lample and Conneau, 2019), but we show improvements without additional data augmentation (translation pairs) and with simple uniform sampling.

6 Conclusions

In this paper we present a Conditional Generative Matching model (CGM) for retrieval-based suggested replies. CGM provides not only relevance gains (15%) but also substantial improvements in diversity (80%). While CGM clearly advances the state of the art for modeling multi-lingual RS systems, it also illustrates that, through proper model choices and training optimizations, a multilingual model can surpass and replace monolingual models. This is important for both industry and academia, and suggests that similar strategies could be applied across diverse tasks, which is the subject of future work.


  • J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn (2020) The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media, Vol. 14, pp. 830–839. Cited by: §4.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, and S. Bengio (2016) Generating sentences from a continuous space. In CoNLL, Cited by: §3.5.
  • Y. Burda, R. Grosse, and R. Salakhutdinov (2016) Importance weighted autoencoders.. In ICLR, Cited by: §3.5.
  • K. Cao and D. Yogatama (2020) Modelling latent skills for multitask language generation. arXiv preprint arXiv:2002.09543. Cited by: §5.
  • A. Celikyilmaz, E. Clark, and J. Gao (2020) Evaluation of text generation: a survey. arXiv preprint arXiv:2006.14799. Cited by: §4.
  • S. Chaidaroon and Y. Fang (2017) Variational deep semantic hashing for text documents. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 75–84. Cited by: §5.
  • X. Chen, A. Hassan, H. Hassan, W. Wang, and C. Cardie (2019) Multi-source cross-lingual model transfer: learning what to share. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3098–3112. Cited by: §5.
  • X. Chen, Y. Sun, B. Athiwaratkun, C. Cardie, and K. Weinberger (2018) Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics 6, pp. 557–570. Cited by: §5.
  • Y. Chen and M. de Rijke (2018) A collective variational autoencoder for top-n recommendation with side information. In Proceedings of the 3rd Workshop on Deep Learning for Recommender Systems, pp. 3–9. Cited by: §5.
  • Z. Chi, L. Dong, F. Wei, N. Yang, S. Singhal, W. Wang, X. Song, X. Mao, H. Huang, and M. Zhou (2021) InfoXLM: an information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3576–3588. Cited by: §5.
  • J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord (2019) Unsupervised speech representation learning using wavenet autoencoders. In IEEE/ACM Transactions on Audio, Speech, and Language Processing, External Links: Link Cited by: §5.
  • R. Cipolla, Y. Gal, and A. Kendall (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7482–7491. External Links: Document Cited by: §3.5.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020) Unsupervised cross-lingual representation learning at scale. In ACL, External Links: Document Cited by: §4.3, §5.
  • B. Deb, P. Bailey, and M. Shokouhi (2019) Diversifying reply suggestions using a matching-conditional variational autoencoder. In NAACL-HLT, Cited by: §1, §2, §3.1, §5, §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §2, §4.
  • N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan (2017) Deep unsupervised clustering with gaussian mixture variational autoencoders. External Links: 1611.02648 Cited by: §5.
  • B. Eikema and W. Aziz (2019) Auto-encoding variational neural machine translation. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Florence, Italy, pp. 124–141. External Links: Link, Document Cited by: §5.
  • H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin (2019) Cyclical annealing schedule: a simple approach to mitigating KL vanishing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 240–250. External Links: Link, Document Cited by: §3.5.
  • M. Henderson, R. Al-Rfou, B. Strope, Y. Sung, L. Lukács, R. Guo, S. Kumar, B. Miklos, and R. Kurzweil (2017) Efficient natural language response suggestion for smart reply. arXiv preprint arXiv:1705.00652. Cited by: §1, §2, §2.
  • M. Henderson, I. Vulić, D. Gerz, I. Casanueva, P. Budzianowski, S. Coope, G. Spithourakis, T. Wen, N. Mrkšić, and P. Su (2019) Training neural response selection for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5392–5404. Cited by: §2.
  • J. R. Hershey and P. A. Olsen (2007) Approximating the Kullback Leibler divergence between Gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, Vol. 4, pp. IV–317–IV–320. External Links: Document Cited by: §3.4.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. In ICLR, External Links: Link Cited by: §3.5.
  • N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 2790–2799. Cited by: §5.
  • L. Huang, H. Ji, and J. May (2019) Cross-lingual multi-level adversarial transfer to enhance low-resource name tagging. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3823–3833. Cited by: §5.
  • E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In ICLR, External Links: Link Cited by: §3.4.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. Cited by: §4.3, §5.
  • A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov (2016) FastText.zip: compressing text classification models. arXiv preprint arXiv:1612.03651. Cited by: §4.
  • A. Kannan, K. Kurach, S. Ravi, T. Kaufmann, A. Tomkins, B. Miklos, G. S. Corrado, L. Lukács, M. Ganea, P. Young, and V. Ramavajjala (2016) Smart Reply: Automated Response Suggestion for Email. In KDD, Cited by: §1, §2.
  • D. P. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. ICLR. Cited by: §1.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §1, §5, §5.
  • D. B. Lee, D. Min, S. Lee, and S. J. Hwang (2021) Meta-gmvae: mixture of gaussian vae for unsupervised meta-learning. In ICLR, External Links: Link Cited by: §5.
  • J. Lee, R. Tang, and J. Lin (2019) What would elsa do? freezing layers during transformer fine-tuning. arXiv preprint arXiv:1911.03090. Cited by: §4.
  • C. Li, X. Gao, Y. Li, X. Li, B. Peng, Y. Zhang, and J. Gao (2020) Cited by: §5.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. External Links: Link Cited by: §4.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2020) Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2), pp. 318–327. External Links: Document Cited by: §3.5.
  • M. E. Peters, S. Ruder, and N. A. Smith (2019) To tune or not to tune? adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 7–14. Cited by: §4.
  • J. Phang, I. Calixto, P. M. Htut, Y. Pruksachatkun, H. Liu, C. Vania, K. Kann, and S. Bowman (2020) English intermediate-task training improves zero-shot cross-lingual transfer too. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 557–575. Cited by: §5.
  • T. Ranasinghe and M. Zampieri (2020) Multilingual offensive language identification with cross-lingual embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5838–5844. Cited by: §5.
  • D. Rao, F. Visin, A. A. Rush, Y. W. Teh, R. Pascanu, and R. Hadsell (2019) Continual unsupervised representation learning. In NeurIPS, External Links: Link Cited by: §5.
  • L. Shang, Z. Lu, and H. Li (2015) Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1577–1586. Cited by: §1.
  • X. Shen, H. Su, Y. Li, W. Li, S. Niu, Y. Zhao, A. Aizawa, and G. Long (2017) A Conditional Variational Framework for Dialog Generation. In ACL, Cited by: §5.
  • Y. Shi, N. Siddharth, B. Paige, and P. Torr (2019) Variational mixture-of-experts autoencoders for multi-modal deep generative models. In NeurIPS, Cited by: §5.
  • K. Swanson, L. Yu, C. Fox, J. Wohlwend, and T. Lei (2019) Building a production model for retrieval-based chatbots. In Proceedings of the First Workshop on NLP for Conversational AI, Florence, Italy, pp. 32–41. External Links: Link, Document Cited by: §2.
  • Q. H. Tran, G. Haffari, and I. Zukerman (2017) A Generative Attentional Neural Network Model for Dialogue Act Classification. In ACL, Cited by: §5.
  • Z. Wang, Z. C. Lipton, and Y. Tsvetkov (2020a) On negative interference in multilingual language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4438–4450. Cited by: §4.3, §5.
  • Z. Wang, Y. Tsvetkov, O. Firat, and Y. Cao (2020b) Gradient vaccine: investigating and improving multi-task optimization in massively multilingual models. In ICLR, Cited by: §3.5.
  • L. Wei and Z. Deng (2017) A variational autoencoding approach for inducing cross-lingual word embeddings. In IJCAI, External Links: Link Cited by: §5.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link Cited by: §4.
  • L. Yang, N. Cheung, J. Li, and J. Fang (2019) Deep clustering by gaussian mixture variational autoencoders with graph embedding. In ICCV, External Links: Link Cited by: §5.
  • Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Abrego, S. Yuan, C. Tar, Y. Sung, B. Strope, and R. Kurzweil (2020) Multilingual universal sentence encoder for semantic retrieval. In ACL, External Links: Link Cited by: §5.
  • Q. Ying, P. Bajaj, B. Deb, Y. Yang, W. Wang, B. Lin, M. Shokouhi, X. Song, Y. Yang, and D. Jiang (2021) Language scaling for universal suggested replies model. In NAACL-HLT, Industrial Track, Cited by: §2, §2, §2, §5.
  • W. Yu, L. Wu, Q. Zeng, S. Tao, Y. Deng, and M. Jiang (2020) Crossing variational autoencoders for answer retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5635–5641. External Links: Link, Document Cited by: §5.
  • B. Zhang, D. Xiong, J. Su, H. Duan, and M. Zhang (2016) Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 521–530. External Links: Link, Document Cited by: §5.
  • M. Zhang, W. Wang, B. Deb, G. Zheng, M. Shokouhi, and A. H. Awadallah (2021) A dataset and baselines for multilingual reply suggestion. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 1207–1220. External Links: Link, Document Cited by: §4.
  • T. Zhao, R. Zhao, and M. Eskénazi (2017) Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. In ACL, Cited by: §5.
  • X. Zhou, D. Dong, H. Wu, S. Zhao, D. Yu, H. Tian, X. Liu, and R. Yan (2016) Multi-view response selection for human-computer conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 372–381. External Links: Link, Document Cited by: §2.
  • X. Zhou, L. Li, D. Dong, Y. Liu, Y. Chen, W. X. Zhao, D. Yu, and H. Wu (2018) Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1118–1127. External Links: Link, Document Cited by: §2.

Appendix A Text Samples from Model Predictions

A.1 Relevance and Diversity

We manually created sample messages in EN and predicted responses with different models: Matching in Figure 5, CGM in Figure 6, and CGM-M in Figure 7.

In terms of relevance, differences are hard to notice on such a small sample, but overall the predictions from the Matching model are less relevant than those from CGM. The difference in diversity is much clearer: predictions from Matching contain many near-duplicates, with some responses differing only by a punctuation mark. While these can easily be de-duplicated using simple heuristics, the results show that the Matching model inherently ranks very similar responses at the top. The CGM models, in contrast, show considerable diversity in responses without reducing their relevance.
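The kind of simple de-duplication heuristic mentioned above can be sketched as follows. This is only an illustrative post-processing step, not part of the models in this paper: the function names `normalize` and `dedupe_ranked` are hypothetical, and the normalization (lowercasing and stripping ASCII punctuation) is an assumption that would need extending for languages with other punctuation.

```python
import string

# Translation table that deletes ASCII punctuation (assumption: non-ASCII
# punctuation, e.g. Japanese or Spanish marks, is not handled here).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(reply: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(reply.lower().translate(_PUNCT_TABLE).split())


def dedupe_ranked(replies, k=3):
    """Keep the highest-ranked reply for each normalized form.

    `replies` is assumed to be sorted from best to worst score.
    Returns at most `k` replies in their original ranking order.
    """
    seen = set()
    kept = []
    for reply in replies:
        key = normalize(reply)
        if key and key not in seen:
            seen.add(key)
            kept.append(reply)
        if len(kept) == k:
            break
    return kept


ranked = ["Thanks!", "thanks", "Thanks.", "You're welcome", "No problem"]
top3 = dedupe_ranked(ranked, k=3)
```

Such filtering only hides the symptom: the underlying ranker still assigns its top scores to near-identical responses, which is the behavior the CGM models improve on.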

We also see that some of the responses are overly specific and not relevant, and that some are rude or mildly inappropriate. This shows the issues with using responses from the Reddit dataset without careful curation (the MRS dataset filters inappropriate responses but cannot completely eliminate them without human curation). Production systems usually rely on human-curated response sets, which can address these issues.

Figure 8: Samples of Spanish messages with Spanish replies predicted by the CGM-M model.
Figure 9: Samples of Japanese messages with Japanese replies predicted by the CGM-M model.
Figure 10: Samples of English messages with Spanish replies predicted by the CGM-M model. While the quality is not as good as when the input message is in Spanish, the generally close match between the intents of the messages and the responses illustrates the cross-lingual ability of the model.
Figure 11: Samples of German messages with English replies predicted by the CGM-M model. While the quality is not as good as when the input message is in German, the generally close match between the intents of the messages and the responses illustrates the cross-lingual ability of the model.
Figure 12: Samples of Japanese messages with English replies predicted by the CGM-M model. The quality here is clearly poorer than for German to English, perhaps because EN and JA are not as closely related. However, we still see a generally close match between the intents of the messages and the responses.

A.2 Multi-lingual Behavior

Next we look at the multilingual ability of CGM. We translate the same set of messages used for EN and predict responses for them, so that quality can be compared more directly across languages.

We present the predictions for ES in Fig. 8 and JA in Fig. 9. We see that the responses are relevant and diverse in these languages and thus CGM performs adequately in languages other than EN.

A.3 Cross-lingual Behavior

Finally, we investigate the cross-lingual nature of the CGM model, in order to understand whether the multi-lingual models share representations and learning across languages.

In Figure 10 we use EN messages and force the model to predict responses from the ES set. Surprisingly, such a system is able to select relevant results in the target language. While the quality is not as good as in the same-language setting, it is interesting to see that such cross-lingual prediction works quite well.

In Figure 11 we use messages in German and predict with English responses. Again the results are quite acceptable, which may be expected as English and German are closely related languages. To examine a less closely related pair, we look at Japanese messages with predictions in English in Figure 12. Here the quality is much worse, but we still see some match with the overall intent of the messages, suggesting useful cross-lingual representations even for weakly related languages.
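The cross-lingual setup described in this section, scoring a message in one language against a response set restricted to another language, can be sketched as below. This is a simplified illustration under stated assumptions: the vectors are toy values standing in for encoder outputs, relevance is a plain dot product rather than the CGM scoring, and `top_k_in_language` and the candidate tuple layout are hypothetical names introduced here.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def top_k_in_language(message_vec, candidates, target_lang, k=3):
    """Rank only the candidate replies tagged with `target_lang`.

    `candidates` is a list of (reply_text, language_code, reply_vector)
    tuples. The message itself may be in any language; only the response
    set is restricted, which is what forces a cross-lingual prediction.
    """
    scored = [
        (dot(message_vec, vec), text)
        for text, lang, vec in candidates
        if lang == target_lang
    ]
    # Sort by score only, highest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy example: an English message vector scored against Spanish replies.
message_vec = [1.0, 0.0]
candidates = [
    ("Thanks!", "en", [0.9, 0.1]),
    ("Gracias", "es", [0.8, 0.2]),
    ("No sé", "es", [0.1, 0.9]),
    ("Claro", "es", [0.5, 0.5]),
]
es_replies = top_k_in_language(message_vec, candidates, "es", k=2)
```

In this sketch the restriction works only if messages and responses from different languages are embedded in a shared space, which is the property the figures in this section probe.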