Text generation has been significantly improved with deep learning approaches in tasks such as language modeling (bengio2003neural; mikolov2010recurrent), machine translation (sutskever2014sequence; Bahdanau:nmt:iclr; vaswani2017attention), and dialog generation (sordoni2015neural). All these models include a final softmax layer to yield words. The softmax layer takes a context state ($h$) from an upstream network such as RNN cells as input, and transforms $h$ into word probabilities with a linear projection ($W$) and an exponential activation. Each row of $W$ can be viewed as the embedding of a word. Essentially, softmax conducts embedding matching with inner-product scoring between a calculated context vector $h$ and the word embeddings $W$ in the vocabulary.
The above commonly adopted setting for softmax imposes a strong hypothesis on the embedding space: it assumes that each word corresponds to a single vector, and that the context vector from the decoding network must be indiscriminately close to the desired word embedding vector under a certain distance metric. We find that this assumption does not hold in practice. Fig. 4 visualizes examples of the context vectors for utterances containing the examined words, calculated from the BERT model (devlin2018bert). We make three interesting observations. a) Multi-sense: Not every word’s context vectors form a single cluster. There are words with multiple clusters (Fig. 4(b)). b) Varying-variance: The variances of context vectors vary significantly across clusters. Some words correspond to smaller variances while others to larger variances (Fig. 4(c)). c) Robustness: There are outliers in the context space (Fig. 4(b)). These observations explain the ineffectiveness of training with the traditional softmax. The traditional approach pulls a word's embedding toward an ill-placed center of all context vectors of that word, even though they might belong to multiple clusters. At the same time, the variances of different words are completely ignored in the plain softmax with inner product as the similarity score. It is also vulnerable to outliers, since a single anomaly can drag the word embedding far from the main cluster. In short, the softmax layer does not have sufficient expressive capacity.
yang2018breaking propose Mixture-of-Softmax (MoS) to enhance the expressiveness of softmax. It replaces a single softmax layer with a weighted average of softmax layers. However, all words share the same fixed number of components and averaging weights, which heavily restricts MoS’s capacity. Furthermore, the variances of context vectors are not taken into consideration.
In this paper, we propose KerBS, a novel approach to learn text embeddings for generation. KerBS avoids the above softmax issues by introducing a Bayesian composition of multiple embeddings and a learnable kernel to measure the similarities among embeddings. Instead of a single embedding, KerBS explicitly represents a word with a weighted combination of multiple embeddings, each of which is regarded as a “sense” (since there is no direct supervision, an embedding vector does not necessarily correspond to a semantic sense). The number of embeddings is automatically learned from the corpus as well. We design a family of kernel functions to replace the embedding matching (i.e., the matrix-vector dot product) in the softmax layer. With parameters learned from text, each word (or “sense”) can have an individual variance in its embedding space. In addition, the kernel family is more robust to outliers than Gaussian kernels.
We conduct experiments on a variety of text generation tasks including machine translation, language modeling, and dialog generation. The empirical results verify the effectiveness of KerBS. Ablation study indicates that each part of KerBS, including the Bayesian composition and the kernel function, is necessary for the performance improvement. We also find that words with more semantic meanings are allocated with more sense embeddings, which adheres to our intuition.
2 Related work
Word Embeddings. Word2Vec (mikolov2013distributed) and GloVe (pennington2014glove) learn distributed word representations from corpora in an unsupervised way. However, only one embedding is assigned to each word, which not only ignores polysemy but also cannot provide context-dependent word embeddings. Recent works (AlecOpenAI; peters2018deep; devlin2018bert) indicate that pre-trained contextualized word representations are beneficial for downstream natural language processing tasks. BERT (devlin2018bert) pre-trains a masked language model with a deep bidirectional Transformer and achieves state-of-the-art performance in various NLP tasks.
Multi-Sense Word Embeddings. Early works obtain multi-sense word embeddings by first training single-point word embeddings and then clustering the context embeddings (for example, the average embedding of neighboring words). However, these methods are not scalable and require considerable effort in parameter tuning (reisinger2010multi; huang2012improving). tian2014probabilistic introduce a probabilistic model, which uses a variable to control the sense selection of each word. liu2015topical add a topic variable for each word and condition word embeddings on the topic variable. Both tian2014probabilistic and liu2015topical can be easily integrated into the Skip-Gram model (mikolov2013distributed), which is highly efficient. Other works (chen2014unified; jauhar2015ontologically; chen2015improving; wu2015sense) further improve the performance of multi-sense embeddings by making use of large resources such as WordNet (miller1995wordnet) and Wikipedia. However, these works mainly focus on text understanding rather than text generation.
Word Embedding as a Distribution. In order to represent the semantic breadth of each word, vilnis2015 model each word as a distribution in the embedding space and, unlike mikolov2013distributed, use the KL-divergence of the embedding distributions to measure the similarities between words. To improve the numerical stability of Gaussian word embeddings, especially when comparing very close or very distant distributions, sunEtAl2018 propose to replace KL-divergence with Wasserstein distance. Though Gaussian word embeddings perform well in word-level tasks such as similarity and entailment detection, they cannot be directly applied to text generation, because it is difficult to perform embedding matching between Gaussian word embeddings and output embeddings, which are usually single points in the embedding space.
3 Background
Most text generation models generate words through an embedding matching procedure. Intuitively, at each step, upstream networks such as RNN decoders compute a context vector $h$ according to the encoded information from the input and the previously generated words. The context vector $h$ serves as a query to search for the most similar match among the pre-calculated vocabulary embeddings $W$. In practice, this is implemented with an inner product between $W$ and $h$. Normalized probabilities over all words are computed with the softmax function. Words with the highest probabilities are chosen during inference.
Specifically, given an utterance $\hat{Y}$, a GRU decoder calculates as follows:
$$e_t = \mathrm{Lookup}(W, \hat{y}_{t-1}), \tag{1}$$
$$h_t = \mathrm{GRU}(h_{t-1}, e_t), \tag{2}$$
$$P(y_t = i) = \mathrm{Softmax}(W h_t)_i. \tag{3}$$
At time step $t$, its word embedding $e_t$ is obtained by looking up the previous output word $\hat{y}_{t-1}$ in the word embedding matrix $W = [w_1, w_2, \ldots, w_V]$ (Eq. (1)). Here $w_i$ is the embedding of the $i$-th word in the vocabulary, and $V$ is the vocabulary size. The context embedding $h_t$ of the $t$-th step is obtained from the GRU by combining the information of $h_{t-1}$ and $e_t$ (Eq. (2)). Other decoders such as Transformer (vaswani2017attention) work similarly.
Eq. (3) performs embedding matching between $h_t$ and $W$, and the probabilities of words are obtained by a softmax activation. Intuitively, to generate the correct word $\hat{y}_t$, the context embedding $h_t$ should lie in a small neighborhood around $\hat{y}_t$’s word embedding $w_{\hat{y}_t}$.
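The embedding matching step described above can be illustrated with a minimal sketch in plain Python. The function names, the toy two-dimensional embeddings, and the context vector are assumptions made for illustration; a real decoder would produce the context vector with a GRU or Transformer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_probabilities(context, embeddings):
    """Score every word by the inner product between the context vector
    and that word's embedding, then normalize with softmax."""
    scores = [sum(h * w for h, w in zip(context, row)) for row in embeddings]
    return softmax(scores)


# Toy vocabulary of three words, one embedding (row) per word.
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
context = [2.0, 0.5]  # hypothetical context vector from the decoder

probs = word_probabilities(context, embeddings)
best = max(range(len(probs)), key=probs.__getitem__)
```

The word whose embedding has the largest inner product with the context vector receives the highest probability, which is why a single embedding per word forces all context vectors of that word toward one point.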
4 Proposed KerBS
In this section, we first introduce KerBS for text generation. It is designed according to the three observations mentioned in the introduction: multi-sense, varying-variance, and robustness. Then we provide a training scheme to dynamically allocate senses since it is difficult to directly learn the number of senses of each word.
4.1 Model Structure
KerBS assumes that the space of context vectors for the same word consists of several geometrically separate components. Each component represents a “sense”, with its own variance. To better model their distribution, we replace Eq. (3) with the following equations:
$$P(y_t = i) = \sum_{j=1}^{M_i} P(s_t = \langle i, j \rangle), \tag{4}$$
$$P(s_t = \langle i, j \rangle) = \mathrm{Softmax}\big(\mathcal{K}_\theta(h_t, W)\big)_{\langle i, j \rangle} = \frac{\exp\big(\mathcal{K}_{\theta_i^j}(h_t, w_i^j)\big)}{\sum_{k} \sum_{r=1}^{M_k} \exp\big(\mathcal{K}_{\theta_k^r}(h_t, w_k^r)\big)}. \tag{5}$$
Here, $s_t$ is the sense index of step $t$. It takes the value $\langle i, j \rangle$, corresponding to the $j$-th sense of the $i$-th word in the vocabulary. $M_i$ is the number of senses for word $i$, which may differ across words. Instead of directly calculating the probabilities of words, KerBS first calculates the probabilities of all senses belonging to a word and then sums them up to get the word probability.
The probability of an output sense in Eq. (5) is not a strict Gaussian posterior, as the training of Gaussian models in high dimensional space is numerically unstable. Instead, we propose to use a carefully designed kernel function to model the distribution variance of each sense. Concretely, we replace the inner product in Eq. (3) with a kernel function $\mathcal{K}$, which depends on a variance-related parameter $\theta$. $\mathcal{K}_\theta(h_t, W)$ is a simplified notation containing all pairs of kernel values $\mathcal{K}_{\theta_i^j}(h_t, w_i^j)$. With a different $\theta_i^j$ for each sense, we can model the variances of their distributions separately.
4.1.1 Bayesian Composition of Embeddings
In this part, we introduce in detail how KerBS models the multi-sense property of words. Intuitively, we use a Bayesian composition of embeddings in text generation because the same word can have totally different meanings. For words with more than one sense, their corresponding context vectors can usually be divided into separate clusters (see Figure 4). If we use single-embedding models such as the traditional softmax to fit these clusters, the word embedding will converge to the mean of these clusters and could be distant from all of them. This may lead to poor performance in text generation.
As shown in Eq. (4), we can allocate different embeddings for each sense. We first obtain the sense probabilities by performing weight matching between the context vector $h_t$ and the sense embedding matrix $W$. Then we add up the sense probabilities belonging to each word to get the word probabilities.
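The two-step computation described above (normalize over all senses, then sum the senses of each word) can be sketched as follows. The dictionary layout keyed by `(word, sense)` pairs, the function names, and the example scores are hypothetical; the matching scores would come from the kernel between the context vector and each sense embedding.

```python
import math


def sense_probabilities(scores):
    """Softmax over every (word, sense) pair in the vocabulary."""
    top = max(scores.values())
    exps = {key: math.exp(value - top) for key, value in scores.items()}
    total = sum(exps.values())
    return {key: value / total for key, value in exps.items()}


def word_probabilities(sense_probs):
    """Sum the probabilities of all senses that belong to the same word."""
    totals = {}
    for (word, _sense), prob in sense_probs.items():
        totals[word] = totals.get(word, 0.0) + prob
    return totals


# Hypothetical matching scores: "bank" has two senses, "river" has one.
scores = {("bank", 0): 1.2, ("bank", 1): 0.9, ("river", 0): 0.4}

sense_probs = sense_probabilities(scores)
word_probs = word_probabilities(sense_probs)
```

Because normalization happens once over all senses, a word with several well-matching senses accumulates probability mass from each of them.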
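We adopt the weight tying scheme (inan2016tying), where the decoding embedding and the input embedding are shared. Since $W$ is a matrix of sense embeddings, it cannot be directly used in the decoding network for the next step as in Eq. (1). Instead, we obtain the embedding $e_t$ by calculating the weighted sum of sense embeddings according to their conditional probabilities. Assume that $\hat{y}_{t-1} = i$ is the input word at step $t$; then
$$e_t = \sum_{j=1}^{M_i} P(s_{t-1} = \langle i, j \rangle \mid y_{t-1} = i)\, w_i^j = \sum_{j=1}^{M_i} \frac{P(s_{t-1} = \langle i, j \rangle)}{\sum_{k=1}^{M_i} P(s_{t-1} = \langle i, k \rangle)}\, w_i^j.$$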
4.1.2 Embedding Matching with Kernels
To calculate the probability of each sense, it is very straightforward to introduce Gaussian distributions in the embedding space. However, it is difficult to learn a Gaussian distribution for embeddings in a high dimensional space, for the following reasons. Context vectors are usually distributed on low dimensional manifolds embedded in a high dimensional space. Using an isotropic Gaussian distribution to model embedding vectors on low dimensional manifolds may lead to serious instability. Assume that, in a $d$-dimensional space, the distribution of the context vectors follows $\mathcal{N}(0, \sigma_1^2 I)$ in a $d_1$-dimensional subspace. We build a model $\mathcal{N}(0, \sigma^2 I)$ to fit the embedding points. But there are often some noisy outliers, which are assumed to be distributed uniformly in a cube with edge length 1 centered at the origin. Then the average squared distance between an outlier and the origin is $d/12$, which increases linearly with $d$. The log-likelihood to maximize can be written as:
$$\mathcal{L} = \sum_{x \in X} \log\left[ (\sqrt{2\pi}\,\sigma)^{-d} \exp\left( -\frac{x^\top x}{2\sigma^2} \right) \right] = \sum_{x \in X} \left( -d \log \sigma - \frac{x^\top x}{2\sigma^2} \right) + \mathrm{const},$$
where $X$ is the set of data points including outliers. Denote the proportion of outliers in $X$ as $\alpha$. Since $\mathbb{E}[x^\top x]$ equals $d_1 \sigma_1^2$ for points generated by the oracle and $d/12$ for outliers, $\mathcal{L}$ is dominated by outliers when $d$ is large. The optimal $\sigma$ approximately equals $\sqrt{\big(\alpha d / 12 + (1 - \alpha) d_1 \sigma_1^2\big) / d}$. With large $d$, the optimal $\sigma \approx \sqrt{\alpha / 12}$, which is independent of the real variance $\sigma_1^2$. As expected, we find that directly modeling the Gaussian distributions does not work well in our preliminary experiments.
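The two quantities used in this argument can be checked with a short derivation. The symbols below ($d$ for the dimension, $x_k$ for the coordinates of an outlier, $X$ for the data set, $\sigma$ for the standard deviation of the fitted model) are assumed names for illustration, and the fitted model is taken to be a zero-mean isotropic Gaussian.

```latex
% Outlier uniform in the unit cube centred at the origin: x_k ~ U(-1/2, 1/2)
\mathbb{E}\left[x_k^2\right] = \int_{-1/2}^{1/2} u^2 \, du = \frac{1}{12},
\qquad
\mathbb{E}\left[\lVert x \rVert^2\right]
  = \sum_{k=1}^{d} \mathbb{E}\left[x_k^2\right] = \frac{d}{12}.

% Zero-mean isotropic Gaussian log-likelihood, up to an additive constant
\mathcal{L}(\sigma) = \sum_{x \in X}
  \left( -d \log \sigma - \frac{\lVert x \rVert^2}{2\sigma^2} \right),
\qquad
\frac{\partial \mathcal{L}}{\partial \sigma} = 0
\;\Longrightarrow\;
\sigma^2 = \frac{1}{d\,|X|} \sum_{x \in X} \lVert x \rVert^2 .
```

The fitted variance is therefore the mean squared norm divided by $d$. The outlier contribution to that mean grows linearly with $d$, while the contribution of points confined to a fixed low dimensional subspace does not, so for large $d$ the estimate is driven by the outliers rather than by the true spread of the data.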
Therefore we design a kernel function to model embedding variances, which can be learned more easily than a Gaussian mixture model. Specifically, we replace the inner product, which can be regarded as a fixed kernel over the whole space, with a kernel function $\mathcal{K}_\theta$.
Here $\theta$ is a parameter controlling the embedding variances of each sense and $a$ is a normalization factor. When $\theta \to 0$, $\mathcal{K}_\theta$ degenerates to the common inner product. As shown in Figure 7, with a small $\theta$, embeddings are concentrated in a small region, while a large $\theta$ leads to a flat kernel. Finally, the parameters for the $i$-th word are the pairs $(w_i^j, \theta_i^j)$, where $w_i^j$ and $\theta_i^j$ are the embedding and kernel parameter of sense $\langle i, j \rangle$. Intuitively, in the original space with inner product similarity, the density of probability mass is uniformly distributed. But $\mathcal{K}_\theta$ distorts the probabilistic space, making the variances of context vectors differ over different senses.
the gradient of each is bounded for fixed . It results from the continuity of when and the fact that , when . As a result, a small proportion of outliers or noise points will not have a major impact on training stability.
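To make the role of the variance parameter concrete, the sketch below assumes one kernel form that is consistent with the properties stated in this section: a function of the cosine between the context vector and the sense embedding, scaled by both norms, with a normalization factor, that reduces to the inner product as the parameter approaches zero. The specific formula, $|h||e|\,a\,(\exp(-\theta\cos(h,e)) - 1)$ with $a = -\theta / (2(\exp(-\theta) + \theta - 1))$, is an assumption for illustration, as are the function name and the test vectors.

```python
import math


def kernel(h, e, theta):
    """Assumed kernel: |h||e| * a * (exp(-theta * cos) - 1).

    Falls back to the plain inner product for |theta| < 1e-4, which is the
    limit of the expression as theta -> 0. Both vectors must be non-zero.
    """
    dot = sum(x * y for x, y in zip(h, e))
    norm = math.sqrt(sum(x * x for x in h)) * math.sqrt(sum(y * y for y in e))
    cos = dot / norm
    if abs(theta) < 1e-4:
        return dot
    # Assumed normalization factor; the denominator is positive for theta != 0.
    a = -theta / (2.0 * (math.exp(-theta) + theta - 1.0))
    return norm * a * (math.exp(-theta * cos) - 1.0)


h = [1.0, 0.0]
aligned = [1.0, 0.0]                   # cosine 1 with h
tilted = [0.5, math.sqrt(3.0) / 2.0]   # cosine 0.5 with h

# How strongly the kernel prefers the aligned embedding over the tilted one.
sharp_ratio = kernel(h, aligned, -5.0) / kernel(h, tilted, -5.0)
flat_ratio = kernel(h, aligned, 5.0) / kernel(h, tilted, 5.0)
```

Under this assumed form, a small (negative) parameter yields a sharply peaked kernel that strongly prefers well-aligned context vectors, while a large parameter yields a nearly flat kernel over positive cosines, matching the qualitative behaviour described above.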
4.2 Training Scheme
It is difficult to empirically determine the sense numbers of each word, which is a very large set of hyper-parameters. Also, properties of the same word may vary among different corpora and tasks. So we design a training scheme for KerBS, which includes dynamic sense allocation. Instead of providing the sense number for each word, we only need to input the total sense number. The algorithm will automatically allocate senses.
Details of the training scheme are shown in Algorithm 1. Specifically, to obtain parameters for both KerBS and the upstream network, which outputs the context vectors, the whole process consists of allocation and adaptation phases. Before training, $W$ and $\theta$ are initialized by a random matrix and a random vector, respectively. We randomly allocate senses to words. After initialization, we first turn to the adaptation phase. Given a sequence $\hat{Y}$ in the training set, at step $t$, we get the context vector $h_t$ from the upstream network. Then sense and word probabilities are calculated by Eq. (5) and Eq. (4), respectively. Afterwards, we calculate the log-probability $\mathcal{L}$ of generating $\hat{y}_t$, and we maximize $\mathcal{L}$ by tuning $W$, $\theta$ and the parameters of the upstream network:
$$\mathcal{L} = \sum_t \log P(y_t = \hat{y}_t).$$
During the adaptation phase, KerBS learns $w_i^j$, the sense embedding vector, and $\theta_i^j$, the indicator of distribution variance.
During the allocation phase, we remove redundant senses and reallocate them to poorly predicted words. To determine which senses to remove and which words need more senses, we record moving averages of each word’s log prediction accuracy and of each sense’s usage, with updating rate $\beta$. For a word $i$, if after several epochs its averaged log prediction accuracy is consistently lower than a threshold $\epsilon$, we consider the senses currently allocated to $i$ insufficient. We then delete the least used sense and reallocate it to $i$. We alternately perform adaptation and reallocation until convergence.
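The allocation step described above can be sketched as follows. The exponential form of the moving average, the names `update_average` and `reallocate_one`, the dictionary layout, and the rule that a donor word must keep at least one sense are all assumptions for illustration; the text only states that the least used sense is moved to a poorly predicted word.

```python
def update_average(old, new, beta):
    """Exponential moving average with updating rate beta (assumed form)."""
    return (1.0 - beta) * old + beta * new


def reallocate_one(log_acc, usage, threshold):
    """Move the least used sense to the worst-predicted word.

    log_acc: word -> moving average of log prediction accuracy
    usage:   word -> list of moving-average usage values, one per sense
    Returns (donor_word, receiver_word), or None if nothing was moved.
    """
    needy = [word for word, value in log_acc.items() if value < threshold]
    if not needy:
        return None
    receiver = min(needy, key=log_acc.get)
    # Assumption: only words with more than one sense may give one up.
    donors = [
        (value, word, index)
        for word, values in usage.items()
        if word != receiver and len(values) > 1
        for index, value in enumerate(values)
    ]
    if not donors:
        return None
    _, donor, index = min(donors)
    del usage[donor][index]
    usage[receiver].append(0.0)  # the new sense starts with no recorded usage
    return donor, receiver


log_acc = {"the": -0.5, "bank": -4.0, "change": -1.0}
usage = {"the": [0.6, 0.01], "bank": [0.2], "change": [0.1, 0.09]}
moved = reallocate_one(log_acc, usage, threshold=-3.0)
```

In this toy run, "bank" is the only word below the threshold, and the rarely used second sense of "the" is handed over to it, so the total number of senses stays fixed.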
4.3 Theoretical Analysis
In this part, we explain why KerBS has the ability to learn the complex distributions of context vectors. We only give a brief introduction to the following lemmas and leave more detailed proofs in the appendix.
Lemma 1. KerBS has the ability to learn the multi-sense property. If the real distribution of context vectors consists of several disconnected clusters, KerBS will learn to represent as many clusters as possible.
Proof. Each cluster of word $i$’s context vectors attracts $i$’s KerBS sense embeddings, drawing these embeddings nearer in order to increase the log-likelihood $\mathcal{L}$. However, if a cluster has already been represented by a KerBS sense, its attraction to the embeddings of other senses gets weaker, so they will converge to other clusters. Instead of gathering together in a few clusters, senses will try to represent as many clusters of the context vectors’ distribution as possible. ∎
Lemma 2. KerBS has the ability to learn variances of embedding distribution. For distributions with larger variances, KerBS learns larger $\theta$.
Proof. The optimized $\theta$ is a solution of the equation $\partial \mathcal{L} / \partial \theta = 0$. We only need to explain that, when the variance of $h$ grows, the solution of the equation gets larger. ∎
5 Experiments
In this section, we empirically validate the effectiveness of KerBS. We will first set up the experiments, and then give the experimental results in Section 5.2.
We test KerBS on several text generation tasks, including:
Machine Translation (MT) is conducted on IWSLT’16 De→En, which contains 196k sentence pairs for training.
Language modeling (LM) is included to test the unconditional text generation performance. Following previous work, we use 300k, 10k and 30k subsets of the One-Billion-Word Corpus for training, validation and testing, respectively.
Dialog generation (Dialog) is also included. We employ the DailyDialog dataset (li2017dailydialog) for the experiment, after removing the overlap between the training and test sets.
Note that these text generation tasks emphasize different aspects. MT tests the ability to transfer semantics across languages. LM tests whether KerBS can generally help generate more fluent sentences. Dialog generation additionally requires some prior knowledge to generate good responses, which makes it the most challenging task.
For LM, we use perplexity (PPL) to measure performance. For MT and Dialog, we measure the generation quality with BLEU-4 and BLEU-1 scores (papineni2002bleu). Human evaluation is also included for Dialog. During human evaluation, 3 volunteers are asked to label Dialog data containing 50 sets of sentences. Each set contains the input sentences as well as the output responses generated by KerBS and the baseline models. Volunteers are asked to score the responses according to their fluency and relevance to the corresponding questions (see the detailed scoring standard in the appendix). After the responses are labeled, we calculate the average score of each method. A t-test is then performed to test the null hypothesis that KerBS is not better than the baseline methods.
5.1 Implementation Details
For LM, we use GRU language model (chung2014empirical) as our testbed. We try different sets of parameters, including RNN layers, hidden sizes and embedding dimensions. The model that performs best with traditional softmax is chosen as the baseline.
For MT and Dialog, we implement the attention-based sequence-to-sequence model (Seq2Seq; Bahdanau:nmt:iclr) as well as Transformer (vaswani2017attention) as our baselines. For Seq2Seq, (hidden size, embedding dimension) is set to (512, 256) and (1024, 512) for MT and Dialog, respectively. For Transformer, (hidden size, embedding dim, dropout, layer num, head num) is set to (288, 507, 0.1, 5, 2) for both MT and Dialog, following lee2018deterministic. All models are trained on sentences with up to 80 words. We set the batch size to 128 and the beam size to 5 for decoding. For both German and English, we first tokenize sentences with the Moses tokenizer (Koehn2007MosesOS). Then BPE (BPE) is applied to segment each word into subwords.
Adam (KingmaB14:adam:iclr) is adopted as our optimization algorithm. We start to decay the learning rate when the loss on the validation set stops decreasing. For LM, we set the initial learning rate to 1.0 and the decay rate to 0.8. For MT and Dialog, the initial learning rate is 5e-4 and the decay rate is 0.5.
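The learning rate schedule described above can be sketched as a simple rule that multiplies the rate by the decay factor whenever the validation loss fails to improve. The exact trigger (for example, any patience window) is not specified in the text, so the rule below, the function name, and the example losses are assumptions; only the initial rate 5e-4 and the decay rate 0.5 are taken from the MT and Dialog setting.

```python
def decayed_rates(val_losses, initial_lr, decay):
    """Return the learning rate in effect after each validation check.

    Assumed rule: decay once every time the validation loss does not
    improve on the best value seen so far.
    """
    lr = initial_lr
    best = float("inf")
    rates = []
    for loss in val_losses:
        if loss < best:
            best = loss
        else:
            lr *= decay
        rates.append(lr)
    return rates


# Hypothetical validation losses over five checks.
rates = decayed_rates([3.0, 2.5, 2.6, 2.4, 2.4], initial_lr=5e-4, decay=0.5)
```

With these example losses, the rate is halved at the third and fifth checks, where the loss fails to improve on the best value so far.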
5.2 Results of Text Generation
|Tasks||Metrics||Seq2Seq||Seq2Seq + MoS (yang2018breaking)||Seq2Seq + KerBS|
|Tasks||Metrics||Transformer||Transformer + MoS (yang2018breaking)||Transformer + KerBS|
Machine Translation For machine translation, KerBS achieves higher BLEU-4 scores on Seq2Seq (+1.37) and Transformer (+1.29). In contrast, the performance gain of MoS is not significant, and on Transformer it is not even as good as the vanilla model. Cases of MT on Transformer are shown in Table 3.
|Source||meine gebildete Mutter aber wurde Lehrerin.|
|Transformer||my foster mother was a teacher.|
|+ MoS||and my educated mother was a teacher.|
|+ KerBS||but my educated mother became a teacher.|
|Source||man erreicht niemals eine Gemeinde mit Ideen, man setzt sich mit den Einheimischen zusammen.|
|Transformer||you never achieve a community with ideas; you put together with local people.|
|+ MoS||you never get a community with ideas, you’re putting together with indigenous people.|
|+ KerBS||you never get to a community with ideas, and you sit with the local people.|
Language Model As expected, KerBS achieves lower PPL (102.17) on LM compared with both MoS (102.72) and traditional softmax (103.12). Although it introduces more parameters, KerBS does not lead to overfitting. On the contrary, the increased complexity in KerBS helps the model to better capture the information of the embedding space.
Dialogue Generation We also include results of dialog generation. Unlike tasks where source and target sentences are highly aligned, dialog generation may need some prior knowledge for obtaining good responses. Moreover, the multi-modality of the generated sentences is a serious problem in Dialog. We expect that the more expressive structure of KerBS could help. Since the performance of Transformer is not comparable to that of Seq2Seq on dialog generation, we focus on Seq2Seq in this part. KerBS achieves a BLEU-1 score of 17.85 on the test set, a clear improvement over the baselines. Human evaluations also confirm the effectiveness of using KerBS in dialog generation. After performing a one-tailed hypothesis test, we find that the p-value is lower than 0.05, which means that the obtained improvements on Dialog systems are statistically significant. We list some of the generated responses of different models in Table 4.
|Source||what do you mean ?|
|Seq2Seq||i mean, what s up with the and iron bars on your windows.|
|+ MoS||well, how can i put this? france is a terrible team.|
|+ KerBS||well, i mean. we always do the same thing. there s no variety in our lives.|
|Source||now , what seems to be the trouble ?|
|Seq2Seq||trouble is trouble.|
|+ MoS||yeah. and he was.|
|+ KerBS||not bad. but i have a bad cold today.|
|Source||it can t be more than fourteen days late for us to accept it .|
|Seq2Seq||it will just fine.|
|+ MoS||well, i see. have you been back to work then?|
|+ KerBS||maybe you re right. i think we should take it.|
5.3 Ablation Study
|Seq2Seq + KerBS||27.28|
|w/ only single sense||26.80|
|w/o dynamic allocation||27.00|
We perform an ablation study of three variants of KerBS on the MT task (Table 5). KerBS w/o kernel removes the kernel function from KerBS, so that distribution variances are no longer explicitly controlled. We find that it loses 0.49 BLEU points compared with the original KerBS, which indicates that explicitly expressing the distribution variances of hidden states is important and that KerBS does so well. KerBS with single sense replaces the multi-sense model with a single-sense one, which also leads to a performance decline. This further confirms our assumption that the distribution of context vectors is multi-modal. In such cases, the output layer should also be multi-modal. In KerBS w/o dynamic allocation, each word is allocated a fixed number of senses. Though it still performs better than single-sense models, it is slightly worse than the full KerBS model, which shows the necessity of dynamic allocation.
5.4 Detailed Analysis
In this part, we verify by examples that KerBS learns reasonable sense numbers and variance parameters. We draw the following conclusions.
Firstly, KerBS can learn the multi-sense property. From Table 6, we find that words with a single meaning, including some proper nouns, are allocated only one sense. But for words with more complex meanings, such as pronouns, more senses are necessary to represent them. (In our experiment, we restrict each word’s sense number to between 1 and 4, in order to keep the training stable.) In addition, we find that words with 4 senses have several distinct meanings. For instance, "change" means transformation as well as small currency.
Secondly, $\theta$ in KerBS is an indicator of words’ semantic scopes. In Figure 8 we compare the $\theta$ of 3 sets of nouns. For each set, we find that words denoting bigger concepts (such as car, animal and earth) have larger $\theta$.
5.5 Time Complexity
Compared with baselines, the computation cost of incorporating KerBS into text generation mainly lies with the larger vocabulary for embedding matching, which is only a portion of the whole computation of text generation. Empirically, when we set the total sense number to about times the vocabulary size, KerBS takes twice as long as vanilla softmax for one epoch.
6 Conclusion
Text generation requires a proper embedding space for words. In this paper, we proposed KerBS to learn better embeddings for text generation. Unlike traditional Softmax, KerBS includes a Bayesian composition of multi-sense embedding for words and a learnable kernel to capture the similarities between words. Incorporating KerBS into text generation could boost the performance of several text generation tasks, especially the dialog generation task. Future work includes proposing better kernels for generation and designing a meta learner to dynamically reallocate senses.
Acknowledgments
We would like to thank Xunpeng Huang and Yitan Li for helpful discussion and review of the first version. We also wish to thank the anonymous reviewers for their insightful comments.
Appendix A Proofs
KerBS has the ability to learn the multi-sense property. If the real distribution of context vectors is composed of several disconnected parts, KerBS components will learn to represent as many of these parts as possible.
We only prove the simplest situation under traditional inner product kernel. We assume that the real context vectors of the -th word are composed of two disconnected parts and it is also allocated with two KerBS senses. We also assume that part 1 has already been represented by sense , i.e., for in part 1. Then for the second newly allocated sense , we find
where and are context vectors in part 1 and 2, respectively. for all senses except and . As part 1 has already been well represented by sense , should be much larger than .
KerBS has the ability to learn model variances. For distributions with larger variances, KerBS learns larger .
We will only give a heuristic proof for the situation where $\theta$ is a small positive number. The proof is also done under the single-sense condition. If $\theta$ is in other intervals, the proof will be more complex, but the ideas are the same.
From the definition of ,
where is the expected output for , and we temporarily hide other parameters.
We can derive the partial derivative of with respect to :
When is small, we can approximate by the following equation:
where is abbreviated as .
Hence, when gets smaller, tends to increase, since when and is usually positive when . So when distribution variance increases, tends to decrease, because context vectors are farther from the mean vector. As a result, will increase. ∎
Appendix B Experiment Details
Scoring Standard for Human Evaluation
The volunteers are asked to score responses generated by all models according to the following standard:
Score 0: a response that is neither fluent nor relevant to the input question.
Score 1: a response that is either fluent or relevant to the input question, but not both.
Score 2: a response that is both fluent and relevant to the input question.