Fast and Simple Mixture of Softmaxes with BPE and Hybrid-LightRNN for Language Generation

by   Xiang Kong, et al.

Mixture of Softmaxes (MoS) has been shown to be effective at addressing the expressiveness limitation of Softmax-based models. Despite the known advantage, MoS is practically sealed by its large consumption of memory and computational time due to the need of computing multiple Softmaxes. In this work, we set out to unleash the power of MoS in practical applications by investigating improved word coding schemes, which could effectively reduce the vocabulary size and hence relieve the memory and computation burden. We show both BPE and our proposed Hybrid-LightRNN lead to improved encoding mechanisms that can halve the time and memory consumption of MoS without performance losses. With MoS, we achieve an improvement of 1.5 BLEU scores on IWSLT 2014 German-to-English corpus and an improvement of 0.76 CIDEr score on image captioning. Moreover, on the larger WMT 2014 machine translation dataset, our MoS-boosted Transformer yields 29.5 BLEU score for English-to-German and 42.1 BLEU score for English-to-French, outperforming the single-Softmax Transformer by 0.8 and 0.4 BLEU scores respectively and achieving the state-of-the-art result on WMT 2014 English-to-German task.


page 1

page 2

page 3

page 4


WeChat Neural Machine Translation Systems for WMT21

This paper introduces WeChat AI's participation in WMT 2021 shared news ...

Korean-English Machine Translation with Multiple Tokenization Strategy

This work was conducted to find out how tokenization methods affect the ...

Attention Is All You Need

The dominant sequence transduction models are based on complex recurrent...

Improving Neural Language Modeling via Adversarial Training

Recently, substantial progress has been made in language modeling by usi...

Low-Memory Neural Network Training: A Technical Report

Memory is increasingly often the bottleneck when training neural network...

Training Neural Machine Translation (NMT) Models using Tensor Train Decomposition on TensorFlow (T3F)

We implement a Tensor Train layer in the TensorFlow Neural Machine Trans...

The Evolved Transformer

Recent works have highlighted the strengths of the Transformer architect...


Sequence-to-Sequence model (seq2seq) (Sutskever, Vinyals, and Le, 2014; Bahdanau, Cho, and Bengio, 2014)

has led to significant research progress on language generation over the last few years. A typical seq2seq model employs an auto-regressive factorization of the joint distribution and outputs the conditional probability of each token given the previous tokens. A standard approach to calculate the conditional probability is to apply the Softmax function over the logits.

Though seq2seq models with a standard Softmax output function are largely effective, Yang et al. (2018) show that the standard Softmax formulation limits the expressiveness of the generation model and results in the Softmax bottleneck. They propose Mixture of Softmaxes (MoS) to address this issue and demonstrate improved performances on language modeling. However, MoS poses a non-negligible burden on the computation time and the memory consumption. Specifically, MoS outputs a weighted average of Softmax components, where computing each Softmax involves a huge dot-product between the hidden state and the embedding matrix, costing a considerable amount of time and memory.

To address the aforementioned drawbacks, a natural idea is to improve the time- and memory-efficiency of computing each Softmax. On a high level, we aim at an encoding mechanism of the vocabulary so that each word can be represented as a code sequence. Then, computing a single Softmax reduces to the product of a sequence of conditional code distributions. Given a code dictionary size, the number of possible words that can be represented increases exponentially w.r.t. the code sequence length, while the computation and memory cost only increases linearly. Hence, such an encoding scheme can theoretically reduce the time and memory consumption exponentially. Clearly, some encoding schemes must have better statistical properties than others and thus lead to better empirical performances. Ideally, the encoding could be learned directly from the data.

In this work, we investigate two algorithms for these purposes: The first one is called Hybrid-LightRNN, which learns a encoding mechanism from the data based on the language modeling objective. The other one is Byte Pair Encoding (BPE) (Gage, 1994; Sennrich, Haddow, and Birch, 2016), which was originally proposed to help with translating rare words. When evaluated on machine translation (MT) and image captioning, both of these approaches can effectively reduce the time and memory consumption of MoS with no performance losses. Specifically, utilizing MoS brings a performance gain of up to BLEU scores on IWSLT 2014 German to English and CIDEr scores on image captioning. On WMT 2014 machine translation benchmarks, we achieve a BLEU score of on English-to-German and on English-to-French, leading to a state-of-the-art result on the WMT 2014 English-to-German task.

Our contribution is two-fold. Firstly, we propose to use Hybrid-LightRNN and BPE to make MoS time- and memory-efficient. Secondly, we demonstrate the empirical effectiveness of MoS on sentence generation by improved results on machine translation and image captioning.

Background: Mixture of Softmaxes

Mixture of Softmaxes (MoS) (Yang et al., 2018) is introduced to address the expressiveness limitations of Softmax-based models. In this section, we briefly review the motivation and the formulation of MoS.

With the autoregressive factorization, a generation model estimates the distribution of the next token

given the context . In language modeling, the context is composed of previous words of . In conditional generation tasks such as MT or image captioning, the context also contains the source sentence or the image. Let denote the ground-truth distribution of the next token given context

. Then the standard Softmax function computes the probability distribution



is the context vector or the RNN hidden state and

is the word embedding.

Softmax Bottleneck

Yang et al. (2018) show the expressiveness limitation of the Softmax function from a matrix factorization perspective. Specifically, suppose that the number of valid contexts is finite. We list all contexts as . Let denote the log probability of the ground-truth distribution, the word embedding matrix and the context representation matrix respectively, where is the number of contexts, is the vocabulary size and is the dimensionality of the embedding vector and the context vector. In other words, .

Let denote all matrices obtained by applying row-wise shifting to . Since all matrices in result in the same probability distribution due to the normalization term in the Softmax, the Softmax function can output the ground-truth distribution if and only if the factorization approximate any matrix in .

However, in language generation tasks, matrices in cannot be approximated by because of the differences in their matrix ranks. More specifically, the rank of is limited by the embedding vector dimensionality . In comparison, as shown in  Yang et al. (2018), and any other matrices within have similar high ranks since different contexts result in highly different probability distributions of the next token. Consequently, the ground-truth distribution cannot be approximated by the Softmax distribution , which results in the Softmax Bottleneck.


To tackle the Softmax bottleneck problem, MoS formulate the distribution as the weighted average of Softmax components:


where is the mixture weight of the -th Softmax component and is the -th context vector. On language modeling, it has been shown empirically that such a formulation leads to a high rank matrix. Note that since all Softmaxes share the same word embedding matrix, the number of parameters do not increase rapidly with more mixtures, preventing overfitting.

The mixture weight and the context vectors are computed as


where denotes a vector representation of the context . and denote the parameters of the mixture weight and the parameters of the context vector with a slight abuse of notation.

In our machine translation experiments, the attention model 

(Bahdanau, Cho, and Bengio, 2014) is employed to obtain an context vector of the source sentence. is obtained by passing the concatenation of the context vector and the RNN hidden state through an MLP. In the captioning case, the decoder is a vanilla RNN and the vector representation is the decoder’s hidden state.

Time and Memory Cost

As shown in Eqn. 1, MoS computes Softmaxes and output the weighted average of the probability distributions. Though MoS effectively increases the expressiveness of a generation model, it also incurs a large time and memory cost since it needs to perform Softmax operations on the whole vocabulary. The time and memory costs not only hinder rapid algorithm developments but also limit the mixture number when resources are limited, restricting the power of MoS.

Encoding Words for Efficient MoS

In this section, we introduce two word encoding algorithms to reduce the memory and time consumptions of MoS. We aim to obtain an encoding mechanism of each word where the number of potential codes is much smaller than the vocabulary size. In theory, given a code dictionary, the number of possible words that can be represented increases exponentially w.r.t. the code sequence length, while the computation cost only increases linearly. Then a generation model is trained to output a code sequence to generate a sentence. By decomposing words into shared codewords, the Softmax in the generation model only needs to be computed over the code dictionary. However, the encoding function need to be optimized to reflect semantic correlations between words since the semantic representations of words are shared through the embeddings of the codes.

In this section, we first introduce the optimal transport problem which provides a useful framework for optimizing encoding functions. Then we introduce two algorithms used to learn the encoding function, namely, Hybrid-LightRNN and BPE, where Hybrid-LightRNN learns the encoding function by optimizing the language modeling likelihood while BPE provides an encoding function based on subword frequency.

Background: Learning Encoding Mechanisms Using Optimal Transport

We first provide an optimal transport (OT) (Peyré and Cuturi, 2017) perspective of learning the encoding mechanism. Broadly speaking, optimal transport is the assignment problem between probability distributions. In the case of learning encoding mechanisms, the probability distributions are simply the delta distribution for each word and each code sequence. We define the following Wasserstein distance between the word space and the code sequence space.


where enumerates words in the vocabulary and is the set of all code sequences. is an indicator function of whether word is assigned to code sequence . The constrains over ensures that each word is only mapped to a code sequence and that each code sequence is only mapped to a word. Hence a valid would naturally result in a desired bijection mapping. is the cost of assigning word to code sequence and is the overall cost and the optimization objective. For simplicity, we assume that the number of possible code sequences is equal to the vocabulary size, since we can always add unused tokens to the vocabulary and assign them to redundant code sequences.


As mentioned earlier, the encoding function should be learned so that words can effectively share semantics through common codes. More importantly, since the encoding function is used in language generation tasks, it is desirable to have encoded sequences that are easy to model by current RNN-based models. As language modeling can be used to measure the difficulty of modeling the code sequences, we propose Hybrid-LightRNN that optimizes the encoding function according to the language modeling objective.

To compute the probability of a sentence under the current encoding function, we replace each sentence by its code sequence and compute the log probability of its corresponding code sequence. Formally, let be the encoding function which maps a word into a code sequence . The log probability of a sentence is as follows when an encoding function is employed:


where is the concatenation of code sequences of the context and is the parameters of a neural language model. Then the optimal encoding function is defined as:


where is the training corpus.


Ideally, for each encoding function , we would like to find the optimal language modeling cost. However, it is too computationally heavy to enumerate the combinatorial possibilities of encoding function and evaluate the language modeling performance. Instead, we would like to jointly optimize the encoding function and the language model parameters . However, the encoding function is represented by discrete parameters, hence we resort to an approximated algorithm. The high-level idea of the approximated algorithm is to iteratively optimize one of the language model parameters and the encoding function while keeping the other one fixed. Since all language model parameters in are fully differentiable, we can simply utilize SGD to optimize them. Then, the core difficulty lies in the step of optimizing the discrete parameters of , during which, ideally, we want the following two properties to hold

  • The encoding function remains valid. In other words, the mapping between words and code sequences remains bijections.

  • The language modeling objective function is decreased.

At first glance, this optimization problem seems intractable since there are combinatorially many possible

. However, since finding the optimal mapping is naturally an assignment problem, we can rely on existing algorithms of optimal transport if we can approximate the language modeling loss function by the Wasserstein distance defined in Eqn.

3. The key idea here is to decompose the corpus level likelihood to the encoding decisions of each word. More specifically, in the language modeling objective, for each word, we are measuring the likelihood of its current code sequence for each occurrence in the training data. Naturally, we can define the cost of assigning the word to the corresponding code sequences by the likelihood of other code sequences.

Formally, the cost of assigning word to code sequence can be defined as where is the indicator function. Here, since the context is encoded by the original encoding function, we implicitly assume the independence between the costs of different words’ mapping. We further assume the independence between codes and approximate as to avoid evaluating the language model for times. Finally, we obtain the cost function as follows:


Note that, when we use the original encoding function, i.e., , the optimal transport objective equals to the current language modeling likelihood, i.e., . Hence the Wasserstein distance will always be lower than the current language modeling cost. However, because of the independence assumptions, the language modeling loss is not guaranteed to decrease after optimizing the encoding function.

A canonical solution to the optimal transport problem is the minimum cost maximum flow (MCMF) algorithm (Ahuja et al., 1993). However, the computation complexity of the MCMF is . Following LightRNN (Li et al., 2016), we adopt an -approximation algorithm (Preis, 1999), which has a complexity of . With the approximation algorithm, the time consumption of solving the optimal transport problem only constitutes a small proportion of the whole LightRNN algorithm when taking the time of training the neural language model into account.

Increasing the Capacity for Frequent Words

Although the algorithm does not require the maximum code sequence length to be small, in our experiments, we set to since the dictionary size can already be reduced to if the first code and the second code can take values respectively. However, if we only uses number of codes to model all words in the vocabulary, though the efficiency is improved greatly, the capacity of the model is hurt significantly, since each word is forced to share embeddings with words which share the first code or the second code with it.

As a result, the encoding function should assign exclusive codes to important words. Since frequent words have a large impact on the overall performance, we set the encoding function so that the codes of the most frequent words are not shared with other words. Specifically, for word , we manually specify their code sequences to be a length sequence . For all other words, their code sequences do not contain code and are learned using the optimal transport objective.

Since the code sequence has maximum length two, we can use the following table to represent the encoding function, where the row and column denotes the first and the second respectively:

where the matrix is a sparse diagonal matrix to which frequent words are assigned. is a dense matrix learned through optimal transport. To fit words into the table, the dimensions of and should satisfy .

LightRNN (Li et al., 2016) is a special case of Hybrid-LightRNN where they do not model frequent words separately. We will show in the experiments that it is very important to model frequent words separately. Furthermore, LightRNN defines the dimension of the first code to be equal to the dimension of the second code for best efficiency. However, when one dimension is larger, the model can have more embedding vectors and has a larger capacity, which also results in an encoding mechanism similar to the hierarchical Softmax (Morin and Bengio, 2005).

Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) (Gage, 1994; Sennrich, Haddow, and Birch, 2016) was introduced to address the difficulties of translating rare words and out-of-vocabulary words in machine translation. BPE is of interest here since it can reduce the vocabulary size effectively and can speedup the computation of Softmaxes.

In the encoding learned by the BPE, each code is a subword. Formally, BPE learns the code dictionary as follows: We initialize the code dictionary as the set of all possible characters and break all words into sequences of codes. Then we iteratively run the following steps to add new codes to the dictionary:

  • Count the frequency of all code pairs within training data. Find out the most frequent pair/bigram of codes and .

  • Add the new code to the dictionary. Replace all occurrence of pair with .

  • End the iteration if the dictionary size reaches a threshold. Otherwise go to step .

BPE is an algorithm based on heuristics. However, the strong inductive bias of BPE always gives more capacity to frequent words when it comes to the tradeoff between efficiency and capacity if we vary the subword unit dictionary size, since the more frequent words will be segmented into fewer parts, which will lead to more exclusive embeddings instead of shared embeddings.

When we use a larger code dictionary, more frequent words and subwords are added to the dictionary and their semantics are modeled by separate embedding vectors, leading to a larger model capacity. On the other hand, the model efficiency is improved with a smaller subword dictionary.

Related Work

Apart from the previously mentioned related works, mixture of Softmaxes is closely related to works that mix representation vectors (Eigen, Ranzato, and Sutskever, 2013; Shazeer et al., 2017).  Yang et al. (2018) show that this approach does not solve the softmax bottleneck problem.

Hierarchical Softmax (Morin and Bengio, 2005) is an extensively studied technique to improve the efficiency of Softmaxes. Morin and Bengio (2005) uses the synsets in the WordNet to build the hierarchical tree. Mnih and Hinton (2009); Shen et al. (2017) propose to learn the hierarchical tree with a clustering algorithm. The idea of separately modeling frequent words is also explored in Adaptive Softmax (Grave et al., 2016)

. Although hierarchical Softmax can reduce the time and memory consumptions during training, it still requires computing the Softmax over the whole vocabulary during testing. Noise Contrastive Estimation 

(Gutmann and Hyvärinen, 2012; Mnih and Teh, 2012) and Negative Sampling (Mikolov et al., 2013) can also speed up Softmax during training.


Machine Translation (IWSLT) Image Captioning (MSCOCO)
Model # Softmaxes BLEU BLEU-4 METEOR CIDEr
Baseline 1 27.41 0.15 29.64 0.20 23.60 0.12 88.50 0.47
Hybrid-LightRNN-MoS 9 28.79 0.23 30.02 0.16 23.87 0.18 88.96 0.21
BPE-MoS 9 28.91 0.06 30.06 0.10 24.00 0.24 89.26 0.11
Table 1: Overall performance comparisons on IWSLT and MSCOCO
Wu et al. (2016) 26.30 41.16
Shazeer et al. (2017) 26.03 40.56
Gehring et al. (2017) 26.43 41.62
Vaswani et al. (2017) 28.4 41.8
Ahmed, Keskar, and Socher (2017) 28.9 41.4
Dehghani et al. (2018) 28.9 N/A
Shaw, Uszkoreit, and Vaswani (2018) 29.2 41.5
Ott et al. (2018) 29.3 43.2
Our Transformer (Base) 27.4 38.2
Transformer-MoS (Base) 28.0 39.1
Our Transformer (Big) 28.7 41.7
Transformer-MoS (Big) 29.5 42.1
Table 2: Experiment results on the WMT 2014 English-German (EN-DE) and English-French (EN-FR) where Transformer-MoS denotes the Transformer model with MoS.
# Softmaxes Model Machine Translation (IWSLT) Image Captioning (MSCOCO)
Memory Speed BLEU Memory Speed BLEU-4
1 Baseline 5.18 40.38 27.41 0.15 0.96 13.29 29.64 0.20
Hybrid-LightRNN-Baseline 3.61 33.57 27.43 0.21 0.73 10.35 29.69 0.15
BPE-Baseline 3.52 32.78 27.37 0.17 0.70 9.96 29.68 0.12
3 MoS 10.24 89.94 28.42 0.14 1.39 25.66 29.91 0.14
Hybrid-LightRNN-MoS 5.37 49.19 28.36 0.11 1.01 15.49 29.93 0.14
BPE-MoS 5.27 46.83 28.47 0.16 0.99 15.03 29.96 0.08
Table 3: Memory and time efficiency comparisons on MT and image captioning when using the same number of Softmaxes. Bold faces highlight the best in the corresponding category. The shown memory is in GB and the speed is in ms/batch.
Memory (GB) Speed (ms/batch) Model # Mixtures BLEU
5.3 45.5 Baseline 27.41 0.15
Hybrid-LightRNN-MoS 28.36 0.11
BPE-MoS 28.47 0.16
10.4 90.4 MoS 28.42 0.14
Hybrid-LightRNN-MoS 28.79 0.23
BPE-MoS 28.91 0.06
Table 4: Comparisons on IWSLT under the same memory and time budget. The Softmaxes size is number of Softmaxes Softmax dictionary size. Bold faces highlight the best in the corresponding category
Memory (GB) Speed (ms/batch) Model Softmaxes Size BLEU-4 METEOR CIDEr
1.0 14.6 MoS 29.64 0.20 23.60 0.12 88.50 0.47
Hybrid-LightRNN-MoS 29.93 0.14 23.74 0.29 88.52 0.18
BPE-MoS 29.96 0.08 23.67 0.34 88.61 0.27
4.3 26.4 MoS 29.91 0.14 23.69 0.20 88.84 0.66
Hybrid-LightRNN-MoS 30.02 0.16 23.87 0.18 88.96 0.21
BPE-MoS 30.06 0.10 24.00 0.24 89.26 0.11
Table 5: Comparisons on image captioning under the same memory and time budget. The Softmaxes size is number of Softmaxes Softmax dictionary size. Bold faces highlight the best ones in each category.
row words
45 700 3.3 28 19 7 86 35
48 around between by into down for off
54 mined imaged advised pickled outfitted filled withheld
91 bristol chinatown rochester kingston guangdong guangzhou chongqing
93 pursuing posing proposing reacting replacing blogging pointing
Table 6: Example mapping table where the row denotes the first code and the column denotes the second code. Numbers and places are grouped together in row 45, 91. Syntactically similar words are also grouped together in row 48, 54 and 93.

In this section, we describe our experiments on machine translation and image captioning and study our models quantitatively and qualitatively.

Experiment Settings

Machine Translation

We first evaluate our models on the IWSLT 2014 German to English (DE-EN) dataset  (Cettolo et al., 2014).

We employ an LSTM (Hochreiter and Schmidhuber, 1997) seq2seq model with the dot-product attention (Bahdanau, Cho, and Bengio, 2014; Luong, Pham, and Manning, 2015)

as the baseline. We build the baseline using the PyTorch code from

Dai, Xie, and Hovy (2018). For Hybrid-LightRNN, we set to and set and to to represent a total of

words. As model performances exhibit small variances on IWSLT, we run each experiment for five times with different random seeds and report the average performance and the standard deviation.

We also test our best model on the standard WMT 2014 English-to-German (EN-DE) and English-to-French (EN-FR) benchmarks, consisting of and sentence pairs respectively. We follow the preprocessing steps of ConvS2S (Gehring et al., 2017) for the EN-FR task. We employ BPE with merge operations for both tasks. The Transformer model (Vaswani et al., 2017) is employed as our baseline. Our configuration largely follows the configuration of  Vaswani et al. (2017). Specifically, we test both the Base configuration and Big configuration, which respectively have embeddings of dimension 512 and 1024, the dimension of the inner layer 2048 and 4096 and the number of attention heads 8 and 16. We used the Adam optimizer (Kingma and Ba, 2014) with , , and . We set the mixture number to . We use the corpus-level BLEU score (Papineni et al., 2002)

as the evaluation metric. Our Transformer code is based on an open source toolkit THMUT 

(Zhang et al., 2017).

Image Captioning

We conduct experiments on the MSCOCO dataset (Lin et al., 2014) and follow the same preprocessing procedure and the train/validation/test split as used in Karpathy and Fei-Fei (2015). We use the Neural Image Caption (NIC) model (Vinyals et al., 2015) as the baseline model. Following  Dai, Xie, and Hovy (2018), we employ a pretrained 101-layer ResNet (He et al., 2016) instead of a GoogLeNet to extract a feature vector from an input image. We employ an LSTM of size as the decoder. We report BLEU-4, METEOR and CIDERr scores using the scripts provided by Chen et al. (2015).

Experiment Details

For the BPE and Hybrid-LightRNN, we set the code dictionary sizes to for IWSLT and for MSCOCO. We measure the speed and memory usage on a Titan X with PyTorch version v0.3.1 and CUDA 9.0.

Mapping Table Table Size Learned Table BLEU
Hybrid-LightRNN Yes 30.07
Hybrid-LightRNN 29.69
Hybrid-LightRNN 28.73
LightRNN Yes 27.39
Frequency table No 25.84
Random table 24.98
Table 7: Average validation BLEU on IWSLT of Hybrid-LightRNN-MoS using different mapping tables.
Model OOV Translation BLEU
BPE-MoS Yes 30.19
BPE-MoS No 30.14
Hybrid-LightRNN-MoS No 30.07
Table 8: Ablation study on the importance of translating OOV words. The BLEU score are evaluated on the validation set of IWSLT.

Main Results

In our experiments, we denote Hybrid-LightRNN-MoS and BPE-MoS as the seq2seq models with MoS which employ Hybrid-LightRNN and BPE respectively. The baseline seq2seq model without MoS is denoted as Baseline.

Overall Performances on IWSLT and MSCOCO

We show the comparison between a standard LSTM seq2seq model with the Hybrid-LightRNN-MoS and BPE-MoS in Tab. 1. Hybrid-LightRNN-MoS and BPE-MoS both outperform the baseline on both tasks. Specifically, on machine translation, BPE-MoS can outperform the baseline by a BLEU score of . On image captioning, BPE-MoS outperforms the baseline by , and in terms of BLEU-4, METEOR and CIDEr respectively.

This experiment shows that MoS can effectively improve the expressiveness of generation models by learning a high-rank log probability matrix. As expected, the improvement is larger on MT than on image captioning, which can be explained by the differences of language complexities used in these two tasks. Specifically, on image captioning, the captions largely share similar patterns, resulting in a lower-rank probability matrix and a smaller improvement space.

Performances on WMT 14 EN-DE and EN-FR

Since BPE is better than Hybrid-LightRNN with a small margin on IWSLT, we only test BPE-MoS on WMT. As shown in Tab. 2, we achieve and BLEU scores respectively on WMT 14 EN-DE and EN-FR, improving the Transformer model by and BLEU scores. We achieve the state-of-the-art result that does not employ data augmentation on WMT 14 EN-DE. Note that data augmentation can also effectively improve the machine translation performance (Edunov et al., 2018).

Memory and Time Efficiency

We study the memory consumption and efficiency of Hybrid-LightRNN-MoS and BPE-MoS. As shown in Tab. 3, when applying on the Baseline model and the MoS model, BPE and Hybrid-LightRNN can reduce the time and memory usage with no performance losses. In addition, on MT where the vocabulary is large, they can halve the time and memory consumption when applied on MoS with mixtures. When there are more mixutres, the improvements will continue to grow since computing Softmaxes take a larger proportion of time.

Comparisons under the Same Computation Budget

When computational resources are limited, BPE and LightRNN enable the use of more Softmaxes, leading to potentially higher rank probability matrices. Hence, we study the performances of BPE-MoS, Hybrid-LightRNN-MoS and MoS given the same computation budget. As shown in Tab. 4 and Tab. 5, BPE-MoS and Hybrid-LightRNN-MoS consistently outperform the baseline and the MoS model.


In this section, we perform extensive studies to better understand our models.

Number of Softmaxes

Since a larger mixture number would likely to lead to a higher rank log probability matrix, we verify whether a larger mixture number leads to a better performance. We vary the number of mixture in the BPE-MoS model and compare their performances on MT. As shown in Fig. 1, more Softmax components clearly lead to better performances. However, the improvement margin exhibits a diminishing return effect, which means that several Softmaxes are enough to learn a high-rank matrix.

Figure 1: BPE-MoS’s average validation performance over multiple runs on IWSLT with various numbers of mixture components.

Hybrid-LightRNN Ablation Study

We further study the importance of the learned table and the importance of the model’s capacity in Hybrid-LightRNN. Firstly, we vary the dictionary size to investigate whether it is necessary to give enough capacity to frequent words.

As shown in Tab. 7, larger dictionary sizes consistently lead to better performances. Secondly, when compared with LightRNN, Hybrid-LightRNN achieves an improvement of BLEU score, which shows that it is necessary to employ extra capacities for frequent words. Thirdly, as a sanity check of whether the table learning is necessary, we compare the table learned by LightRNN with the table obtained by simply sorting words based on their frequency and the table with random word allocations. The table learned by LightRNN outperforms models with the random table or the frequency-based table by BLEU scores of and respectively, which means that optimizing a language modeling objective learns an effective encoding function.

Is BPE-MoS better because of modeling OOVs?

As indicated in Tab. 1, BPE-MoS is slightly better than Hybrid-Light-MoS on MT and image captioning. In principle, both Hybrid-LightRNN and BPE can model the semantics of all frequent words and rare words in the training set by sharing embeddings with other words. Hybrid-LightRNN may be improved in future works but one exclusive advantage of BPE is the ability to generate out-of-vocabulary (OOV) words. A natural question to ask is “how much performance difference would OOVs cause?” To investigate the importance of modeling OOVs, We take the best BPE-MoS model, replace all generated OOV words with UNK and test its performance.

The comparison is shown in Tab. 8. Removing the generated OOV words do lead to a performance decrease of BLEU score. However, when both Hybrid-LightRNN and BPE are disabled from translating OOV, BPE is still better than Hybrid-LightRNN by a gap of BLEU score. This result indicates that the encoding function learned by BPE better captures the data statistics than the encoding learned by Hybrid-LightRNN, showing that Hybrid-LightRNN has a lot of potentials for improvements.

Mapping Table Qualitative Study

In Hybrid-LightRNN, words in the same column/row share the same column/row embedding vector. Intuitively, it is important to group semantically-similar or syntactically-similar words into the same column/row. We examine whether the learned table have this property in Tab. 6. We find that most words within the same row are either semantically-similar or syntactically-similar to each other.

Conclusions and Discussions

In this work, we investigate two algorithms, i.e., Byte Pair Encoding and Hybrid-LightRNN, to reduce the vocabulary size so as to improve the memory- and time-efficiency of MoS. We evaluate these two methods on machine translation and image captioning and show improved performances over the baseline system without MoS. Further, both of these methods effectively speed up the training process and reduce the memory consumption of MoS with no performance losses. We demonstrate the effectiveness of our models by improved performances on machine translation and image captioning.

In our future work, we would like to study jointly training the encoding function and a task-specific generation model instead of training the encoding function using a task-agnostic model.