Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Subword segmentation is widely used to address the open vocabulary problem in machine translation. The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting the rare ones into multiple tokens. While multiple segmentations are possible even with the same vocabulary, BPE splits words into unique sequences; this may prevent a model from better learning the compositionality of words and being robust to segmentation errors. So far, the only way to overcome this BPE imperfection, its deterministic nature, was to create another subword segmentation algorithm (Kudo, 2018). In contrast, we show that BPE itself incorporates the ability to produce multiple segmentations of the same word. We introduce BPE-dropout, a simple and effective subword regularization method based on and compatible with conventional BPE. It stochastically corrupts the segmentation procedure of BPE, which leads to producing multiple segmentations within the same fixed BPE framework. Using BPE-dropout during training and the standard BPE during inference improves translation quality by up to 3 BLEU compared to BPE and by up to 0.9 BLEU compared to the previous subword regularization.
Using subword segmentation has become the de facto standard in Neural Machine Translation Bojar et al. (2018); Barrault et al. (2019). Byte Pair Encoding (BPE) Sennrich et al. (2016) is the dominant approach to subword segmentation. It keeps the common words intact while splitting the rare and unknown ones into a sequence of subword units. This potentially allows a model to make use of morphology, word composition and transliteration. BPE effectively deals with the open-vocabulary problem and is widely used due to its simplicity.
There is, however, a drawback of BPE in its deterministic nature: it splits words into unique subword sequences, which means that for each word a model observes only one segmentation. Thus, a model is likely not to reach its full potential in exploiting morphology, learning the compositionality of words and being robust to segmentation errors. Moreover, as we will show further, subwords into which rare words are segmented end up poorly understood.
A natural way to handle this problem is to enable multiple segmentation candidates. This was initially proposed by Kudo (2018) as a subword regularization – a regularization method, which is implemented as an on-the-fly data sampling and is not specific to NMT architecture. Since standard BPE produces single segmentation, to realize this regularization the author had to propose a new subword segmentation, different from BPE. However, the introduced approach is rather complicated: it requires training a separate segmentation unigram language model, using EM and Viterbi algorithms, and forbids using conventional BPE.
In contrast, we show that BPE itself incorporates the ability to produce multiple segmentations of the same word. BPE builds a vocabulary of subwords and a merge table, which specifies which subwords have to be merged into a bigger subword, as well as the priority of the merges. During segmentation, words are first split into sequences of characters, then the learned merge operations are applied to merge the characters into larger, known symbols, till no merge can be done (Figure 1(a)). We introduce BPE-dropout – a subword regularization method based on and compatible with conventional BPE. It uses a vocabulary and a merge table built by BPE, but at each merge step, some merges are randomly dropped. This results in different segmentations for the same word (Figure 1(b)). Our method requires no segmentation training in addition to BPE and uses standard BPE at test time, therefore is simple. BPE-dropout is superior compared to both BPE and Kudo (2018) on a wide range of translation tasks, therefore is effective.
Our key contributions are as follows:
We introduce BPE-dropout – a simple and effective subword regularization method;
We show that our method outperforms both BPE and previous subword regularization on a wide range of translation tasks;
We analyze how training with BPE-dropout affects a model and show that it leads to a better quality of learned token embeddings and to a model being more robust to noisy input.
In this section, we briefly describe BPE and the concept of subword regularization. We assume that our task is machine translation, where a model needs to predict the target sentence Y given the source sentence X, but the methods we describe are not task-specific.
To define a segmentation procedure, BPE Sennrich et al. (2016) builds a token vocabulary and a merge table. The token vocabulary is initialized with the character vocabulary, and the merge table is initialized with an empty table. First, each word is represented as a sequence of tokens plus a special end of word symbol. Then, the method iteratively counts all pairs of tokens and merges the most frequent pair into a new token. This token is added to the vocabulary, and the merge operation is added to the merge table. This is done until the desired vocabulary size is reached.
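To make the procedure concrete, below is a minimal Python sketch of BPE learning following the description above. It is an illustration, not the reference implementation; the function name, the word-frequency input format, and the "</w>" end-of-word marker are all assumptions of this sketch.

```python
# A minimal sketch of BPE vocabulary/merge-table learning (Sennrich et al., 2016).
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """word_freqs: dict mapping word -> frequency in the training corpus."""
    # Start from characters plus a special end-of-word marker.
    corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merge_table = []  # list of (left, right) pairs; position = merge priority
    vocab = {tok for word in corpus for tok in word}

    for _ in range(num_merges):
        # Count all adjacent token pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in corpus.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merge_table.append(best)
        vocab.add(best[0] + best[1])
        # Apply the new merge to every word in the corpus.
        new_corpus = {}
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return vocab, merge_table
```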
The resulting merge table specifies which subwords have to be merged into a bigger subword, as well as the priority of the merges. In this way, it defines the segmentation procedure. First, a word is split into distinct characters plus the end of word symbol. Then, the pair of adjacent tokens which has the highest priority is merged. This is done iteratively until no merge from the table is available (Figure 1(a)).
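A matching sketch of the segmentation procedure, assuming a merge_table of the form produced by the learn_bpe sketch above (again, names and details are illustrative):

```python
# Repeatedly apply the highest-priority available merge until none applies.
def bpe_segment(word, merge_table):
    priority = {pair: i for i, pair in enumerate(merge_table)}
    tokens = list(word) + ["</w>"]
    while True:
        # Find the adjacent pair with the highest priority (lowest rank).
        candidates = [(priority[p], i)
                      for i, p in enumerate(zip(tokens, tokens[1:]))
                      if p in priority]
        if not candidates:
            return tokens
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
```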
Subword regularization Kudo (2018) is a training algorithm which integrates multiple segmentation candidates. Instead of maximizing log-likelihood, this algorithm maximizes log-likelihood marginalized over different segmentation candidates. Formally,

$$\mathcal{L} = \sum_{(X, Y) \in D} \mathbb{E}_{x \sim P(x \mid X),\; y \sim P(y \mid Y)} \log P(y \mid x; \theta),$$

where x and y are sampled segmentation candidates for sentences X and Y respectively, P(x|X) and P(y|Y) are the probability distributions the candidates are sampled from, and θ is the set of model parameters. In practice, at each training step only one segmentation candidate is sampled.
Since standard BPE segmentation is deterministic, to realize this regularization Kudo (2018) proposed a new subword segmentation. The introduced approach requires training a separate segmentation unigram language model to predict the probability of each subword, the EM algorithm to optimize the vocabulary, and the Viterbi algorithm to sample segmentations.
Subword regularization was shown to achieve significant improvements over the method using a single subword sequence. However, the proposed method is rather complicated and forbids using conventional BPE. This may prevent practitioners from using subword regularization.
We show that to realize subword regularization it is not necessary to reject BPE since multiple segmentation candidates can be generated within the BPE framework. We introduce BPE-dropout – a method which exploits the innate ability of BPE to be stochastic. It alters the segmentation procedure while keeping the original BPE merge table. During segmentation, at each merge step some merges are randomly dropped with the probability p. This procedure is described in Algorithm 1.
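One possible reading of this procedure in Python is sketched below. It is a hedged approximation of Algorithm 1 rather than a transcription of it: the sketch mirrors the bpe_segment helper above, except that each candidate merge is independently dropped with probability p at every step, and segmentation stops when no surviving merge remains.

```python
import random

# A sketch of BPE-dropout segmentation: standard BPE segmentation, except that
# each candidate merge is dropped with probability p at every merge step.
def bpe_dropout_segment(word, merge_table, p=0.1, rng=random):
    priority = {pair: i for i, pair in enumerate(merge_table)}
    tokens = list(word) + ["</w>"]
    while True:
        candidates = []
        for i, pair in enumerate(zip(tokens, tokens[1:])):
            if pair in priority and rng.random() >= p:  # keep merge with prob. 1 - p
                candidates.append((priority[pair], i))
        if not candidates:
            return tokens
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
```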
If p is set to 0, the segmentation is equivalent to the standard BPE; if p is set to 1, the segmentation splits words into distinct characters. Values between 0 and 1 can be used to control the segmentation granularity.
We use p > 0 (usually p = 0.1) at training time to expose a model to different segmentations and p = 0 during inference, which means that at inference time we use the original BPE. We discuss the choice of the value of p in Section 5.
When some merges are randomly forbidden during segmentation, words end up segmented in different subwords; see for example Figure 1(b). We hypothesize that exposing a model to different segmentations may result in better understanding of the whole words as well as their subword units; we will verify this in Section 6.
Our baselines are the standard BPE and the subword regularization by Kudo (2018). Models trained with BPE-dropout use the merge table and the vocabulary built by the baseline BPE.
Subword regularization by Kudo (2018) has segmentation sampling hyperparameters l and α: l specifies how many best segmentations for each word are produced before sampling one of them, and α controls the smoothness of the sampling distribution. In the original paper, two settings of (l, α) were shown to perform best on different datasets. Since overall they show comparable results, in all experiments we use one of these settings.
Table 1: Datasets used in the main experiments and the corresponding training settings.
| Corpus | Lang. pair | Number of sentences (train/dev/test) | Voc size | Batch size | Value of p |
| IWSLT15 | En-Vi | 133k / 1553 / 1268 | 4k | 4k | 0.1 / 0.1 |
| IWSLT15 | En-Zh | 209k / 887 / 1261 | 4k / 16k | 4k | 0.1 / 0.6 |
| IWSLT17 | En-Fr | 232k / 890 / 1210 | 4k | 4k | 0.1 / 0.1 |
| IWSLT17 | En-Ar | 231k / 888 / 1205 | 4k | 4k | 0.1 / 0.1 |
| WMT14 | En-De | 4.5M / 3000 / 3003 | 32k | 32k | 0.1 / 0.1 |
| ASPEC | En-Ja | 2M / 1700 / 1812 | 16k | 32k | 0.1 / 0.6 |
We conduct our experiments on a wide range of datasets with different corpus sizes and languages; information about the datasets is summarized in Table 1. These datasets are used in the main experiments (Section 5.1) and were chosen to match the ones used in the prior work Kudo (2018). In the additional experiments (Sections 5.2-5.5), we also use random subsets of the WMT14 English-French data; in this case, we specify the dataset size for each experiment.
Prior to segmentation, we preprocess all datasets with the standard Moses toolkit (https://github.com/moses-smt/mosesdecoder). However, Chinese and Japanese have no explicit word boundaries, and the Moses tokenizer does not segment sentences into words; for these languages, subword segmentations are trained almost from unsegmented raw sentences.
In training, translation pairs were batched together by approximate sequence length. For the main experiments, the batch sizes we used are given in Table 1 (batch size is the number of source tokens). In the experiments in Sections 5.2, 5.3 and 5.4, for datasets not larger than 500k sentence pairs we use a vocabulary size and batch size of 4k, and 32k for the rest. (Such large batch sizes can be reached by using several GPUs or by accumulating the gradients for several batches and then making an update.)
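The gradient-accumulation trick mentioned above can be sketched as follows. This is a generic PyTorch-style illustration, not the authors' training code; model, optimizer, loss_fn and data_loader are hypothetical objects assumed to be defined elsewhere.

```python
# Step the optimizer only once every `accum_steps` mini-batches,
# emulating a batch that is `accum_steps` times larger.
def train_epoch(model, optimizer, loss_fn, data_loader, accum_steps=8):
    model.train()
    optimizer.zero_grad()
    for step, (src, tgt) in enumerate(data_loader, start=1):
        loss = loss_fn(model(src, tgt), tgt)
        (loss / accum_steps).backward()  # scale so the summed gradient matches a large batch
        if step % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```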
The NMT system used in our experiments is Transformer base Vaswani et al. (2017). More precisely, the number of layers is 6 with 8 parallel attention layers, or heads. The dimensionality of input and output is 512, and the inner layer of the feed-forward networks has dimensionality 2048. We use the regularization and optimization procedure as described in Vaswani et al. (2017).
To produce translations, for all models we use beam search with a beam of 4 and length normalization of 0.6.
In addition to the main results, Kudo (2018) also reports scores using n-best decoding. To translate a sentence, this strategy produces multiple segmentations of a source sentence, generates a translation for each of them, and rescores the obtained translations. While investigating different sampling and rescoring strategies could be interesting future work, in the current study we use 1-best decoding to fit in the standard decoding paradigm.
For evaluation, we average the 5 latest checkpoints and use BLEU Papineni et al. (2002) computed via SacreBLEU Post (2018) (our SacreBLEU signature is BLEU+case.lc+lang.[src-lang]-[dst-lang]+numrefs.1+smooth.exp+tok.13a+version.1.3.6). Since Japanese and Chinese have no explicit word boundaries, prior to computing BLEU we segment Chinese into distinct characters and Japanese using KyTea (http://www.phontron.com/kytea).
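Checkpoint averaging can be done by averaging parameters elementwise. A minimal PyTorch-style sketch is given below; it assumes each checkpoint file stores a flat state dict of tensors, which may differ from the exact format and script used for the paper.

```python
import torch

# Average the parameters of several saved checkpoints into a single state dict.
# `paths` would be, e.g., the files of the 5 latest checkpoints.
def average_checkpoints(paths):
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}
```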
The results are provided in Table 2. For all datasets, BPE-dropout improves significantly over the standard BPE: more than 3 BLEU for Zh-En, more than 1.5 BLEU for En-Vi, Vi-En, En-Zh, De-En, Ja-En, and 0.4-1.4 BLEU for the rest. The improvements are especially prominent for smaller datasets; we will discuss this further in Section 5.4.
Compared to Kudo (2018), among the 12 datasets we use, BPE-dropout is beneficial for 8 datasets with improvements up to 0.92 BLEU, is not significantly different for 3 datasets, and underperforms only on En-Ja. While Kudo (2018) uses another segmentation, our method operates within the BPE framework and changes only the way a model is trained. Thus, the lower performance of BPE-dropout on En-Ja and the only small or insignificant improvements for Ja-En, En-Zh and Zh-En suggest that Japanese and Chinese may benefit from a language-specific segmentation.
Note also that Kudo (2018) reports larger improvements over BPE from using their method than we show in Table 2. This might be explained by the fact that Kudo (2018) used large vocabulary sizes (16k, 32k), which have been shown to be counterproductive for small datasets Sennrich and Zhang (2019); Ding et al. (2019). While this may not be an issue for models trained with subword regularization (see Section 5.4), it causes a drastic drop in the performance of the baselines.
Table 3 shows results for using BPE-dropout only on one side of a translation pair. We select random subsets of different sizes from WMT14 En-Fr data to show how the results are affected by the amount of data.
The results indicate that using BPE-dropout on the source side is more beneficial than on the target side; for datasets not smaller than 0.5M sentence pairs, BPE-dropout can be used on the source side only. We can speculate that it is more important for the model to understand a source sentence than to be exposed to different ways of generating the same target sentence.
Since full regularization performs the best for all dataset sizes, in the subsequent experiments we use BPE-dropout on both source and target sides.
Figure 2 shows BLEU scores for the models trained with BPE-dropout with different values of p (the probability of a merge being dropped). Models trained with high values of p are unable to translate due to a large mismatch between training segmentation (which is close to character-level) and inference segmentation (BPE). The best quality is achieved with p = 0.1.
In our experiments, we use p = 0.1 for all languages except Chinese and Japanese. For Chinese and Japanese, we take the value of p that matches the increase in length of segmented sentences observed for other languages. (Formally, for English/French/etc. with BPE-dropout, sentences become on average about 1.25 times longer compared to those segmented with BPE; for Chinese and Japanese, we need to set p to 0.6 to achieve the same increase.)
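A hedged sketch of such a calibration is given below. It reuses the bpe_segment and bpe_dropout_segment helpers sketched earlier; the target ratio and the candidate grid are illustrative values, not the authors' procedure.

```python
# Pick the dropout probability p whose average sentence-length inflation
# (relative to standard BPE segmentation) is closest to a target ratio.
def length_ratio(sentences, merge_table, p, n_samples=5):
    bpe_len = sum(len(bpe_segment(w, merge_table))
                  for s in sentences for w in s.split())
    drop_len = sum(len(bpe_dropout_segment(w, merge_table, p=p))
                   for _ in range(n_samples) for s in sentences for w in s.split())
    return (drop_len / n_samples) / bpe_len

def calibrate_p(sentences, merge_table, target=1.25,
                grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)):
    return min(grid, key=lambda p: abs(length_ratio(sentences, merge_table, p) - target))
```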
Now we will look more closely at how the improvement from using BPE-dropout depends on corpora and vocabulary size.
First, we see that BPE-dropout performs best for all dataset sizes (Figure 3). Next, models trained with BPE-dropout are not sensitive to the choice of vocabulary size: models with 4k and 32k vocabularies show almost the same performance, which is in striking contrast to the standard BPE. This makes BPE-dropout attractive since it allows (i) not tuning vocabulary size for each dataset, and (ii) choosing vocabulary size depending on the desired model properties: models with smaller vocabularies are beneficial in terms of the number of parameters, while models with larger vocabularies are beneficial in terms of inference time (Table 4 shows that inference for models with 4k vocabulary is more than 1.4 times longer than for models with 32k vocabulary). Finally, we see that the effect from using BPE-dropout vanishes when the corpus size gets bigger. This is not surprising: the effect of any regularization is smaller in high-resource settings; however, as we will show later in Section 6.3, when applied to noisy source, models trained with BPE-dropout show substantial improvements of up to 2 BLEU even in high-resource settings.
Since BPE-dropout produces more fine-grained segmentation, sentences segmented with BPE-dropout are longer; the distributions of sentence lengths are shown in Figure 4 (a) (with p = 0.1, sentences are on average about 1.25 times longer). Thus there is a potential danger that models trained with BPE-dropout may tend to use more fine-grained segmentation at inference and hence slow inference down. However, in practice this is not the case: the distributions of lengths of generated translations for models trained with BPE and with BPE-dropout are close (Figure 4 (b)).
Table 4 confirms these observations and shows that inference time of models trained with BPE-dropout is not substantially different from the ones trained with BPE.
In this section, we analyze qualitative differences between models trained with BPE and BPE-dropout. We find that:
when using BPE, frequent sequences of characters rarely appear in a segmented text as individual tokens, occurring mostly as parts of bigger ones; BPE-dropout alleviates this issue;
by analyzing the learned embedding spaces, we show that using BPE-dropout leads to a better understanding of rare tokens;
as a consequence of the above, models trained with BPE-dropout are more robust to misspelled input.
Here we highlight one of the drawbacks of BPE's deterministic nature: since it splits words into unique subword sequences, only rare words are split into subwords. This forces frequent sequences of characters to mostly appear in a segmented text as parts of bigger tokens, and not as individual tokens. To show this, for each token in the BPE vocabulary we calculate how often it appears in a segmented text as an individual token and how often as a sequence of characters (which may be part of a bigger token or an individual token). Figure 6 shows the distribution of the ratio between a substring's frequency as an individual token and its frequency as a sequence of characters (for the top-10 most frequent substrings).
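A sketch of this computation is given below, assuming raw_words is the list of corpus words (with repetition) and segmented_corpus is the same corpus after BPE or BPE-dropout segmentation; all names are illustrative, not taken from the paper's analysis scripts.

```python
from collections import Counter

# For each vocabulary token, compare (a) how often it occurs as an individual token
# in the segmented corpus and (b) how often it occurs as a sequence of characters
# in the raw words (whether or not it was split out as a token).
def token_to_substring_ratios(raw_words, segmented_corpus, vocab):
    token_counts = Counter(tok.replace("</w>", "")
                           for sent in segmented_corpus for tok in sent)
    ratios = {}
    for tok in vocab:
        t = tok.replace("</w>", "")
        substring_count = sum(word.count(t) for word in raw_words) if t else 0
        if substring_count:
            ratios[tok] = token_counts[t] / substring_count
    return ratios
```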
For frequent substrings, the distribution of token to substring ratio is clearly shifted to zero, which confirms our hypothesis: frequent sequences of characters rarely appear in a segmented text as individual tokens. When a text is segmented using BPE-dropout with the same vocabulary, this distribution significantly shifts away from zero, meaning that frequent substrings appear in a segmented text as individual tokens more often.
Now we will analyze the embedding spaces learned by different models. We take embeddings learned by models trained with BPE and BPE-dropout and, for each token, look at the closest neighbors in the corresponding embedding space. Figure 5 shows several examples. In contrast to BPE, the nearest neighbors of a token in the embedding space of BPE-dropout are often tokens that share sequences of characters with the original token. To verify this observation quantitatively, we computed the character 4-gram precision of the top-10 neighbors: the proportion of 4-grams of the top-10 closest neighbors which are present among the 4-grams of the original token. As expected, embeddings of BPE-dropout have higher character 4-gram precision (0.29) compared to the precision of BPE (0.18).
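This metric can be computed as sketched below, assuming embeddings is a dict from tokens to vectors and that cosine similarity defines the neighborhood; it is an illustration of the metric, not the authors' exact script.

```python
import numpy as np

def char_ngrams(token, n=4):
    return {token[i:i + n] for i in range(len(token) - n + 1)}

# For each token, take its top-k nearest neighbors by cosine similarity and measure
# which fraction of the neighbors' character 4-grams also occur in the token itself.
def ngram_precision(embeddings, k=10, n=4):
    tokens = list(embeddings)
    vecs = np.stack([embeddings[t] for t in tokens])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    precisions = []
    for i, tok in enumerate(tokens):
        ref = char_ngrams(tok, n)
        if not ref:
            continue
        neighbors = np.argsort(-sims[i])[1:k + 1]  # skip the token itself
        grams = [g for j in neighbors for g in char_ngrams(tokens[j], n)]
        if grams:
            precisions.append(sum(g in ref for g in grams) / len(grams))
    return float(np.mean(precisions))
```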
This also relates to the study by Gong et al. (2018). For several tasks, they analyze the embedding space learned by a model. The authors find that while a popular token usually has semantically related neighbors, a rare word usually does not: a vast majority of the closest neighbors of rare words are rare words. To confirm this, we reduce the dimensionality of the embeddings by SVD and visualize them (Figure 7). For the model trained with BPE, rare tokens are in general separated from the rest; for the model trained with BPE-dropout, this is not the case. While Gong et al. (2018) propose adversarial training for embedding layers to alleviate this issue, we showed that a model trained with BPE-dropout does not have this problem.
Models trained with BPE-dropout better learn the compositionality of words and the meaning of subwords, which suggests that these models should be more robust to noise. We verify this by measuring the translation quality of models on a test set augmented with synthetic misspellings. We augment the source side of a test set by modifying each word with a fixed probability, applying one of the predefined operations. The operations we consider are (1) removal of one character from a word, (2) insertion of a random character into a word, and (3) substitution of a character in a word with a random one. This augmentation produces words with an edit distance of 1 from the unmodified words. Edit distance is commonly used to model misspellings Brill and Moore (2000); Ahmad and Kondrak (2005); Pinter et al. (2017).
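A sketch of this augmentation is shown below; the per-word noise probability used in the paper is left as a parameter here, and the restriction to lowercase ASCII replacement characters is an assumption of the sketch.

```python
import random
import string

# Produce a noised copy of a word at edit distance 1:
# delete, insert, or substitute one character.
def misspell(word, rng=random):
    if len(word) < 2:
        return word
    op = rng.choice(["delete", "insert", "substitute"])
    i = rng.randrange(len(word))
    if op == "delete":
        return word[:i] + word[i + 1:]
    ch = rng.choice(string.ascii_lowercase)
    if op == "insert":
        return word[:i] + ch + word[i:]
    return word[:i] + ch + word[i + 1:]

def augment_sentence(sentence, prob, rng=random):
    # Apply a random edit to each word with probability `prob`.
    return " ".join(misspell(w, rng) if rng.random() < prob else w
                    for w in sentence.split())
```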
Table 5 shows the translation quality of the models trained on the WMT14 dataset when given the original source and the source augmented with misspellings. We deliberately chose large datasets, where the improvements from using BPE-dropout are smaller. We can see that while for the original test sets the improvements from using BPE-dropout are usually modest, for the misspelled test sets the improvements are much larger: 1.6-2.3 BLEU. This is especially interesting since the models have not been exposed to misspellings during training. Therefore, even for large datasets, using BPE-dropout can result in substantially better quality for practical applications where input is likely to be noisy.
Closest to our work in motivation is the work by Kudo (2018), who introduced the subword regularization framework and a new segmentation algorithm. Other segmentation algorithms include Creutz and Lagus (2006), Schuster and Nakajima (2012), Chitnis and DeNero (2015), Kunchukuttan and Bhattacharyya (2016), Wu and Zhao (2018), Banerjee and Bhattacharyya (2018).
Regularization techniques are widely used for training deep neural networks. Among regularizations applied to network weights, the most popular are Dropout Srivastava et al. (2014) and L2 regularization. Data augmentation techniques in natural language processing include dropping tokens at random positions or swapping tokens at close positions Iyyer et al. (2015); Artetxe et al. (2018); Lample et al. (2018), replacing tokens at random positions with a placeholder token Xie et al. (2017), and replacing tokens at random positions with a token sampled from some distribution (e.g., based on token frequency or a language model) Fadaee et al. (2017); Xie et al. (2017); Kobayashi (2018). While BPE-dropout can be thought of as a regularization, our motivation is not to make a model robust by injecting noise. By exposing a model to different segmentations, we want to teach it to better understand the composition of words as well as subwords, and to make it more flexible in the choice of segmentation during inference.
Several works study how translation quality depends on the granularity of a segmentation Cherry et al. (2018); Kreutzer and Sokolov (2018); Ding et al. (2019). Cherry et al. (2018) show that character-level models, when trained long enough, tend to have better quality, but this comes with an increase in computational cost for both training and inference. Kreutzer and Sokolov (2018) find that, given flexibility in choosing the segmentation level, the model prefers to operate on (almost) character level. Ding et al. (2019) explore the effect of BPE vocabulary size and find that it is better to use a small vocabulary for low-resource settings and a large vocabulary for high-resource settings. Following these observations, in our experiments we use different vocabulary sizes depending on the dataset size to ensure the strongest baselines.
We introduce BPE-dropout – a simple and effective subword regularization method which operates within the standard BPE framework. The only difference from BPE is how a word is segmented during model training: BPE-dropout randomly drops some merges from the BPE merge table, which results in different segmentations for the same word. Models trained with BPE-dropout (1) outperform BPE and the previous subword regularization on a wide range of translation tasks, (2) have better quality of learned embeddings, and (3) are more robust to noisy input. Future research directions include adaptive dropout rates for different merges and an in-depth analysis of other pathologies in learned token embeddings for different segmentations.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), pp. 1929–1958.