1 Introduction
Language modelling (LM) is a fundamental task in natural language processing that requires a parametric model to generate tokens given past tokens. LM underlies all other types of structured modelling tasks in natural language, such as Named Entity Recognition, Constituency/Dependency Parsing, Coreference Resolution, Machine Translation
(Sutskever et al., 2014) and Question Answering (Mikolov et al., 2010). The goal is to learn a joint probability distribution p(w_1, …, w_N) for a sequence of length N containing words from a vocabulary V. This distribution can be decomposed into the conditional distributions of current tokens given past tokens using the chain rule, as shown in
Equation 1. In Neural Language Modelling (NLM), a Recurrent Neural Network (RNN)
parameterized by θ is used to encode the information at each timestep t into a hidden state vector h_t,
which is followed by a decoder z_t = W h_t + b and a normalization function φ (the softmax), which together form a probability distribution:

p(w_1, …, w_N) = ∏_{t=1}^{N} p(w_t | w_{<t}) = ∏_{t=1}^{N} φ(z_t)    (1)
However, training can be slow when the vocabulary is large, and the respective input embedding matrices also leave a large memory footprint. Conversely, in cases where the decoder is limited by an information bottleneck (Yang et al., 2017)
, the opposite is required: more degrees of freedom are necessary to alleviate information loss in the decoder bottleneck. Both scenarios correspond to a tradeoff between computational complexity and out-of-sample performance. Hence, we require that a newly proposed model has the property that the decoder can be easily configured to deal with this tradeoff in a principled way.
Lastly, standard supervised learning (self-supervised in the case of language modelling) assumes inputs are i.i.d. However, in sequence prediction, the model has to rely on its own predictions at test time, instead of the past targets that are used as input at training time. This difference is known as
exposure bias and can lead to errors compounding along a generated sequence. The training approach in which the teacher provides targets that are used as inputs at training time is known as teacher forcing. We also require that exposure bias is addressed in our approach while dealing with the aforementioned challenges related to computation and performance tradeoffs in the decoder. We propose an error-correcting output code (ECOC) based NLM (ECOC-NLM) that addresses these desiderata. In the approximate case where the codeword dimensionality is smaller than the vocabulary size, we show that, given sufficient error codes, we maintain accuracy compared to traditional NLMs that use the full softmax and other approximate methods. Lastly, we show that this latent-based NLM approach can be extended to mitigate the aforementioned problem of compounding errors by using Latent Mixture Sampling (LMS). LMS in an ECOC-NLM model also outperforms an equivalent Hierarchical Softmax-based NLM that uses Scheduled Sampling (Bengio et al., 2015) and other closely related baselines. To our knowledge, this is the first latent-based technique for mitigating compounding errors in recurrent neural networks (RNNs).
Our main contributions are summarized as follows:

An error-correcting output coded neural language model that requires fewer parameters than its softmax-based language modelling counterpart, given sufficient separability between classes via error checks.

A codebook rank-ordered by embedding cosine similarity that leads to well-separated codewords.

A Latent Mixture Sampling method to mitigate exposure bias in latent variable models. This is then extended to Differentiable Latent Mixture Sampling,
which uses the Gumbel-Softmax so that discrete categorical variables can be backpropagated through.

Novel baselines, such as Scheduled Hierarchical Sampling (SS-HS) and Scheduled Adaptive Sampling (SS-AS), are introduced in the evaluation of our proposed ECOC method. These apply scheduled sampling to two closely related softmax approximation methods.
2 Background
2.1 Error-Correcting Codes
Error-Correcting Codes (Hamming, 1950)
originate from seminal work in solid-state electronics around the time of the first digital computer. Later, binary codes were introduced in the context of artificial intelligence via the NETtalk system
(Sejnowski & Rosenberg, 1987), where each class index is represented by a respective binary code. A parametric model is then used to predict a probability for each binary bit position being active or not. This results in a predicted codeword which can then be measured against the ground truth codeword. At training time we optimize a bit-wise objective. At test time we choose the codeword that is closest to the predicted code, allowing for Hamming distances between codewords and the error-correcting codes. When the codeword length exceeds the number of bits needed to index the classes, the remaining bits are used as error-correction bits. This tolerance can be used to account for information loss by measuring the distance (e.g. Hamming) between the predicted codeword and the true codeword with error-correction bits. If the minimum Hamming distance between codewords is d_min, then at least ⌊(d_min − 1)/2⌋ bit errors can be corrected for; hence, if the Hamming distance of the prediction lies within this bound, we will still retrieve the correct codeword. In contrast, when one bit is used per class as in standard multi-class classification, error-correction cannot be achieved. Both error-correction bits and class bits make up the codebook. In order to achieve good separation between classes (i.e. codewords assigned so that mistakes due to close Hamming distances are less likely), the codes should have good row and column separation. The former refers to having equidistant Hamming distances between class codes, where the remaining codes are the error-correcting codes. Column separation refers to ensuring that the functions for each bit position are uncorrelated with one another. This can be achieved by maximizing the Hamming distance between the columns, similar to row separation. A primary aim in ECOC is that bit errors are uncorrelated and that the likelihood of simultaneous errors is low. This property makes it easier for error-correcting codes to correct errors; if many simultaneous errors are made, correction becomes more difficult.
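The nearest-codeword decoding rule described above can be sketched in a few lines; the 4-class codebook and bit width below are invented for illustration and are not the codebooks used in this work.

```python
import numpy as np

def nearest_codeword(predicted_bits, codebook):
    """Decode predicted bits to the class whose codeword is closest in
    Hamming distance (ties broken by lowest class index)."""
    hard = (predicted_bits >= 0.5).astype(int)   # threshold each bit
    dists = (codebook != hard).sum(axis=1)       # Hamming distance per row
    return int(dists.argmin())

# Toy 4-class codebook with minimum pairwise Hamming distance d_min = 3,
# so floor((3 - 1) / 2) = 1 bit error can always be corrected.
codebook = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 1],
])
noisy = np.array([1, 1, 1, 1, 0, 0])   # codeword 1 with one flipped bit
assert nearest_codeword(noisy, codebook) == 1
```

Flipping any single bit of any codeword in this toy codebook still decodes to the original class, which is exactly the tolerance the error-correction bits buy.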
ECOCs have been used for multi-class document classification (Berger, 1999). The authors also propose a coding-theoretic argument as to why randomly assigned codes can result in well-separated codes. However, they make the strong assumption of class independence, which rarely holds for typical natural language problems. This work addresses that limitation by using semantically-driven separation for language modelling.
ECOC can be considered an ensemble method for multi-class classification, since the model needs to make a prediction for each binary unit of the output code (similar to Bagging in ensemble learning), albeit with a distinct code being predicted in each binary classification. Kong & Dietterich (1995)
have shown that this distinction over voting methods leads to variance reduction and bias correction in each respective ECOC classifier. This is different from regular multi-class classification, where one prediction is made from a distribution over
all classes.

2.2 Why Latent Codes for Neural Language Modelling?
Targets in standard training of NLMs are represented as one-hot vectors (i.e. a Kronecker delta) and the problem is treated as one-vs-rest multi-class classification. This can be considered a special case of ECOC classification where the codebook is represented by an identity matrix. ECOC classification is well suited over explicitly using observed variables when the output space is structured. Flat (one-vs-rest) classification ignores the dependencies in the outputs, in comparison to using latent codes that share some common latent variables between associated words. For example, in the former case, if we train a model that only observes the word silver in a sequence "…silver car…" and then at test time observes silverback, because there is a high association between silver and car, the model is more likely to predict car instead of gorilla. ECOC is less prone to such mistakes because, although some bits may differ between the latent codes for car and gorilla, the potential misclassifications can be corrected with the error-correcting bits. In other words, latent coding can reduce the variance of each individual classifier and has some tolerance to mistakes induced by sparse transitions, proportional to the number of error checks used. Furthermore, one-vs-rest classification requires that all class boundaries be learned at once, whereas in ECOC we must only build class boundaries for the code bits, a number typically close to the log2 lower bound. In fact, in the case of language modelling, where the vocabulary is commonly large, each boundary is learned multiple times and the model is therefore more likely to recover from mistakes, in the same way ensembles reduce variance in prediction (Kong & Dietterich, 1995).
2.3 Methods for Softmax Approximation
2.3.1 Loss-Based Methods
Hierarchical Softmax (HS) Goodman (2001); Morin & Bengio (2005) propose to use short codes that represent marginals of the conditional distribution, where the product of the marginals obtained along a path in the tree approximates the conditional distribution. This speeds up training by summing over the paths of a binary tree in which intermediate nodes assign relative probabilities to child nodes. Therefore, only a few sums are necessary, along the binary path to a given leaf (i.e. word). The probability of a word is the product of the probabilities of taking a left or right turn at every intermediate node on its path, where the probability of transitioning right at a node is a sigmoid of the dot product between the hidden state and that node's parameters. The conditional is then the product of these branching decisions over the path depth to the word, where each decision point along the path transitions to either the left child or the right child. Defining a good tree structure improves performance, since semantically similar words have a shorter path and therefore similar representations are learned for similar words. This can be achieved either by clustering words via term frequency using a Huffman tree (Mikolov et al., 2013), or by using already-defined word groups from semantic networks such as WordNet (Morin & Bengio, 2005). As we will discuss in section 4,
we also build upon HS by proposing a method that interpolates between predicted codes and target codes to make the model more robust to its own errors, and use this as a reasonable baseline for ECOC-NLM, which also integrates this method into training.
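The path-probability computation can be sketched as follows; the node scores, tree layout, and function names are a minimal hypothetical illustration, not the paper's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_probability(node_scores, path):
    """Probability of a leaf (word): the product of binary branch
    decisions along its path.  `node_scores` holds one logit per internal
    node (e.g. the dot product of hidden state and node parameters);
    `path` lists (node_id, go_right) decisions for the target word."""
    p = 1.0
    for node_id, go_right in path:
        p_right = sigmoid(node_scores[node_id])
        p *= p_right if go_right else (1.0 - p_right)
    return p

# Word reached by going left at the root (node 0), then right at node 1:
node_scores = {0: 0.0, 1: 2.0}
p = path_probability(node_scores, [(0, False), (1, True)])
assert abs(p - 0.5 * sigmoid(2.0)) < 1e-12
```

Only the logits along one root-to-leaf path are touched, which is the source of the speedup over a full softmax.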
Differentiated Softmax (DS) uses a sparse linear block of weights for the decoder, where a set of partitions is made according to the unigram distribution and the number of weights is assigned proportional to term frequency. This is intuitive, since rare words require fewer degrees of freedom to account for the few contexts in which they appear, in comparison to common words. For a sparse decoder weight matrix, each partition has its own dimensionality: larger for common words and smaller for rare words. Both the number of partitions and their dimensionalities can be tuned at training time.
Adaptive Softmax (AS) Grave et al. (2016) provide an approximate hierarchical model that directly accounts for the computation time of matrix multiplications. AS results in 2x-10x speedups compared to the standard softmax, dependent on the size of the corpus and vocabulary. Interestingly, they find that on sufficiently large corpora (Text8, Europarl and One Billion Word datasets), they manage to maintain accuracy while reducing the computation time.
2.3.2 Sampling-Based Methods
Importance Sampling (IS) is a classical Monte Carlo sampling method used to approximate probability distributions. A proposal distribution q, often chosen to be simple and close to the true distribution p, is used to draw Monte Carlo samples at much less cost than drawing directly from p. In language modelling, it is common that the unigram distribution is used for q. The expectation over sampled words approximates the gradient of the softmax. However, computing the normalization for each sample is still required, and therefore methods have been proposed to compute a product of marginals that avoids expensive normalization over the MC samples (Bengio et al., 2003a). Adaptive Importance Sampling (AIS) (Jean et al., 2014) only considers a set fraction of target tokens to sample from. This involves partitioning the training set, where each partition is designated a subset of the vocabulary. Therefore, there is a separate predictive distribution for each partition, and for a given partition all words within it are assigned some probability.
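The importance-sampling idea can be illustrated by estimating the softmax normalizer with samples drawn from a proposal distribution; the uniform proposal (a stand-in for the unigram distribution), vocabulary size, and sample count below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_log_normalizer(logits, q, k):
    """Monte Carlo estimate of log Z, where Z = sum_w exp(s_w), using
    proposal q:  Z = E_q[exp(s_w) / q(w)], approximated by the mean of
    the importance weights over k samples drawn from q."""
    samples = rng.choice(len(logits), size=k, p=q)
    weights = np.exp(logits[samples]) / q[samples]
    return np.log(weights.mean())

vocab = 10_000
logits = rng.normal(size=vocab)            # toy decoder scores
q = np.full(vocab, 1.0 / vocab)            # uniform proposal for the sketch
approx = is_log_normalizer(logits, q, k=5_000)
exact = np.log(np.exp(logits).sum())
assert abs(approx - exact) < 0.5           # rough agreement on this toy case
```

Only k proposal samples are scored instead of all |V| outputs, which is where the savings over exact normalization come from.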
Noise Contrastive Estimation Mnih & Teh (2012)
propose to use Noise Contrastive Estimation (NCE)
(Gutmann & Hyvärinen, 2010) as a sampling method in an unnormalized probabilistic model that is more stable than IS (which can lead to the model distribution diverging from the underlying data distribution). Similar to our proposed ECOC-NLM, NCE treats density estimation as multiple binary classification problems, but differs in that it does so by discriminating between data samples and samples from a known noise distribution (e.g. noise proportional to the unigram distribution). The posterior probability that a sample came from the data rather than the noise distribution is expressed as a function of the model score for the hidden state and the noise probability of the generated samples. NCE is different from IS since it does not estimate the word probabilities directly, but instead uses an auxiliary loss that maximizes the probability of the correct words over noisy samples.

2.4 Recent Applications of Latent Codes
Shu & Nakayama (2017) recently used compositional codes for word embeddings to cut down on memory requirements in mobile devices. Instead of using binary coding, they achieve word embedding compression using multi-codebook quantization. Each code component takes a discrete value (0-9), and therefore fewer components than binary bits
are required. They too propose to use the Gumbel-Softmax trick, but for the purpose of learning the discrete codes. Performance was maintained for sentiment analysis and machine translation at 94% and 98% respective compression rates.
Shi & Yu (2018) propose a product-quantization structured embedding that reduces memory by 10-20 times the number of parameters, while maintaining performance. This involves slicing the embedding tensor into groups which are then quantized independently. The embedding matrix is represented with an index vector and a codebook tensor of the quantized partial embeddings.
Zhang et al. (2013) propose a weighted Hamming distance to avoid ties when ranking by Hamming distance, which are common, particularly for short codes. This is relevant to our work in the context of assigning error-checking bits by Hamming distance to codewords that correspond to classes.

3 Codebook Construction
A challenging aspect of assigning codewords is ordering the codes such that, if errors are made, the resulting incorrect codeword is at least semantically closer to the correct one than the codewords of less related words, while ensuring good separation between codes. Additionally, we have to consider the number of error-checking bits to use. In theory, log2 of the number of classes is a sufficient number of bits to account for all classes. However, this alone can lead to a degradation in performance. Hence, we also consider a larger number of error-checking bits. In this case, the error-checking bits can account for more mistakes induced by other classes, which may be correlated. In contrast, probability distributions naturally account for these correlations, as the mass needs to shift relative to the activation of each output. This point is particularly important for language modelling because of the high dimensionality of the output. The most naive way to create the codebook is to simply assign binary codes to each word in random order. However, it is preferable to assign similar codes to semantically similar words in the vocabulary, while maximizing the Hamming distance between codes, with leftover error codes separating the class codes.
3.1 Codebook Arrangement
A fundamental challenge in creating the codebook is how error codes are distributed between codes so as to maximize the separability between codewords that are more likely to be interchangeably and incorrectly predicted. This is related to the second challenge of choosing the dimensionality of the codebook. The latter is dependent on the size of the corpus, and in some cases might only require a number of bits close to log2 of the number of classes, with the leftover bits used for error checking. These two decisions correspond to a tradeoff between the computational complexity and accuracy of our neural language model, akin to the tradeoff between tree expressivity in the Hierarchical Softmax and using the full Softmax. Below we describe a semantically motivated method to achieve well-separated codewords, followed by a guide on how to choose the codebook dimensionality.
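For intuition on the dimensionality choice: the minimum number of class bits is ceil(log2 |V|), and anything beyond that is available for error checking. The redundancy multiplier below is an arbitrary illustration, not a value recommended in this work.

```python
import math

def codeword_bits(vocab_size, redundancy=4.0):
    """Return (class_bits, total_bits): the minimum number of bits needed
    to index the vocabulary, and a total that leaves room for
    error-checking bits via an assumed redundancy multiplier."""
    class_bits = math.ceil(math.log2(vocab_size))
    total_bits = int(class_bits * redundancy)
    return class_bits, total_bits

class_bits, total_bits = codeword_bits(10_000)
assert class_bits == 14                 # 2^14 = 16384 >= 10000
assert total_bits - class_bits == 42    # leftover bits act as error checks
```

Sliding the multiplier between 1 (no redundancy) and |V|/class_bits trades compute for error tolerance, which is the tradeoff discussed above.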
3.1.1 Embedding Similarity-Based Codebooks
Previous work on ECOC has focused on theories as to why randomly generated codes lead to good row and column separation (Berger, 1999). However, this assumes that class labels are conditionally independent, and therefore it does not apply well to language modelling, where the output space is loosely structured. To address this, we propose to reorder the codebook such that the Hamming distance between any two codewords is proportional to the embedding similarity of the corresponding words. Moreover, separating codewords by semantic similarity can be achieved by assigning a number of error-checking bits proportional to rank-ordered similarity with respect to a chosen query word embedding. The similarity scores between embeddings are used to reorder the codebook. Good separation is achieved when codes are separated proportional to the cosine similarity between the embedding of the most frequent word and the embeddings of the remaining words. Therefore, words with high similarity have corresponding codes that are closer in Hamming distance. This ensures that even when codes are correlated, incorrect latent predictions are at least more likely to correspond to semantically related words; in the random case, we are not guaranteed that codes close in Hamming distance are close in a semantic sense. Given redundant codewords, we require an assignment that leads to a strongly separated codebook. Let g denote a function that assigns error-checking codewords to each class codeword. In practice, g normalizes the resulting embedding similarities using a cumulative sum (cumsum) to assign the intervals between adjacent codeword spans. This results in greater distance between words that are more similar to the query word, and fewer error-checking codewords for relatively rarer words, which tend to have few neighbouring words in the embedding space.
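A simplified sketch of the reordering step: cosine similarities to a single query embedding (e.g. the most frequent word) rank the rows of the codebook. The variable-width error-checking spans assigned via the cumulative sum are omitted, and all names and toy sizes are hypothetical.

```python
import numpy as np

def similarity_ordered_codebook(embeddings, codebook, query_idx=0):
    """Reorder codebook rows so the i-th most similar word to the query
    receives the i-th codeword; for frequency-ordered codes this places
    similar words close together in Hamming distance."""
    q = embeddings[query_idx]
    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)        # word indices ranked by similarity
    reordered = np.empty_like(codebook)
    reordered[order] = codebook      # rank-i word gets codebook row i
    return reordered

embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
codebook = np.eye(4, dtype=int)      # placeholder 4-bit codebook
reordered = similarity_ordered_codebook(embeddings, codebook)
assert (reordered[2] == codebook[3]).all()   # least similar word, last row
```

The query word keeps the first row, and the least similar word is pushed to the far end of the codebook, mirroring the rank-ordered assignment described above.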
3.1.2 Random Codebooks
Berger (1999) finds that a well row-separated binary codebook can be defined as one where all rows have a relative Hamming separation of at least a given threshold, and bounds the probability that a randomly constructed binary matrix is not well row-separated. Further, for any two rows of a well-separated codebook whose relative Hamming separation falls within the required range, the probability that a randomly constructed codebook is not strongly well-separated is also bounded. We consider such random codebooks as one of the baselines when evaluating ECOC against other related approximate methods in NLM.
4 Latent Mixture Sampling
To mitigate exposure bias for latent-based language modelling, we propose a sampling strategy that interpolates between predicted and target codewords. We refer to this as Latent Mixture Sampling (LMS) and to its application to ECOC as Codeword Mixture Sampling (CMS).
4.1 Curriculum-Based Latent Mixture Sampling
In Curriculum-Based Latent Mixture Sampling (CLMS), the mixture probability starts at 0 in the first epoch and monotonically increases throughout training towards a per-bit threshold reached after a given number of epochs. A Bernoulli sample is carried out for each timestep in each minibatch. The probabilities per dimension are independent and govern keeping a prediction instead of the corresponding bit of the target codeword at that timestep. The reason for having individual mixture probabilities per bit is that, when we consider a default order in the codebook, tokens are assigned codewords ranked by frequency; therefore, errors in the leftmost, most significant bits matter more than small differences (e.g. only 1 bit) near the end of the codeword. In this paper we report results when using a sigmoidal schedule, as shown in Equation 2, where τ_max represents the temperature at the last epoch E and k is a scaling factor that controls the slope of the sigmoid:

p_e = τ_max / (1 + exp(−k(e/E − 1/2)))    (2)
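A possible sigmoidal schedule consistent with the description above (the exact parameterization of Equation 2 may differ; the tau_max and k values here are illustrative):

```python
import math

def mixture_probability(epoch, total_epochs, tau_max=0.9, k=10.0):
    """Curriculum mixture probability: starts near 0 and rises
    monotonically towards tau_max; k scales the slope of the sigmoid."""
    progress = epoch / total_epochs      # training progress in [0, 1]
    return tau_max / (1.0 + math.exp(-k * (progress - 0.5)))

probs = [mixture_probability(e, 30) for e in range(31)]
assert all(b >= a for a, b in zip(probs, probs[1:]))    # monotone increase
assert probs[0] < 0.01 and abs(probs[-1] - 0.9) < 0.01  # 0 -> tau_max
```

Early epochs are almost pure teacher forcing; by the end, most bits are drawn from the model's own predictions, up to the cap tau_max.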
4.2 Latent Soft-Mixture Sampling
In standard CMS, we pass the token index corresponding to the most probable bit predictions at the last time step, which is converted to an input embedding. We can instead replace the argmax operator with a soft argmax that uses a weighted average of embeddings, where the weights are assigned from the previous predicted output via a softmax normalization with an inverse temperature β that
controls the kurtosis of the probability distribution (as β → ∞, the soft argmax
tends to the argmax), as shown in Equation 3:

x̄_t = Σ_{w∈V} softmax(β z_{t−1})_w · e_w    (3)

where z_{t−1} are the previous output scores and e_w is the embedding of word w.
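A sketch of the soft-argmax weighted embedding average; the inverse-temperature name beta and the toy sizes are assumptions for the example.

```python
import numpy as np

def soft_argmax_embedding(logits, embedding_matrix, beta=5.0):
    """Weighted average of embeddings under softmax(beta * logits); as
    beta grows, the average collapses onto the argmax embedding."""
    z = beta * logits
    z = z - z.max()                  # subtract max for numerical stability
    w = np.exp(z) / np.exp(z).sum()
    return w @ embedding_matrix

rng = np.random.default_rng(0)
logits = np.linspace(0.0, 1.0, 50)   # toy output scores over 50 tokens
E = rng.normal(size=(50, 8))         # toy embedding matrix
soft = soft_argmax_embedding(logits, E, beta=1000.0)
assert np.allclose(soft, E[logits.argmax()], atol=1e-3)  # large beta ~ argmax
```

With a small beta the input blends many candidate embeddings, keeping the operation smooth and differentiable; with a large beta it recovers the hard choice.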
In the ECOC-NLM, we consider binary codewords and therefore choose the least probable bits to flip according to the curriculum schedule. Hence, this results in a set of candidate codewords, each within a small Hamming distance of the prediction. Concretely, this is a soft interpolation between past targets and a weighted sum of the most probable codewords, where a Bernoulli sample chooses one or the other for each dimension of the codeword.
5 Differentiable Latent Sampling
The previous curriculum strategies disregard where the errors originate from. Instead, they interpolate between model predictions of latent variables and targets in a way that does not distinguish between cascading errors and localized errors. This means they only correct errors after they are made, instead of directly correcting for the origin of the errors. Maddison et al. (2016) showed that such discrete operations can be approximated by a continuous relaxation using the reparameterization trick, also known as the Concrete distribution. Applying such a relaxation allows us to sample from the distribution across codes while keeping the objective fully differentiable, similar to recent work (Goyal et al., 2017). We extend this to mixture sampling by replacing the argmax operation with the Concrete distribution, allowing gradients to be adjusted at points where prior predictions changed value throughout training. This not only identifies at which timestep an error occurs, but also which latent variables (i.e. output codes) had the most influence in generating the error. This is partially motivated by the finding that, in the latent variable formulation of simple logistic regression models, the latent variable errors follow a Gumbel distribution. Hence, sampling latent codes inversely proportional to the errors from a Gumbel distribution seems a suitable strategy.
Gumbel-Softmax Similarly, instead of passing the most likely predicted word, we can sample from the output distribution and pass the sampled index as the next input. This is an alternative to always acting greedily and allows the model to seek other likely actions. However, in order to compute derivatives through samples from the softmax, we need to avoid discontinuities such as the argmax operation. The Gumbel-Softmax (Maddison et al., 2016; Jang et al., 2016) allows us to sample and differentiate through the softmax by providing a continuous relaxation that yields probabilities instead of a step function (i.e. argmax). Component-wise Gumbel noise g_i = −log(−log u_i), with u_i drawn from Uniform(0, 1), is added to the log-probabilities log π_i of the discrete distribution; the index that maximizes log π_i + g_i is then a sample from that distribution, and the relaxed sample with temperature τ is

y_i = exp((log π_i + g_i)/τ) / Σ_j exp((log π_j + g_j)/τ)    (4)
For ECOC, we instead consider Bernoulli random variables, which for the Concrete distribution can be expressed by means of two arbitrary Gumbel distributions.
The difference of two Gumbel random variables follows a Logistic distribution, so the Logistic noise can be sampled as L = log U − log(1 − U) with U ~ Uniform(0, 1). Hence, for location α, if L + log α > 0 then the hard sample is 1, otherwise 0; in the zero-temperature limit this step function corresponds to the Gumbel Max-Trick (Jang et al., 2016). The sampling process for a Binary Concrete random variable involves sampling U, computing the Logistic noise L, and setting, as shown in Equation 5,

X = σ((log α + L)/λ)    (5)

where σ is the sigmoid function and λ is the temperature. This Binary Concrete distribution is henceforth denoted as BinConcrete(α, λ), with location α and temperature λ.
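The Binary Concrete sampling procedure just described can be sketched directly; the log_alpha and temperature values are illustrative, and the sigmoid is written in an overflow-safe form.

```python
import math, random

def _sigmoid(x):
    # numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))

def binary_concrete_sample(log_alpha, temperature):
    """U ~ Uniform(0,1); L = log U - log(1 - U) is Logistic noise; the
    relaxed Bernoulli sample is sigmoid((log_alpha + L) / temperature).
    As temperature -> 0 this hardens to {0, 1} (the Gumbel Max-Trick)."""
    u = random.random()
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return _sigmoid((log_alpha + logistic_noise) / temperature)

random.seed(0)
samples = [binary_concrete_sample(log_alpha=2.0, temperature=0.05) for _ in range(1000)]
near_binary = sum(1 for x in samples if x < 0.01 or x > 0.99) / len(samples)
assert near_binary > 0.9   # low temperature concentrates mass near 0 and 1
```

Because the sample is a smooth function of log_alpha, gradients can flow through it, which is exactly what the curriculum strategies above lack.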
This is used for ECOC and other latent variable-based models, such as Hierarchical Sampling, to propagate through past decisions and make corrective updates that backpropagate to where errors originated along the sequence. Hence, we also carry out experiments with BinConcrete (Equation 5) and the Gumbel-Softmax (Equation 4) for HS and ECOC. The temperature can be kept static, annealed according to a schedule, or learned during training; in the latter case this is equivalent to entropy regularization (Szegedy et al., 2016; Grandvalet & Bengio, 2005), which controls the kurtosis of the distribution. In this work, we consider an annealed temperature, similar to Equation 2, starting high so as to allow the model to avoid large gradient variance in updates early on. In the context of using the Gumbel-Softmax in LMS, this allows the model to become robust to non-greedy actions gradually throughout training; we would expect such exploration to improve generalization proportionally to the vocabulary size.
6 Experimental Setup
We carry out experiments with 2-hidden-layer Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models with a fixed embedding size,
a fixed Backpropagation Through Time (BPTT) length, and variational dropout (Gal & Ghahramani, 2016) with rate 0.2 for input, hidden and output layers. The ECOC-NLM model is trained using a binary cross-entropy loss over the codeword bits, including the error-checking codewords.

Baselines for ECOC-Neural Language Model The first set of experiments includes comparisons against the most related baselines: Sample-Softmax (Bengio et al., 2003b; Bengio & Senécal, 2008), Hierarchical Softmax (HS), AS (Grave et al., 2016), and NCE (Mnih & Teh, 2012). For HS, we use a 2-hidden-layer tree with a fixed branching factor (number of classes) by default. For AS, we split the output into 4 groups via the unigram distribution (in percentages of total words: 5%-15%-30%-100%). For NCE, we set the noise ratio to 0.1 for PTB and 0.2 for WikiText-2 and WikiText-103. Training is carried out until near convergence; the randomly initialized HS and Sampled Softmax take longer to converge. Table 1 reports the results for a fixed number of samples in the case of Rand/Unigram-Sample-SM. For Rand/Unigram Hierarchical SM, we use a 2-hidden-layer tree with 10 classes per child node.
Baselines for ECOC Mixture Sampling To test Latent Mixture Sampling (LMS), we directly compare its application in HS and ECOC, two closely related latent NLM methods. Additionally, we compare the performance of LMS against the most related sampling-based supervised learning technique, scheduled sampling (SS) (Bengio et al., 2015). For SS with cross-entropy training (SS-CE), we also consider a soft-argmax baseline (Soft-SS-CE), where a weighted average embedding is generated proportional to the predicted probability distribution.
Evaluation Details In order to compute perplexities for ECOC-NLM, we must view the codewords in terms of a product of marginal probabilities. At training time we choose the most confident prediction within the span of error checks for a codeword; i.e. among the error checks corresponding to a particular token, we choose the most probable of these checks as the value when computing the binary cross-entropy loss. At test time, if the predicted codeword falls within the error-checking bits of a codeword, it is deemed a correct prediction and assigned the highest probability of all predictions. We note that we only convert the ECOC predictions to perplexities to be comparable against the baselines. ECOCs can also be evaluated using Hamming distance or Mean Reciprocal Rank when the codes are ordered semantically or by Hamming distance (i.e. Unigram-ECOC or Embedding-ECOC).
7 Results
7.1 Error-Correcting Output Coded NLM Results
We first compare our proposed ECOC-NLM to the aforementioned methods that approximate softmax normalization, using binary trees and latent codes ordered by unigram frequency (Unigram-Hierarchical-SM and Unigram-ECOC). This is also the ordering we use to compare our proposed CMS-ECOC sampling method to scheduled sampling (Bengio et al., 2015) in standard cross-entropy training with softmax normalization. Although these are not directly comparable, since ECOC-NLM introduces a whole new paradigm, we use the common evaluation measures of Hamming distance and accuracy to provide some baseline against which our proposed method can be compared. Figure 2 shows the reduction in perplexity as the number of ECOC-LSTM decoder parameters increases with more bits added to the codeword. For PTB, large perplexity reductions are made between 14-100 code bits, while between 100-1000 code bits there is a gradual decrease. In contrast, we see that more is gained from increasing the codeword size for WikiText-2 and WikiText-103 (which preserve the words that fall within the long tail of the unigram distribution). We find the discrepancy in performance between randomly assigned codebooks and ordered codebooks is more apparent at large compression rates. Intuitively, the general problem of well-separated codes is alleviated as more bits are added.
Table 1 shows that overall, ECOC with a rank-ordered embedding similarity codebook (Embedding-ECOC) performs almost as well as the full softmax (8.02M parameters) while only using 1000 bits for PTB and 5K bits for WikiText-2 and WikiText-103. The HS-based models use a 2-hidden-layer tree with 10 tokens per class, resulting in 4.4M parameters for PTB and 22.05M parameters for WikiText-2 (full softmax: 40.1M) and WikiText-103. Moreover, we find a consistent improvement when using Embedding-ECOC over a random codebook (Random-ECOC), and a slight improvement over a unigram-ordered codebook (Unigram-ECOC). Note that in both Embedding-ECOC and Unigram-ECOC, the number of error-checking bits is assigned inversely proportional to the rank position when ordering by embedding similarity (as discussed in Section 3.1.1) and unigram frequency, respectively. We also found that using too many bits takes much longer to converge with negligible perplexity reductions. Hence, the main advantage of ECOC-NLMs is the large compression rate while maintaining performance (e.g. for PTB there is less than a 2-point perplexity difference compared to the full softmax).
Model  PTB  WikiText2  WikiText103  

Val.  Test  Val.  Test  Val.  Test  
Full SM  
GRU  85.49  78.81  126.28  122.66  59.23  51.44 
LSTM  86.19  79.24  124.01  119.30  56.72  49.35 
Rand-Sample-SM  
GRU  94.42  83.79  138.91  131.48  70.08  60.80 
LSTM  92.14  81.82  136.47  129.29  68.95  59.34 
Unigram-Sample-SM  
GRU  91.23  82.45  134.49  128.29  67.10  57.62 
LSTM  90.37  81.36  133.08  127.19  66.23  57.09 
Rand-Hierarchical-SM  
GRU  96.83  89.93  134.11  127.88  65.01  55.79 
LSTM  94.31  88.50  133.69  127.12  62.29  54.28 
Unigram-Hierarchical-SM  
GRU  94.35  87.67  131.34  124.91  63.18  54.67 
LSTM  92.38  86.70  130.26  124.83  62.02  54.11 
Adaptive-SM  
GRU  92.11  85.74  129.90  122.26  60.95  53.03 
LSTM  91.38  85.29  118.89  120.92  60.27  52.63 
NCE  
GRU  98.62  92.03  131.34  126.17  62.68  54.90 
LSTM  96.79  89.30  131.20  126.82  61.11  54.52 
Random-ECOC  
GRU  92.47  87.28  132.61  124.22  61.33  52.80 
LSTM  91.00  87.19  131.01  123.29  56.12  52.43 
Unigram-ECOC  
GRU  87.43  80.39  127.79  120.97  58.12  51.88 
LSTM  86.44  82.29  129.76  120.51  52.71  48.37 
Embedding-ECOC  
GRU  86.03  80.45  127.40  122.01  58.28  51.67 
LSTM  84.40  77.53  125.06  120.34  57.37  49.09 
7.2 Latent Mixture Sampling Results
Figure 3 shows how validation perplexity on WikiText-2 changes throughout training an LSTM as the temperature tends towards a maximum of 2.5, towards a maximum of 10, and in the case where it is kept constant at 1. We see that too much exploration (a maximum of 10) leads to an increase in perplexity: as the temperature approaches 10, the validation perplexity begins to rise. In contrast, we find that a slow monotonic increase to 2.5 leads to steady improvement, by which point (epoch 24) the model has almost converged. Table 2 shows all results of LMS when used in HS- and ECOC-based NLM models. We baseline this against both SS and the soft-argmax version of SS, the most related sampling-based supervised learning approach to LMS. Furthermore, we report results for CLMS-ECOC (Curriculum-LMS ECOC), which mixes between true targets and codeword predictions according to the schedule in Equation 2, and a differentiable extension of LMS via samples from the Gumbel-Softmax (DLMS-ECOC). At training time for both DLMS-ECOC and DLMS-Hierarchical-SM, we sample from each softmax defined along the path to the target code. We find that using a curriculum in CMS-ECOC performs better in general when mixing code predictions and targets, outperforming the full softmax that uses scheduled sampling (SS-SM). Lastly, we note that DLMS-ECOC is comparable in performance to CLMS-ECOC, and improves performance on WikiText-2. There is a consistent improvement using LMS over SS, which suggests that LMS is an effective alternative when directly optimizing over latent variables; i.e. mixture sampling is less suited to the full softmax, since the target is extremely sparse (a Dirac delta distribution).
                      PTB               WikiText-2          WikiText-103
                      Val.     Test     Val.      Test      Val.     Test
SS-SM
  GRU                 82.49    75.36    123.39    119.71    57.22    49.39
  LSTM                81.17    75.24    124.01    119.30    56.72    49.35
Soft-SS-SM
  GRU                 78.23    70.60    120.34    116.04    54.49    46.59
  LSTM                77.48    69.81    119.93    115.27    54.02    45.77
SS-Adaptive-SM
  GRU                 82.11    77.88    122.23    118.57    58.36    51.01
  LSTM                82.45    78.03    122.37    118.59    57.81    49.08
SS-Hierarchical-SM
  GRU                 85.29    78.83    124.24    121.60    60.48    52.19
  LSTM                85.56    78.17    123.88    120.91    59.76    51.59
SS-ECOC
  GRU                 86.14    78.44    125.12    121.52    58.49    50.68
  LSTM                86.02    78.57    124.39    120.81    58.23    50.29
Soft-SS-ECOC
  GRU                 85.78    78.12    124.69    121.13    58.18    50.33
  LSTM                85.11    77.59    123.94    120.82    57.01    49.26
CLMS-Hierarchical-SM
  GRU                 84.09    77.83    124.31    120.60    59.69    51.27
  LSTM                84.11    77.13    123.23    121.35    59.56    50.41
DLMS-Hierarchical-SM
  GRU                 82.47    78.03    124.07    122.27    59.31    53.18
  LSTM                81.83    77.40    123.51    121.78    58.63    52.72
CLMS-ECOC
  GRU                 80.34    78.55    122.89    118.07    58.29    50.37
  LSTM                80.67    78.39    122.27    117.90    57.81    50.03
DLMS-ECOC
  GRU                 79.34    74.25    120.89    117.89    58.71    50.28
  LSTM                79.67    76.39    119.27    117.41    59.35    51.67
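The SS- and LMS-style methods compared here all share one core mechanic: at each timestep, choose between the gold token and the model's own prediction with some annealed probability. A minimal sketch, assuming an inverse-sigmoid decay schedule (a common choice in scheduled sampling; the paper's Equation 2 defines its own schedule, and `mix_inputs` is a hypothetical helper):

```python
import numpy as np

def mix_inputs(gold_ids, model_ids, epoch, k=10.0, rng=None):
    """Curriculum mixing of teacher-forced and model-predicted inputs.

    With probability eps(epoch) feed the gold token, otherwise feed the
    model's own prediction; eps decays via an inverse-sigmoid schedule
    from roughly 0.9 towards 0 as training progresses.
    """
    rng = rng or np.random.default_rng(0)
    eps = k / (k + np.exp(epoch / k))
    use_gold = rng.uniform(size=len(gold_ids)) < eps
    return np.where(use_gold, gold_ids, model_ids)

gold = np.array([5, 2, 7, 1])   # target token ids at each timestep
pred = np.array([5, 3, 7, 4])   # model's own predictions
mixed_early = mix_inputs(gold, pred, epoch=0)    # mostly gold tokens
mixed_late = mix_inputs(gold, pred, epoch=100)   # mostly model predictions
```

In the latent variants, the mixed-in prediction is a codeword (or a node on the tree path) rather than a word id, which is what makes the target distribution dense enough for mixing to help.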
8 Conclusion
This work proposed an error-correcting neural language model and a novel Latent Mixture Sampling method for latent variable models. We find that performance is maintained compared to using the full conditional and related approximate methods, given a codeword size sufficient to account for correlations among classes: 40 bits for PTB and 100 bits for WikiText-2 and WikiText-103. Furthermore, performance improves when rank-ordering the codebook via embedding similarity, where the query is the embedding of the most frequent word. Lastly, we introduced Latent Mixture Sampling, which can be integrated into the training of latent-based language models, such as the ECOC-based language model, to mitigate exposure bias. We find that this method outperforms well-known sampling-based methods for reducing exposure bias when training neural language models with maximum likelihood.
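The error-correcting slack that the spare bits provide can be illustrated with nearest-codeword decoding. The sketch below uses a toy repetition code rather than the learned, embedding-ordered codebooks described above (which use 40-100 bits); all names here are ours:

```python
import numpy as np

# Toy codebook: a 2-bit word id with each bit repeated 4 times, so any two
# codewords differ in at least 4 positions and one flipped bit is corrected.
vocab_size, reps = 4, 4
ids = np.arange(vocab_size)
base = np.stack([(ids >> 1) & 1, ids & 1], axis=1)   # 2-bit binary ids
codebook = np.repeat(base, reps, axis=1)             # shape (4, 8)

def decode(bit_probs, codebook):
    """Map predicted per-bit probabilities to the nearest codeword's word id.

    Hard-threshold the bit probabilities, then return the codeword with the
    smallest Hamming distance.
    """
    bits = (np.asarray(bit_probs) > 0.5).astype(int)
    dists = np.abs(codebook - bits).sum(axis=1)      # Hamming distances
    return int(dists.argmin())

noisy = codebook[2].astype(float)
noisy[0] = 1.0 - noisy[0]        # corrupt one predicted bit
word = decode(noisy, codebook)   # still recovers word 2
```

Larger codeword sizes buy more such slack, which is one way to read the finding that sufficiently long codes match the full softmax.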
References
 Bengio et al. (2015) Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.
 Bengio & Senécal (2008) Bengio, Y. and Senécal, J.-S. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.
 Bengio et al. (2003a) Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003a.
 Bengio et al. (2003b) Bengio, Y., Senécal, J.-S., et al. Quick training of probabilistic neural nets by importance sampling. In AISTATS, pp. 1–9, 2003b.
 Berger (1999) Berger, A. Error-correcting output coding for text classification. In IJCAI-99: Workshop on machine learning for information filtering, 1999.
 Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pp. 1019–1027, 2016.
 Goodman (2001) Goodman, J. T. A bit of progress in language modeling. Computer Speech & Language, 15(4):403–434, 2001.
 Goyal et al. (2017) Goyal, K., Dyer, C., and Berg-Kirkpatrick, T. Differentiable scheduled sampling for credit assignment. arXiv preprint arXiv:1704.06970, 2017.
 Grandvalet & Bengio (2005) Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536, 2005.
 Grave et al. (2016) Grave, E., Joulin, A., Cissé, M., Grangier, D., and Jégou, H. Efficient softmax approximation for GPUs. arXiv preprint arXiv:1609.04309, 2016.
 Gutmann & Hyvärinen (2010) Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.
 Hamming (1950) Hamming, R. W. Error detecting and error correcting codes. Bell System technical journal, 29(2):147–160, 1950.
 Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
 Jean et al. (2014) Jean, S., Cho, K., Memisevic, R., and Bengio, Y. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.
 Kong & Dietterich (1995) Kong, E. B. and Dietterich, T. G. Error-correcting output coding corrects bias and variance. In Machine Learning Proceedings 1995, pp. 313–321. Elsevier, 1995.
 Maddison et al. (2016) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Mikolov et al. (2010) Mikolov, T., Karafiát, M., Burget, L., Černockỳ, J., and Khudanpur, S. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
 Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
 Mnih & Teh (2012) Mnih, A. and Teh, Y. W. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
 Morin & Bengio (2005) Morin, F. and Bengio, Y. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pp. 246–252, 2005.
 Sejnowski & Rosenberg (1987) Sejnowski, T. J. and Rosenberg, C. R. Parallel networks that learn to pronounce English text. Complex systems, 1(1):145–168, 1987.
 Shi & Yu (2018) Shi, K. and Yu, K. Structured word embedding for low memory neural network language model. Proc. Interspeech 2018, pp. 1254–1258, 2018.
 Shu & Nakayama (2017) Shu, R. and Nakayama, H. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068, 2017.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
 Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
 Yang et al. (2017) Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.
 Zhang et al. (2013) Zhang, L., Zhang, Y., Tang, J., Lu, K., and Tian, Q. Binary code ranking with weighted hamming distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1593, 2013.