1 Introduction
Due to the power law distribution of word frequencies, rare words are extremely common in any language (Zipf, 1935). Yet, the majority of language generation tasks—including machine translation (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015), summarization (Rush et al., 2015; See et al., 2017; Paulus et al., 2018), dialogue generation (Vinyals & Le, 2015), question answering (Yin et al., 2015), speech recognition (Graves et al., 2013; Xiong et al., 2017), and others—generate words by sampling from a multinomial distribution over a closed output vocabulary. This is done by computing scores for each candidate word and normalizing them to probabilities using a softmax layer.
Since softmax is computationally expensive, current systems limit their output vocabulary to a few tens of thousands of the most frequent words, sacrificing linguistic diversity by replacing the long tail of rare words with the unknown word token, unk. Unsurprisingly, at test time this leads to inferior performance when generating rare or out-of-vocabulary words. Despite the fixed output vocabulary, softmax is computationally the slowest layer. Moreover, its computation follows a large matrix multiplication to compute scores over the candidate words; this makes softmax expensive in terms of memory requirements and the number of parameters to learn (Mnih & Kavukcuoglu, 2013; Morin & Bengio, 2005; de Brébisson & Vincent, 2016). Several alternatives have been proposed for alleviating these problems, including sampling-based approximations of the softmax function (Bengio & Senecal, 2003; Mnih & Kavukcuoglu, 2013), approaches proposing a hierarchical structure of the softmax layer (Morin & Bengio, 2005; Chen et al., 2016), and changing the vocabulary to frequent subword units, thereby reducing the vocabulary size (Sennrich et al., 2016).
We propose a novel technique to generate low-dimensional continuous word representations, or word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017), instead of a probability distribution over the vocabulary at each output step. We train sequence-to-sequence models with continuous outputs by minimizing the distance between the output vector and the pretrained word embedding of the reference word. At test time, the model generates a vector and then searches for its nearest neighbor in the target embedding space to generate the corresponding word. This general architecture can in principle be used for any language generation (or any recurrent regression) task. In this work, we experiment with neural machine translation, implemented using recurrent sequence-to-sequence models (Sutskever et al., 2014) with attention (Bahdanau et al., 2014; Luong et al., 2015).
To the best of our knowledge, this is the first work that uses word embeddings, rather than the softmax layer, as outputs in language generation tasks. While this idea is simple and intuitive, in practice, it does not yield competitive performance with standard regression losses like $\ell_2$. This is because the $\ell_2$ loss implicitly assumes a Gaussian distribution of the output space, which is likely false for embeddings. In order to correctly predict the outputs corresponding to new inputs, we must model the correct probability distribution of the target vector conditioned on the input (Bishop, 1994). A major contribution of this work is a new loss function based on defining such a probability distribution over the word embedding space and minimizing its negative log likelihood (§3).
We evaluate our proposed model with the new loss function on the task of machine translation, including on datasets with huge vocabulary sizes, in two language pairs, and in two data domains (§4). In §5 we show that our models can be trained up to 2.5x faster than softmax-based models while performing on par with state-of-the-art systems in terms of generation quality. Error analysis (§6) reveals that the models with continuous outputs are better at correctly generating rare words and make errors that are close to the reference texts in the embedding space and are often semantically related to the reference translation.
2 Background
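Traditionally, all sequence-to-sequence language generation models use one-hot representations for each word in the output vocabulary $\mathcal{V}$. More formally, each word $w$ is represented as a unique vector $\mathbf{o}(w) \in \{0, 1\}^V$, where $V$ is the size of the output vocabulary and only one entry (corresponding to the word ID of $w$ in the vocabulary) in $\mathbf{o}(w)$ is $1$ and the rest are set to $0$. The models produce a distribution over the output vocabulary at every step $t$ using the softmax function:
$$p(w) = \frac{e^{s_w}}{\sum_{v \in \mathcal{V}} e^{s_v}} \tag{1}$$
where $s_w = \mathbf{W}_w \mathbf{h} + b_w$ is the score of the word $w$ given the hidden state $\mathbf{h}$ produced by the LSTM cell (Hochreiter & Schmidhuber, 1997) at time step $t$. $\mathbf{W} \in \mathbb{R}^{V \times H}$ (with rows $\mathbf{W}_w$) and $\mathbf{b} \in \mathbb{R}^{V}$ are trainable parameters. $H$ is the size of the hidden layer $\mathbf{h}$.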
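The cost of the normalization in equation 1 can be made concrete with a minimal sketch. It assumes the score has the usual affine form (one weight row and one bias per word); the function name `softmax_probabilities` and the toy numbers are illustrative and not taken from the paper.

```python
import math

def softmax_probabilities(hidden, weights, biases):
    """Return p(w) for every word w in the vocabulary.

    weights has one row per vocabulary word (V rows of length H),
    so both the scoring and the normalization touch all V words.
    """
    scores = [sum(w_i * h_i for w_i, h_i in zip(row, hidden)) + bias
              for row, bias in zip(weights, biases)]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)  # the normalizer: one term per vocabulary entry
    return [e / total for e in exps]

# Toy setting (made-up numbers): V = 4 words, H = 3 hidden units.
W = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.5, 0.5, 0.1], [-0.3, 0.2, 0.0]]
b = [0.0, 0.1, -0.1, 0.0]
h = [1.0, 0.5, -1.0]
probs = softmax_probabilities(h, W, b)
print(probs)
```

Both loops run over the full vocabulary, which is the source of the cost discussed in this section: the work per output step grows linearly with $V$.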
These parameters are trained by minimizing the negative log-likelihood (also known as cross-entropy) of this distribution by treating $\mathbf{o}(w)$ as the target distribution. The loss function is defined as follows:
$$\mathrm{NLLLoss} = -\log p(w)$$
This loss computation involves a normalization proportional to the size of the output vocabulary $V$. This becomes a bottleneck in natural language generation tasks where the vocabulary size is typically tens of thousands of words. We propose to address this bottleneck by representing words as continuous word vectors instead of one-hot representations and introducing a novel probabilistic loss to train these models as described in §3.^{2} Here, we briefly summarize prior work that aimed at alleviating the softmax bottleneck problem.
^{2} There is prior work on predicting word embeddings, but not in conditional language generation with seq2seq. Given a word embedding dictionary, Pinter et al. (2017) train a character-level neural net that learns to approximate the embeddings. It can then be applied to infer embeddings in the same space for words that were not available in the original set. These models were trained using the $\ell_2$ loss.
2.1 Softmax Alternatives
We briefly summarize existing modifications to the softmax layer, grouped by conceptually different approaches.
Sampling-Based Approximations
Sampling-based approaches completely do away with computing the normalization term of softmax by considering only a small subset of possible outputs. These include approximations like Importance Sampling (Bengio & Senecal, 2003), Noise Contrastive Estimation (Mnih & Kavukcuoglu, 2013), Negative Sampling (Mikolov et al., 2013), and Blackout (Ji et al., 2015). These alternatives significantly speed up training but degrade generation quality.
Structural Approximations
Morin & Bengio (2005) replace the flat softmax layer with a hierarchical layer in the form of a binary tree where words are at the leaves. This alleviates the problem of expensive normalization, but these gains are only obtained at training time. At test time, the hierarchical approximations lead to a drop in performance compared to softmax both in time efficiency and in accuracy. Chen et al. (2016) propose to divide the vocabulary into clusters based on their frequencies. Each word is produced by a different part of the hidden layer making the output embedding matrix much sparser. This leads to performance improvement both in training and decoding. However, it assigns fewer parameters to rare words which leads to inferior performance in predicting them (Ruder, 2017).
Self Normalization Approaches
Andreas et al. (2015) and Devlin et al. (2014) add terms to the training loss which make the normalization factor close to 1, obviating the need to explicitly normalize. The evaluation of certain words can be done much faster than in softmax-based models, which is extremely useful for tasks like language modeling. However, for generation tasks, it is necessary to ensure that the normalization factor is exactly 1, which might not always be the case, and thus explicit normalization might still be required.
Subword-Based Methods
Jozefowicz et al. (2016) introduce character-based methods to reduce vocabulary size. While character-based models lead to a significant decrease in vocabulary size, they often differentiate poorly between similarly spelled words with different meanings. Sennrich et al. (2016) find a middle ground between characters and words based on subword units obtained using Byte Pair Encoding (BPE). Despite its limitations (Oda et al., 2017), BPE achieves good performance while also making the model truly open vocabulary. BPE is the state-of-the-art approach currently used in machine translation. We thus use this as a baseline in our experiments.
3 Language Generation with Continuous Outputs
In our proposed model, each word type $w$ in the output vocabulary is represented by a continuous vector $\mathbf{e}(w) \in \mathbb{R}^m$ where $m \ll V$. This representation can be obtained by training a word embedding model on a large monolingual corpus (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017).
At each generation step, the decoder of our model produces a continuous vector $\hat{\mathbf{e}} \in \mathbb{R}^m$. The output word is then predicted by searching for the nearest neighbor of $\hat{\mathbf{e}}$ in the embedding space:
$$w_{\text{predicted}} = \arg\min_{w \in \mathcal{V}} d(\hat{\mathbf{e}}, \mathbf{e}(w))$$
where $\mathcal{V}$ is the output vocabulary and $d$ is a distance function. In other words, the embedding space could be considered to be quantized into $V$ components and the generated continuous vector is mapped to a word based on the quanta in which it lies (Gray, 1990). The mapped word is then passed to the next step of the decoder. While training this model, we know the target vector $\mathbf{e}(w)$, and minimize its distance from the output vector $\hat{\mathbf{e}}$. With this formulation, our model is directly trained to optimize towards the information encoded by the embeddings. For example, if the embeddings are primarily semantic, as in Mikolov et al. (2013) or Bojanowski et al. (2017), the model would tend to output words in a semantic space, that is, the produced words would either be correct or close synonyms (which we see in our analysis in §6); if we use syntactico-semantic embeddings (Levy & Goldberg, 2014; Ling et al., 2015), we might also be able to control for syntactic forms.
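The nearest-neighbor prediction step described above can be sketched as a brute-force search. This is only an illustration under stated assumptions: cosine distance is used as the distance function, and the three-word embedding table and the function names are made up for the example.

```python
import math

def cosine_distance(u, v):
    """1 - cos(u, v); small when the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_word(output_vec, embeddings, distance=cosine_distance):
    """Return the vocabulary word whose embedding is closest to output_vec."""
    return min(embeddings, key=lambda word: distance(output_vec, embeddings[word]))

# Hypothetical 2-dimensional embedding table (real embeddings are much larger).
toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
predicted = nearest_word([0.9, 0.2], toy_embeddings)
print(predicted)
```

The search is linear in the vocabulary size, so it does not by itself make test-time decoding cheaper than softmax; it only removes the normalization from training.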
We propose a novel probabilistic loss function, a probabilistic variant of cosine loss, which gives a theoretically grounded regression loss for sequence generation and addresses the limitations of existing empirical losses (described in §4.2). Cosine loss measures the closeness between vector directions. A natural choice for estimating directional distributions is von Mises-Fisher (vMF), defined over a hypersphere of unit norm. That is, a vector close to the mean direction will have high probability. vMF is considered the directional equivalent of the Gaussian distribution.^{3} Given a target word $w$, its density function is given as follows:
$$p(\mathbf{e}(w); \boldsymbol{\mu}, \kappa) = C_m(\kappa)\, e^{\kappa \boldsymbol{\mu}^{\top} \mathbf{e}(w)}$$
where $\boldsymbol{\mu}$ and $\mathbf{e}(w)$ are vectors of dimension $m$ with unit norm, and $\kappa$ is a positive scalar, also called the concentration parameter. $\kappa = 0$ defines a uniform distribution over the hypersphere and $\kappa = \infty$ defines a point distribution at $\boldsymbol{\mu}$. $C_m(\kappa)$ is the normalization term:
$$C_m(\kappa) = \frac{\kappa^{m/2 - 1}}{(2\pi)^{m/2}\, I_{m/2 - 1}(\kappa)}$$
where $I_v(z)$ is called the modified Bessel function of the first kind of order $v$. The output of the model at each step is a vector $\hat{\mathbf{e}}$ of dimension $m$. We use $\kappa = \|\hat{\mathbf{e}}\|$. Thus the density function becomes:
$$p(\mathbf{e}(w); \hat{\mathbf{e}}) = C_m(\|\hat{\mathbf{e}}\|)\, e^{\hat{\mathbf{e}}^{\top} \mathbf{e}(w)} \tag{2}$$
It is noteworthy that equation 2 is very similar to the softmax computation (except that $\mathbf{e}(w)$ is a unit vector), the main difference being that normalization is not done by summing over the vocabulary, which makes it much faster than the softmax computation. More details about its computation are given in the appendix.
^{3} A natural choice for many regression tasks would be to use a loss function based on the Gaussian distribution itself, which is a probabilistic version of the $\ell_2$ loss. But as we describe in §4.2, $\ell_2$ is not considered a suitable loss for regression on embedding spaces.
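As a rough illustration of why the vMF normalizer does not depend on the vocabulary, the sketch below evaluates the log of the standard vMF normalization term using only the vector dimension and the concentration. It is not the computation from the paper's appendix: it assumes the textbook power series for the modified Bessel function, summed in log space, and the truncation at 500 terms is an arbitrary choice that is only adequate for moderate concentrations.

```python
import math

def log_bessel_i(order, x, terms=500):
    """log I_order(x) from the power series, accumulated in log space."""
    log_half_x = math.log(x / 2.0)
    log_terms = [(2 * k + order) * log_half_x
                 - math.lgamma(k + 1) - math.lgamma(k + order + 1)
                 for k in range(terms)]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def log_vmf_normalizer(dim, kappa):
    """log C_m(kappa) for the vMF distribution on the unit sphere in R^dim."""
    order = dim / 2.0 - 1.0
    return (order * math.log(kappa)
            - (dim / 2.0) * math.log(2.0 * math.pi)
            - log_bessel_i(order, kappa))

def log_vmf_density(output_vec, target_unit_vec):
    """log density of a unit target vector, with kappa set to the output norm."""
    kappa = math.sqrt(sum(a * a for a in output_vec))
    dot = sum(a * b for a, b in zip(output_vec, target_unit_vec))
    return log_vmf_normalizer(len(output_vec), kappa) + dot

# Sanity check against the closed form for dim = 3: C_3(k) = k / (4 pi sinh k).
closed_form = math.log(2.0 / (4.0 * math.pi * math.sinh(2.0)))
print(log_vmf_normalizer(3, 2.0), closed_form)
```

Nothing in `log_vmf_density` iterates over words: the cost depends on the embedding dimension and on the Bessel evaluation, not on $V$.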
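The loss is the negative log-likelihood of the vMF distribution, which at each output step is given by:
$$\mathrm{NLLvMF}(\hat{\mathbf{e}}; \mathbf{e}(w)) = -\log C_m(\|\hat{\mathbf{e}}\|) - \hat{\mathbf{e}}^{\top} \mathbf{e}(w)$$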
Regularization of NLLvMF
In practice, we observe that the NLLvMF loss puts too much weight on increasing $\|\hat{\mathbf{e}}\|$, making the second term in the loss function decrease rapidly without a significant decrease in the cosine distance. To account for this, we add a regularization term. We experiment with two variants of regularization.
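$\mathrm{NLLvMF}_{\text{reg1}}$: We add $\lambda_1 \|\hat{\mathbf{e}}\|$ to the loss function, where $\lambda_1$ is a scalar hyperparameter.^{4} This makes intuitive sense in that the length of the output vector should not increase too much. The regularized loss function is as follows:
$$\mathrm{NLLvMF}_{\text{reg1}}(\hat{\mathbf{e}}) = -\log C_m(\|\hat{\mathbf{e}}\|) - \hat{\mathbf{e}}^{\top} \mathbf{e}(w) + \lambda_1 \|\hat{\mathbf{e}}\|$$
$\mathrm{NLLvMF}_{\text{reg2}}$: We modify the previous loss function as follows:
$$\mathrm{NLLvMF}_{\text{reg2}}(\hat{\mathbf{e}}) = -\log C_m(\|\hat{\mathbf{e}}\|) - \lambda_2\, \hat{\mathbf{e}}^{\top} \mathbf{e}(w) \tag{3}$$
$-\log C_m(\|\hat{\mathbf{e}}\|)$ decreases slowly as $\|\hat{\mathbf{e}}\|$ increases, compared to the second term. Adding a $\lambda_2 < 1$ to the second term controls how fast it can decrease.^{5}
^{4} We set $\lambda_1$ empirically and use the same value in all our experiments.
^{5} We use the same value of $\lambda_2$ in all our experiments.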
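The regularized losses can be sketched as one small function, under two assumptions that are ours rather than the paper's: the normalizer is the standard vMF one evaluated with a truncated Bessel series, and the two regularizers are exposed as optional arguments (`lambda1` adds a penalty on the output norm, `lambda2` rescales the dot-product term). The values used in the example are arbitrary placeholders, not the settings used in the experiments.

```python
import math

def log_vmf_normalizer(dim, kappa, terms=500):
    """log C_m(kappa), with log I_{m/2-1}(kappa) from a truncated power series."""
    order = dim / 2.0 - 1.0
    log_half = math.log(kappa / 2.0)
    log_terms = [(2 * k + order) * log_half
                 - math.lgamma(k + 1) - math.lgamma(k + order + 1)
                 for k in range(terms)]
    peak = max(log_terms)
    log_bessel = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return order * math.log(kappa) - (dim / 2.0) * math.log(2.0 * math.pi) - log_bessel

def nll_vmf(output_vec, target_unit_vec, lambda1=0.0, lambda2=1.0):
    """Negative log-likelihood of the target under vMF with kappa = output norm.

    lambda1 > 0 adds the norm penalty (first variant);
    lambda2 < 1 down-weights the dot-product term (second variant).
    """
    norm = math.sqrt(sum(a * a for a in output_vec))
    dot = sum(a * b for a, b in zip(output_vec, target_unit_vec))
    return -log_vmf_normalizer(len(output_vec), norm) - lambda2 * dot + lambda1 * norm

e_hat = [3.0, 4.0, 0.0]    # output vector with norm 5
target = [0.6, 0.8, 0.0]   # unit-norm target embedding
plain = nll_vmf(e_hat, target)
reg1 = nll_vmf(e_hat, target, lambda1=0.1)   # placeholder value
reg2 = nll_vmf(e_hat, target, lambda2=0.5)   # placeholder value
print(plain, reg1, reg2)
```

With the defaults the function reduces to the unregularized loss, which makes it easy to compare the three settings on the same output vector.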
4 Experiments
4.1 Experimental Setups
We modify the standard seq2seq models in OpenNMT^{6} in PyTorch^{7} (Klein et al., 2017) to implement the architecture described in §3. This model has a bidirectional LSTM encoder with an attention-based decoder (Luong et al., 2015). The encoder has one layer whereas the decoder has 2 layers of size 1024 with the input word embedding size of 512. For the baseline systems, the output at each decoder step is multiplied by a weight matrix ($H \times V$) followed by softmax. This model is trained until convergence on the validation perplexity. For our proposed models, we replace the softmax layer with the continuous output layer ($H \times m$) where the outputs are $m$-dimensional. We empirically choose $m = 300$ for all our experiments. Additional hyperparameter settings can be found in the appendix. These models are trained until convergence on the validation loss. Out-of-vocabulary words are mapped to an unk token.^{8} We assign unk an embedding equal to the average of embeddings of all the words which are not present in the target vocabulary of the training set but are present in the vocabulary on which the word embeddings are trained. Following Denkowski & Neubig (2017), after decoding, a post-processing step replaces the unk token using a dictionary lookup of the source word with the highest attention score. If the word does not exist in the dictionary, we back off to copying the source word itself. Bilingual dictionaries are automatically extracted from our parallel training corpus using word alignment (Dyer et al., 2013).^{9}
^{6} http://opennmt.net/
^{7} https://pytorch.org/
^{8} Although the proposed model can make decoding open vocabulary, there could still be unknown words, e.g., words for which we do not have pretrained embeddings; we need the unk token to represent these words.
^{9} https://github.com/clab/fast_align
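The unk post-processing step described above can be sketched as follows. The function name `replace_unk`, the layout of the attention matrix (one row per target step, one column per source position), and the toy sentence are assumptions made for illustration; they do not reflect the OpenNMT implementation.

```python
def replace_unk(target_tokens, source_tokens, attention, dictionary, unk="unk"):
    """Replace each unk in the output using the most-attended source word.

    attention[t][s] is the attention weight of target step t on source
    position s. If the source word has a dictionary entry, use it;
    otherwise back off to copying the source word itself.
    """
    output = []
    for step, token in enumerate(target_tokens):
        if token == unk:
            weights = attention[step]
            best_source = max(range(len(source_tokens)), key=lambda s: weights[s])
            source_word = source_tokens[best_source]
            token = dictionary.get(source_word, source_word)
        output.append(token)
    return output

# Toy example with a made-up dictionary and attention weights.
source = ["le", "chat", "Zorglub"]
decoded = ["the", "unk", "unk"]
attention = [[0.80, 0.10, 0.10],
             [0.10, 0.70, 0.20],
             [0.05, 0.15, 0.80]]
bilingual_dictionary = {"le": "the", "chat": "cat"}
fixed = replace_unk(decoded, source, attention, bilingual_dictionary)
print(fixed)
```

The second unk has no dictionary entry for its most-attended source word, so the source word is copied through unchanged, which is the back-off behavior described in the text.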
We evaluate all the models on the test data using the BLEU score (Papineni et al., 2002).
We evaluate our systems on standard machine translation datasets from IWSLT’16 (Cettolo et al., 2016), on two target languages: English (German→English, French→English) and a morphologically richer language, French (English→French). The training sets for each of the language pairs contain around 220,000 parallel sentences. We use TED Test 2013+2014 (2,300 sentence pairs) as development sets and TED Test 2015+2016 (2,200 sentence pairs) as test sets for all the language pairs. All mentioned setups have a total vocabulary size of around 55,000 in the target language, of which we choose the top 50,000 words by frequency as the target vocabulary.^{10}
^{10} Removing the bottom 5,000 words did not make a significant difference in terms of translation quality.
We also experiment with a much larger WMT’16 German→English (Bojar et al., 2016) task whose training set contains around 4.5M sentence pairs with the target vocabulary size of around 800,000. We use newstest2015 and newstest2016 as development and test data respectively. Since with continuous outputs we do not need to perform a time-consuming softmax computation, we can train the proposed model with a very large target vocabulary without any change in training time per batch. We perform this experiment with the WMT’16 de–en dataset with a target vocabulary size of 300,000 (essentially all the words in the target vocabulary for which we had trained embeddings). But to be able to produce these words, the source vocabulary also needs to be increased to have their translations in the inputs, which would lead to a huge increase in the number of trainable parameters. Instead, we use subwords computed using BPE as the source vocabulary. We use 100,000 merge operations to compute the source vocabulary, as we observe that using a smaller number leads to subword units that are too small (and less meaningful) and difficult to align with target words.
The two datasets come from vastly different domains: IWSLT’16 contains less formal spoken language, whereas WMT’16 contains data primarily from news.
We train target word embeddings for English and French on corpora constructed using WMT’16 (Bojar et al., 2016) monolingual datasets containing data from Europarl, News Commentary, News Crawl from 2007 to 2015, and News Discussion (everything except Common Crawl, due to its large memory requirements). These corpora consist of 4B+ tokens for English and 2B+ tokens for French. We experiment with two embedding models, word2vec (Mikolov et al., 2013) and fasttext (Bojanowski et al., 2017), which were trained using the hyperparameters recommended by the authors.
4.2 Empirical Loss Functions
We compare our proposed loss function with standard loss functions used in multivariate regression.
Squared Error ($\ell_2$) is the most common distance function used when the model outputs are continuous (Lehmann & Casella, 1998). For each target word $w$, it is given as
$$\mathcal{L}_{\ell_2} = \|\hat{\mathbf{e}} - \mathbf{e}(w)\|^2$$
$\ell_2$ penalizes large errors more strongly and therefore is sensitive to outliers. To avoid this we use a square-rooted version of the $\ell_2$ loss. But it has been argued that there is a mismatch between the objective function used to learn word representations (maximum likelihood based on inner product), the distance measure for word vectors (cosine similarity), and the $\ell_2$ distance as the objective function to learn transformations of word vectors (Xing et al., 2015). This argument prompts us to look at cosine loss.
Cosine Loss is given as $\mathcal{L}_{\text{cosine}} = 1 - \frac{\hat{\mathbf{e}}^{\top} \mathbf{e}(w)}{\|\hat{\mathbf{e}}\|\, \|\mathbf{e}(w)\|}$. This loss minimizes the distance between the directions of output and target vectors while disregarding their magnitudes. The target embedding space in this case becomes a set of points on a hypersphere of dimension $m$ with unit radius.
Max Margin Loss Lazaridou et al. (2015) argue that using pairwise losses like $\ell_2$ or cosine distance for learning vectors in high-dimensional spaces leads to hubness: word vectors of a subset of words appear as nearest neighbors of many points in the output vector space. To alleviate this, we experiment with a margin-based ranking loss (which has been shown to reduce hubness) to train the model to rank the word vector prediction $\hat{\mathbf{e}}$ for target vector $\mathbf{e}(w)$ higher than any other word vector $\mathbf{e}(w')$ in the embedding space:
$$\mathcal{L}_{\text{maxmargin}} = \sum_{w' \in \mathcal{V},\, w' \neq w} \max\{0,\ \gamma + \cos(\hat{\mathbf{e}}, \mathbf{e}(w')) - \cos(\hat{\mathbf{e}}, \mathbf{e}(w))\}$$
where $\gamma$ is a hyperparameter^{11} representing the margin and $w'$ denotes negative examples. We use only one informative negative example as described in Lazaridou et al. (2015), which is closest to $\hat{\mathbf{e}}$ and farthest from the target word vector $\mathbf{e}(w)$. But searching for this negative example requires iterating over the vocabulary, which brings back the problem of slow loss computation.
^{11} We use a fixed value of $\gamma$ in our experiments.
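The three empirical losses can be compared side by side in a short sketch. Two simplifications are assumptions of this example and not the paper's procedure: the max-margin negative is chosen simply as the non-target word most similar to the output vector, and the margin of 0.5 is a placeholder value. The toy embedding table is also made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def l2_loss(output_vec, target_vec):
    """Squared Euclidean distance between output and target."""
    return sum((a - b) ** 2 for a, b in zip(output_vec, target_vec))

def cosine_loss(output_vec, target_vec):
    """1 - cosine similarity; ignores the magnitudes of both vectors."""
    return 1.0 - cosine(output_vec, target_vec)

def max_margin_loss(output_vec, target_word, embeddings, margin):
    """Hinge loss against a single negative example.

    Simplified negative selection (assumption): the non-target word whose
    embedding is most similar to the output. Finding it scans the vocabulary.
    """
    negative = max((w for w in embeddings if w != target_word),
                   key=lambda w: cosine(output_vec, embeddings[w]))
    return max(0.0, margin
               + cosine(output_vec, embeddings[negative])
               - cosine(output_vec, embeddings[target_word]))

toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
output = [0.9, 0.2]
print(l2_loss(output, toy_embeddings["cat"]))
print(cosine_loss(output, toy_embeddings["cat"]))
print(max_margin_loss(output, "cat", toy_embeddings, margin=0.5))
```

Only `max_margin_loss` needs the whole embedding table at training time, which is the slowdown noted at the end of the paragraph above.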
4.3 Decoding
In the case of empirical losses, we output the word whose target embedding is the nearest neighbor to the output vector in terms of the distance (loss) defined. In the case of NLLvMF, we predict the word whose target embedding has the highest value of vMF probability density with respect to the output vector. This predicted word is fed as the input for the next time step. Our nearest-neighbor decoding scheme is equivalent to greedy decoding; we thus compare to baseline models with a beam size of 1.
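A short worked step may help connect the two decoding rules. Assuming the vMF density has the form of a normalizer that depends only on the norm of the output vector times an exponential of the dot product with the target embedding, and assuming all target embeddings are unit-normalized, the normalizer is constant across candidate words, so decoding with NLLvMF reduces to a maximum dot-product (equivalently, minimum cosine distance) search:

```latex
\begin{aligned}
\arg\max_{w \in \mathcal{V}} \; p(\mathbf{e}(w); \hat{\mathbf{e}})
  &= \arg\max_{w \in \mathcal{V}} \; C_m(\|\hat{\mathbf{e}}\|)\, e^{\hat{\mathbf{e}}^{\top} \mathbf{e}(w)} \\
  &= \arg\max_{w \in \mathcal{V}} \; \hat{\mathbf{e}}^{\top} \mathbf{e}(w)
     && \text{($C_m(\|\hat{\mathbf{e}}\|) > 0$ does not depend on $w$)} \\
  &= \arg\min_{w \in \mathcal{V}} \; \left(1 - \frac{\hat{\mathbf{e}}^{\top} \mathbf{e}(w)}{\|\hat{\mathbf{e}}\|\,\|\mathbf{e}(w)\|}\right)
     && \text{(since $\|\mathbf{e}(w)\| = 1$ and $\|\hat{\mathbf{e}}\| > 0$)}
\end{aligned}
```

Under these assumptions the Bessel function never needs to be evaluated at test time.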
4.4 Tying the target embeddings
Until now we discussed the embeddings in the output layer. Additionally, the decoder in a sequence-to-sequence model has an input embedding matrix, as the previous output word is fed as an input to the decoder. Much of the size of the trainable parameters in all the models is occupied by these input embedding weights. We experiment with keeping this embedding layer fixed and tied with pretrained target output embeddings (Press & Wolf, 2016). This leads to a significant reduction in the number of parameters in our model.
5 Results



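Table 1: BLEU scores on the IWSLT’16 test sets.
Embeddings | Tied | Input→Output | Loss | BLEU fr–en | BLEU de–en | BLEU en–fr
- | no | word→word | CE | 31.0 | 24.7 | 29.3
- | no | word→BPE | CE | 29.1 | 24.1 | 29.8
- | no | BPE→BPE | CE | 31.4 | 25.8 | 31.0
word2vec | no | word→emb | L2 | 27.2 | 19.4 | 26.4
word2vec | no | word→emb | Cosine | 29.1 | 21.9 | 26.6
word2vec | no | word→emb | MaxMargin | 29.6 | 21.4 | 26.7
fasttext | no | word→emb | MaxMargin | 31.0 | 25.0 | 29.0
fasttext | yes | word→emb | MaxMargin | 32.1 | 25.0 | 31.0
word2vec | no | word→emb | NLLvMF | 29.5 | 22.7 | 26.6
word2vec | no | word→emb | NLLvMF | 29.7 | 21.6 | 26.7
word2vec | yes | word→emb | NLLvMF | 29.7 | 22.2 | 27.5
fasttext | no | word→emb | NLLvMF | 30.4 | 23.4 | 27.6
fasttext | yes | word→emb | NLLvMF | 32.1 | 25.1 | 31.7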
Translation Quality
Table 1 shows the BLEU scores on the test sets for several baseline systems and various configurations, including the types of losses, types of inputs/outputs used (word, BPE, or embedding),^{12} and whether the model used tied embeddings in the decoder or not. $\ell_2$ loss attains the lowest BLEU scores among the proposed models; our manual error analysis reveals that the high error rate is due to the hubness phenomenon, as we described in §4.2. The BLEU scores improve for cosine loss, confirming the argument of Xing et al. (2015) that cosine distance is a better suited similarity (or distance) function for word embeddings. The best results, for MaxMargin and NLLvMF losses, surpass the strong BPE baseline in translation French→English and English→French, and attain slightly lower but competitive results on German→English.
^{12} Note that we do not experiment with subword embeddings, since the number of merge operations for BPE usually depends on the choice of a language pair, which would require the embeddings to be retrained for every language pair.
Since we represent each target word by its embedding, the quality of embeddings should have an impact on the translation quality. We measure this by training our best model with fasttext embeddings (Bojanowski et al., 2017), which leads to a BLEU improvement. Tied embeddings are the most effective setups: they not only achieve the highest translation quality, but also dramatically reduce parameter requirements and speed up convergence.
Table 2 shows results on the WMT’16 test set in terms of BLEU and METEOR (Denkowski & Lavie, 2014) for models trained only in the best-performing setups from table 1. METEOR uses paraphrase tables and WordNet synonyms for common words. This may explain why METEOR scores, unlike BLEU, close the gap with the baseline models: as we found in the qualitative analysis of outputs, our models often output synonyms of the reference words, which are plausible translations but are penalized by BLEU.^{13} Examples are included in the Appendix.
^{13} In IWSLT’16 datasets we obtain similar performances in BLEU and METEOR; this is likely because those models perform better particularly in translating rare words (§6), which are not covered in METEOR resources.
Table 2: BLEU and METEOR on the WMT’16 de–en test set.
Loss | BLEU | METEOR
CE | 22.9 | 23.9
CE (BPE) | 30.1 | 28.7
MaxMargin | 24.3 | 25.2
NLLvMF | 28.8 | 28.2
Training Time
Table 4 shows the average training time per batch. In figure 1 (left), we show how many samples per second our proposed model can process at training time compared to the baseline. As we increase the batch size, the gap between the baseline and the proposed models increases. Our proposed models can process large minibatches while still training much faster than the baseline models. The largest minibatch size with which we can train our model is 512, compared to 184 in the baseline model. Using max-margin loss leads to a slight increase in the training time compared to NLLvMF. This is because its computation needs a negative example, which requires iterating over the entire vocabulary. Since our model requires a lookup of nearest neighbors in the target embedding table at test time, decoding currently takes about as long as in softmax-based models. In future work, approximate nearest neighbor algorithms (Johnson et al., 2017) can be used to improve translation time.
We also compare the speed of convergence, using BLEU scores on dev data. In figure 1 (right), we plot the BLEU scores against the number of epochs. Our model converges much faster than the baseline models, leading to an even larger improvement in overall training time (similar figures for more datasets can be found in the appendix). As a result, as shown in table 3, the total training time of our proposed model (until convergence) is up to 2.5x shorter than the total training time of the baseline models.





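Table 3: Total training time until convergence.
Dataset | Softmax | BPE | Emb
fr–en | 4h | 4.5h | 1.9h
de–en | 3h | 3.5h | 1.5h
en–fr | 1.8h | 2.8h | 1.3h
WMT de–en | 4.3d | 4.5d | 1.6d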
Memory Requirements
As shown in Table 4, our best performing model requires less than 1% of the number of parameters in input and output layers compared to the word-level softmax baseline, and roughly 2% compared to the BPE-based baseline.

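Table 4: Number of parameters in the input and output layers and average training time per batch (ratios relative to the word-level softmax baseline).
Output type | Tied | Loss | Input-layer parameters | Output-layer parameters | Avg. training time per batch
word | No | CE | 25.6M (1.0x) | 51.2M (1.0x) | 400 (1.0x)
BPE | No | CE | 8.192M (0.32x) | 16.384M (0.32x) | 346 (0.86x)
emb | No | L2 | 25.6M (1.0x) | 307.2K (0.006x) | 160 (0.4x)
emb | No | Cosine | 25.6M (1.0x) | 307.2K (0.006x) | 160 (0.4x)
emb | No | MaxMargin | 25.6M (1.0x) | 307.2K (0.006x) | 178 (0.43x)
emb | Yes | MaxMargin | 153.6K (0.006x) | 307.2K (0.006x) | 178 (0.43x)
emb | No | NLLvMF | 25.6M (1.0x) | 307.2K (0.006x) | 170 (0.42x)
emb | Yes | NLLvMF | 153.6K (0.006x) | 307.2K (0.006x) | 170 (0.42x)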
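The parameter counts in Table 4 can be reproduced with simple arithmetic. The sketch below assumes each layer is a single dense matrix without biases; the word vocabulary (50,000), input embedding size (512), and hidden size (1024) come from §4.1, while the BPE vocabulary size (16,000) and the output embedding dimension (300) are inferred here from the table entries and should be read as assumptions.

```python
def matrix_parameters(rows, cols):
    """Number of weights in a dense rows x cols matrix (biases ignored)."""
    return rows * cols

WORD_VOCAB = 50_000   # target vocabulary size from section 4.1
BPE_VOCAB = 16_000    # assumption: inferred from 8.192M / 512
INPUT_DIM = 512       # decoder input embedding size
HIDDEN_DIM = 1024     # decoder hidden size
EMB_DIM = 300         # assumption: inferred from 307.2K / 1024

counts = {
    "word input": matrix_parameters(WORD_VOCAB, INPUT_DIM),
    "word output": matrix_parameters(WORD_VOCAB, HIDDEN_DIM),
    "BPE input": matrix_parameters(BPE_VOCAB, INPUT_DIM),
    "BPE output": matrix_parameters(BPE_VOCAB, HIDDEN_DIM),
    "emb input (tied)": matrix_parameters(EMB_DIM, INPUT_DIM),
    "emb output": matrix_parameters(HIDDEN_DIM, EMB_DIM),
}

for name, n in counts.items():
    ratio = n / counts["word output" if "output" in name else "word input"]
    print(f"{name}: {n:,} ({ratio:.3f}x of the word-level layer)")
```

Under these assumptions every entry in the two parameter columns of the table is recovered exactly, including the 0.006x ratios of the continuous-output rows.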
6 Error Analysis
Translation of Rare Words
We evaluate the translation accuracy of words in the test set based on their frequency in the training corpus. Table 5 shows how the $F_1$ score varies with the word frequency. The $F_1$ score gives a balance between recall (the fraction of words in the reference that the predicted sentence produces right) and precision (the fraction of produced words that are in the reference). We show substantial improvements over softmax and BPE baselines in translating less frequent and rare words, which we hypothesize is due to having learned good embeddings of such words from the monolingual target corpus, where these words are not as rare. Moreover, in BPE-based models, rare words on the source side are split into smaller units, which are in some cases not properly translated into subword units on the target side if transparent alignments don’t exist. For example, the word saboter in French is translated to sab+ot+tate by the BPE model, whereas it is correctly translated as sabotage by our model. Also, the rare word retraite in French is translated to pension by both the Softmax and BPE models (pension is a related word but less rare in the corpus) instead of the expected translation retirement, which our model gets right.


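Table 5: $F_1$ score of translated words by their frequency in the training corpus.
Word frequency | Softmax | BPE | Emb (variant 1) | Emb (variant 2)
1 | 0.42 | 0.50 | 0.30 | 0.52
2 | 0.16 | 0.26 | 0.25 | 0.31
3 | 0.14 | 0.22 | 0.25 | 0.33
4 | 0.29 | 0.24 | 0.30 | 0.33
5–9 | 0.28 | 0.33 | 0.38 | 0.37
10–99 | 0.54 | 0.53 | 0.53 | 0.55
100–999 | 0.60 | 0.61 | 0.60 | 0.60
1000+ | 0.69 | 0.70 | 0.69 | 0.69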
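A frequency-bucketed word-level score of the kind reported in Table 5 could be computed along the following lines. The paper does not give its evaluation script, so this is only one plausible reading: matches are counted per sentence with clipped word counts, and words are assigned to buckets by their training-corpus frequency. The function name, the bucket boundaries, and the toy data are all hypothetical.

```python
from collections import Counter

def bucket_f1(references, hypotheses, train_freq, buckets):
    """F1 per frequency bucket.

    buckets is a list of (low, high) inclusive ranges; high=None is open-ended.
    A word counts as matched up to the number of times it occurs in both
    the reference and the hypothesis of the same sentence.
    """
    stats = {bucket: [0, 0, 0] for bucket in buckets}  # matched, ref, hyp

    def bucket_of(word):
        freq = train_freq.get(word, 0)
        for low, high in buckets:
            if freq >= low and (high is None or freq <= high):
                return (low, high)
        return None

    for ref, hyp in zip(references, hypotheses):
        ref_counts, hyp_counts = Counter(ref), Counter(hyp)
        for word, n in ref_counts.items():
            bucket = bucket_of(word)
            if bucket is not None:
                stats[bucket][0] += min(n, hyp_counts[word])
                stats[bucket][1] += n
        for word, n in hyp_counts.items():
            bucket = bucket_of(word)
            if bucket is not None:
                stats[bucket][2] += n

    scores = {}
    for bucket, (matched, ref_total, hyp_total) in stats.items():
        precision = matched / hyp_total if hyp_total else 0.0
        recall = matched / ref_total if ref_total else 0.0
        denom = precision + recall
        scores[bucket] = 2 * precision * recall / denom if denom else 0.0
    return scores

# Toy data: "retirement" is rare in training, "pension" is more frequent.
train_freq = {"the": 5000, "retirement": 2, "pension": 40}
buckets = [(1, 4), (5, 99), (1000, None)]
scores = bucket_f1([["the", "retirement"]], [["the", "pension"]], train_freq, buckets)
print(scores)
```

In the toy example the hypothesis substitutes a more frequent related word for the rare reference word, so the rare-word bucket scores zero while the frequent-word bucket is unaffected.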
We conducted a thorough analysis of outputs across our experimental setups. A few examples are shown in the appendix. Interestingly, there are many examples where our models do not exactly match the reference translations (so they gain nothing in terms of BLEU scores) but produce meaningful translations. This is likely because the model produces words that lie near the target words in the embedding space, or paraphrases, instead of the target word itself (these are often synonyms).
Since we are predicting embeddings instead of actual words, the model sometimes fails to follow a good language model and produces ungrammatical outputs in cases where the baseline model would perform well. Integrating a pretrained language model within the decoding framework is one potential avenue for our future work. Another reason for this type of error could be our choice of target embeddings, which are not modeled to (explicitly) capture syntactic relationships. Using syntactically inspired embeddings (Levy & Goldberg, 2014; Ling et al., 2015) might help reduce these errors. However, such fluency errors are not uncommon in softmax and BPE-based models either.
7 Conclusion
This work makes several contributions. We introduce a novel framework of sequence to sequence learning for language generation using word embeddings as outputs. We propose new probabilistic loss functions based on vMF distribution for learning in this framework. We then show that the proposed model trained on the task of machine translation leads to reduction in trainable parameters, to faster convergence, and a dramatic speedup, up to 2.5x in training time over standard benchmarks. Table 6 visualizes a comparison between different types of softmax approximations and our proposed method.
State-of-the-art results in softmax-based models are highly optimized after a few years of research in neural machine translation. The results that we report are comparable to or slightly lower than the strongest baselines, but these setups are only an initial investigation of translation with the continuous output layer. There are numerous possible directions to explore and improve the proposed setups. What are additional loss functions? How to set up beam search? Should we use scheduled sampling? What types of embeddings to use? How to translate with the embedding output into morphologically rich languages? Can low-resource neural machine translation benefit from translation with continuous outputs if large monolingual corpora are available to pretrain strong target-side embeddings? We will explore these questions in future work.
Furthermore, the proposed architecture and the probabilistic loss (NLLvMF) have the potential to benefit other applications which have sequences as outputs, e.g. speech recognition. NLLvMF could be used as an objective function for problems which currently use cosine or $\ell_2$ distance, such as learning multilingual word embeddings. Since the outputs of our models are continuous (rather than class-based discrete symbols), these models can potentially simplify training of generative adversarial networks for language generation.



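Table 6: Comparison of different types of softmax approximations with the proposed method (Emb w/ NLLvMF). Surviving row labels: Accuracy, Parameters.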

References
 Andreas et al. (2015) Jacob Andreas, Maxim Rabinovich, Michael I. Jordan, and Dan Klein. On the accuracy of self-normalized log-linear models. In Proc. NIPS, 2015.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proc. ICLR, 2014.
 Bengio & Senecal (2003) Yoshua Bengio and Jean-Sébastien Senecal. Quick training of probabilistic neural nets by importance sampling. In Proc. AISTATS, 2003.
 Bishop (1994) Christopher M. Bishop. Mixture density networks. Technical report, 1994.
 Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 2017.
 Bojar et al. (2016) Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. Findings of the 2016 conference on machine translation. In Proc. WMT, pp. 131–198, 2016.
 Cettolo et al. (2016) M Cettolo, J Niehues, S Stüker, L Bentivogli, R Cattoni, and M Federico. The IWSLT 2016 evaluation campaign. In Proc. IWSLT, 2016.
 Chen et al. (2016) Wenlin Chen, David Grangier, and Michael Auli. Strategies for training large vocabulary neural language models. In Proc. ACL, 2016.
 de Brébisson & Vincent (2016) Alexandre de Brébisson and Pascal Vincent. An exploration of softmax alternatives belonging to the spherical loss family. In Proc. ICLR, 2016.
 Denkowski & Lavie (2014) Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proc. EACL 2014 Workshop on Statistical Machine Translation, 2014.
 Denkowski & Neubig (2017) Michael J. Denkowski and Graham Neubig. Stronger baselines for trustable results in neural machine translation. CoRR, abs/1706.09733, 2017.
 Devlin et al. (2014) Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation. In Proc. ACL, 2014.
 Dyer et al. (2013) Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In Proc. NAACL, 2013.

 Graves et al. (2013) Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Proc. ICASSP, pp. 6645–6649, 2013.
 Gray (1990) Robert M. Gray. Vector quantization. In Readings in Speech Recognition, 1990.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
 Ji et al. (2015) Shihao Ji, S. V. N. Vishwanathan, Nadathur Satish, Michael J. Anderson, and Pradeep Dubey. Blackout: Speeding up recurrent neural network language models with very large vocabularies. CoRR, 2015. URL http://arxiv.org/abs/1511.06909.
 Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.
 Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling, 2016. URL https://arxiv.org/pdf/1602.02410.pdf.
 Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL, 2017.
 Lazaridou et al. (2015) Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proc. ACL, 2015.
 Lehmann & Casella (1998) E.L. Lehmann and G. Casella. Theory of Point Estimation. Springer Verlag, 1998. ISBN 0387985026.
 Levy & Goldberg (2014) Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In Proc. ACL. The Association for Computer Linguistics, 2014.
 Ling et al. (2015) Wang Ling, Chris Dyer, Alan Black, and Isabel Trancoso. Two/too simple adaptations of word2vec for syntax problems. In Proc. NAACL-HLT, 2015.
 Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proc. EMNLP, 2015.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Proc. NIPS, 2013.

 Mnih & Kavukcuoglu (2013) Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Proc. NIPS, pp. 2265–2273, 2013.
 Morin & Bengio (2005) Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani (eds.), Proc. AISTATS, pp. 246–252, 2005.
 Oda et al. (2017) Yusuke Oda, Philip Arthur, Graham Neubig, Koichiro Yoshino, and Satoshi Nakamura. Neural machine translation via binary code prediction. In Proc. ACL, 2017.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL, pp. 311–318, 2002.
 Paulus et al. (2018) Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. In Proc. ICLR, 2018.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proc. EMNLP, 2014.
 Pinter et al. (2017) Yuval Pinter, Robert Guthrie, and Jacob Eisenstein. Mimicking word embeddings using subword RNNs. In Proc. EMNLP, 2017.
 Press & Wolf (2016) Ofir Press and Lior Wolf. Using the output embedding to improve language models. CoRR, abs/1608.05859, 2016.
 Ruder (2017) Sebastian Ruder. A survey of cross-lingual embedding models. CoRR, 2017.
 Ruiz-Antolín & Segura (2016) Diego Ruiz-Antolín and Javier Segura. A new type of sharp bounds for ratios of modified Bessel functions. Journal of Mathematical Analysis and Applications, 2016.

 Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In Proc. EMNLP, 2015.
 See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proc. ACL, pp. 1073–1083, 2017.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proc. ACL, 2016.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Proc. NIPS, pp. 3104–3112, 2014.
 Vinyals & Le (2015) Oriol Vinyals and Quoc V. Le. A neural conversational model. CoRR, abs/1506.05869, 2015.
 Xing et al. (2015) Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In Proc. HLT-NAACL, 2015.
 Xiong et al. (2017) Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Michael L Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. Toward human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2410–2423, 2017.
 Yin et al. (2015) Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. Neural generative question answering. CoRR, abs/1512.01337, 2015.
 Zipf (1935) George Kingsley Zipf. The psychobiology of language. 1935.
8 Appendix
8.1 Hyperparameter and Infrastructure Details
8.2 Gradient Computation for NLLvMF loss
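The NLLvMF loss is given as

$$\mathrm{NLLvMF}(\hat{e}; e(w)) = -\log C_m(\|\hat{e}\|) - \hat{e}^{\top} e(w),$$

where $C_m(\kappa)$ is given as

$$C_m(\kappa) = \frac{\kappa^{m/2-1}}{(2\pi)^{m/2}\, I_{m/2-1}(\kappa)}.$$

The normalization constant is not directly differentiable because the Bessel function cannot be written in closed form. The gradient of the first component ($\log C_m(\|\hat{e}\|)$) of the loss is given as

$$\nabla \log C_m(\kappa) = -\frac{I_{m/2}(\kappa)}{I_{m/2-1}(\kappa)}.$$

This involves two computations of the Bessel function $I_v(\kappa)$, for $v = m/2$ and $v = m/2 - 1$, for which we use scipy.special.ive. For high values of $v$ and low values of $\kappa$, the values of the Bessel function can become very small and underflow, even though the gradient is still large. (With the embedding dimension used in our experiments we do not face this issue, but the approximation below is useful when using embeddings of higher dimensions.) To deal with underflow, the gradient can be approximated by its (tight) lower bound (Ruiz-Antolín & Segura, 2016),

$$-\frac{I_{m/2}(\kappa)}{I_{m/2-1}(\kappa)} \geq -\frac{\kappa}{m/2 - 1 + \sqrt{(m/2+1)^2 + \kappa^2}}.$$

That is, in the initial steps of training, one might need to use this approximation of the gradient to train the model and switch to the exact computation later on. One could also approximate the value of $\log C_m(\kappa)$ by integrating the approximate gradient, which gives, up to an additive constant,

$$\log C_m(\kappa) \approx -\sqrt{(m/2+1)^2 + \kappa^2} + (m/2-1)\log\left(m/2 - 1 + \sqrt{(m/2+1)^2 + \kappa^2}\right).$$

In practice, we see that replacing $\log C_m(\kappa)$ with this approximation in the loss function gives similar performance on the test data and also alleviates the problem of underflow. We thus recommend using it.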
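The computations described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not our released implementation: the function names (`log_cmk`, `grad_log_cmk`, `grad_log_cmk_bound`, `log_cmk_approx`) are hypothetical, and the approximation of the log-normalizer is written only up to an additive constant, as obtained by integrating the gradient bound.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function


def log_cmk(m, kappa):
    """log C_m(kappa) for the vMF normalizer in m dimensions.

    For kappa > 0, ive(v, kappa) = iv(v, kappa) * exp(-kappa),
    so log iv(v, kappa) = log ive(v, kappa) + kappa.
    """
    v = m / 2.0 - 1.0
    log_bessel = np.log(ive(v, kappa)) + kappa
    return v * np.log(kappa) - (m / 2.0) * np.log(2.0 * np.pi) - log_bessel


def grad_log_cmk(m, kappa):
    """Exact derivative -I_{m/2}(kappa) / I_{m/2-1}(kappa).

    The exp(-kappa) scaling of ive cancels in the ratio.
    """
    return -ive(m / 2.0, kappa) / ive(m / 2.0 - 1.0, kappa)


def grad_log_cmk_bound(m, kappa):
    """Lower bound on the derivative; requires no Bessel evaluation."""
    a = m / 2.0 - 1.0
    b = m / 2.0 + 1.0
    return -kappa / (a + np.sqrt(b * b + kappa * kappa))


def log_cmk_approx(m, kappa):
    """Antiderivative of the gradient bound (additive constant omitted)."""
    a = m / 2.0 - 1.0
    b = m / 2.0 + 1.0
    s = np.sqrt(b * b + kappa * kappa)
    return -s + a * np.log(a + s)
```

For very high dimensions and small $\kappa$, both Bessel values in `grad_log_cmk` can underflow to zero, whereas `grad_log_cmk_bound` and `log_cmk_approx` use only elementary functions and remain finite.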
8.3 Translation Quality and Performance: Additional Results
Figure 2 shows the convergence time results for more IWSLT datasets. The results shown are averaged over multiple runs, and are in line with results reported in Figure 1.
In Table 1, we present translation quality results for our proposed model and comparable baselines with a beam size of one. Here, for completeness, Table 9 shows additional results for softmax-based models with a beam size of 5.
Table 9: BLEU scores of softmax-based models with a beam size of 5.
Dataset  BLEU
IWSLT fr–en  32.2
IWSLT de–en  26.1
IWSLT en–fr  32.4
WMT de–en  31.9
With our proposed models, it is in principle possible to generate candidates for beam search using nearest neighbors. However, ranking the partially generated sequences is not trivial: one could rank them by the loss values themselves, but initial experiments with this setting did not yield significant gains. In this work, we focus on making training with continuous outputs efficient and accurate, which gives us large gains in training time. Decoding with beam search requires substantial further investigation, and we leave it for future work.
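To make the idea of nearest-neighbor candidate generation concrete, the sketch below returns the $k$ target embeddings closest to a predicted output vector. It is only an illustration: the function name `nearest_neighbor_candidates` is hypothetical, cosine similarity is assumed as the closeness measure, and it does not address the open question of how to score partial sequences.

```python
import numpy as np


def nearest_neighbor_candidates(output_vec, embeddings, k=5):
    """Return (indices, cosine similarities) of the k rows of `embeddings`
    that are closest to `output_vec`, most similar first."""
    # Normalize rows and the query so the dot product is cosine similarity.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    out_norm = output_vec / np.linalg.norm(output_vec)
    sims = emb_norm @ out_norm
    top = np.argsort(-sims)[:k]  # indices sorted by decreasing similarity
    return top, sims[top]
```

Each returned index would correspond to one candidate word for extending a hypothesis on the beam.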
8.4 Sample Translations from Test Sets
[Table: sample translations from the test sets, listing each input and its reference; the table contents were not recovered.]