1 Introduction
With neural networks, tasks such as language modeling (LM) and machine translation (MT) are generally approached by factorizing the target sentence probability into products of target word posterior probabilities
[bengio2003neural, sundermeyer2012lstm, sutskever2014sequence]. In order to classify over the target vocabulary, it is necessary to compute a context vector, learn a projection matrix and normalize the similarity scores between the context vector and the word vectors. While various model architectures have been proposed to calculate the context vectors
[bahdanau2014neural, gehring2017convolutional, vaswani2017attention], most of them use a softmax layer with the inner product function to compute the word posterior probabilities. [yang2017breaking] identify a shortcoming of this formulation, which they call the “Softmax Bottleneck”. The problem lies in the exponential and logarithm calculations when using the cross-entropy criterion, which result in a low-rank log word posterior probability matrix. Hypothesizing that natural language is high-rank, the authors argue that the “Softmax Bottleneck” limits the expressiveness of such models. One natural thought on this problem is its similarity to the lack of expressiveness of a logistic regression model or a support vector machine (SVM) with a simple linear kernel. By implicitly transforming data points into high-dimensional feature spaces, kernels can increase the expressiveness of the classifier and allow for more complex decision boundaries [bishop2006prml]. Note that a kernel is deemed valid when it corresponds to a scalar product in some feature space [bishop2006prml], or, equivalently, when its corresponding Gram matrix is positive semidefinite [shawe2004kernel]. Yet empirical results [lin2003study, boughorbel2005conditionally] also show that conditionally positive semidefinite kernels can perform well in some applications. In this work, we do not enforce the positive semidefiniteness of kernels.
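The Gram-matrix criterion for kernel validity is easy to check numerically on a small sample. The sketch below (illustrative only; the kernel parametrizations here are our assumptions, not the ones studied later in this work) shows that the linear and Gaussian kernels yield positive semidefinite Gram matrices, while a log-type kernel, being only conditionally positive semidefinite, does not:

```python
import numpy as np

def gram_matrix(kernel, X):
    """Pairwise kernel evaluations G[i, j] = kernel(X[i], X[j])."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(G, tol=1e-8):
    """A symmetric matrix is positive semidefinite iff all eigenvalues >= 0."""
    return bool(np.all(np.linalg.eigvalsh(G) >= -tol))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

linear = lambda x, y: x @ y                                # valid kernel
gauss = lambda x, y: np.exp(-np.sum((x - y) ** 2))         # valid kernel
logk = lambda x, y: -np.log(np.linalg.norm(x - y) + 1.0)   # conditionally PSD only

print(is_psd(gram_matrix(linear, X)))  # True
print(is_psd(gram_matrix(gauss, X)))   # True
print(is_psd(gram_matrix(logk, X)))    # False
```

The log-type Gram matrix has zeros on the diagonal and negative off-diagonal entries, so its trace is zero and it necessarily has negative eigenvalues, which is exactly why such kernels are only conditionally positive semidefinite.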
Motivated to examine the performances of various kernels in LM and MT, we structure this work as follows:


We implement individual kernels as replacements for the inner product function in the softmax layer and test them on LM and MT tasks.

We look at the gradient properties of several kernels and analyze the observed performance differences.

We investigate various mixtures of kernels.

We further examine and compare the disambiguation abilities of the linear kernel and a mixture of kernels.
2 Related Work
The softmax layer with the inner product similarity function has limits in terms of expressiveness: [yang2017breaking] identify the “Softmax Bottleneck”, demonstrating its inability to represent arbitrary target distributions. As a solution, they propose the “Mixture-of-Softmaxes” (MoS) architecture. [kanai2018sigsoftmax] reanalyze the problem and suggest including an extra sigmoid function in the softmax formula. [herold2018improving] develop weight norm initialization and normalization methods on top of MoS. [takase2018direct] extend the architecture and introduce a regularization term to encourage equal contributions of mixture components.

Kernels are generally considered to be a family of energy functions, which can implicitly map data points into high-dimensional spaces, allowing for the learning of complex decision boundaries [bishop2006prml]. [scholkopf2002learning] provide detailed and extensive information on the topic of learning with kernels using SVMs. [souza2010kernel] curates an incomplete list of popular kernels. [zhu2002kernel] build on kernel logistic regression and develop a classification algorithm called the import vector machine. [cho2009kernel] explore the use of arc-cosine kernels in a multilayer nonlinear transformation setup. [memisevic2010gated] introduce a vector of binary latent variables and propose to use a bilinear scoring function in the softmax.
In pursuit of more powerful word representations, [vilnis2014word] and [athiwaratkun2017multimodal]
propose to embed words into Gaussian distributions to better capture entailment properties and multiple meanings of words.
[nickel2017poincare] show that an $n$-dimensional Poincaré ball is a suitable space in which one can embed words to better represent hierarchies. [dhingra2018embedding] describe a reparametrization trick to automate the process of renormalizing word vector norms.

3 Methodology
3.1 Generalized Softmax
According to [kanai2018sigsoftmax], because of the “logarithm of exponential” calculation in the “softmax and cross-entropy” setup, the nonlinearity of the logarithm of the activated logit is a prerequisite for breaking the “Softmax Bottleneck”. While that work presents a Sigsoftmax activation function applied to logits calculated with inner products, we explore many nonlinear kernel functions for the logit calculation, including ones traditionally used in SVMs.
Specifically, we use a generalized softmax layer

$$p(w \mid c) = \sum_{k=1}^{K} \omega_k(c)\, \frac{\exp\big(\mathcal{K}_k(e_w, h_k)\big)}{\sum_{w'} \exp\big(\mathcal{K}_k(e_{w'}, h_k)\big)} \quad (1)$$

with $e_w$ being the $w$-th column of the projection matrix $E$ and $h_k$ being the $k$-th transformed context vector. $E$ is shared across mixture components and each component uses kernel $\mathcal{K}_k$ to calculate the logits. Both the mixture weight

$$\omega_k(c) = \mathrm{softmax}_k\big(W_\omega\, c\big) \quad (2)$$

and the transformed context vector

$$h_k = \tanh\big(W_k\, c\big) \quad (3)$$

depend on the original context vector $c$. In this setup, the matrices $E$, $W_\omega$ and $W_1, \dots, W_K$ are all trainable model parameters, with $W_k \in \mathbb{R}^{D \times D}$, where $D$ is the hidden dimension size.
There are two main motivations behind this generalized setup: first, by mapping $c$ to $h_k$, we hope to transform the context vector into the respective feature space and generate different logit distributions over the vocabulary; second, by explicitly conditioning $\omega$ on $c$, we hope the model is able to select which kernel is more appropriate for each context. Note that, in Equation 1, $E$ does not have a subscript $k$, which means we tie the projection matrices across the kernels. This greatly limits the expressiveness of our model, but is a compromise made because of memory limitations.
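The generalized softmax described above can be sketched in a few lines of numpy. This is a minimal illustration with randomly initialized parameters and two illustrative kernels; the symbol-to-variable mapping is our assumption:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def generalized_softmax(c, E, W_omega, W_list, kernels):
    """Mixture-of-kernels word distribution for a single context vector c.

    E       : (V, D) projection matrix, shared by all mixture components
    W_omega : (K, D) matrix producing the mixture weights (Eq. 2)
    W_list  : K matrices of shape (D, D); h_k = tanh(W_k c) (Eq. 3)
    kernels : K functions mapping (E, h_k) to a (V,) logit vector
    """
    omega = softmax(W_omega @ c)                  # mixture weights over components
    p = np.zeros(E.shape[0])
    for k, (W_k, kern) in enumerate(zip(W_list, kernels)):
        h_k = np.tanh(W_k @ c)                    # transformed context vector
        p += omega[k] * softmax(kern(E, h_k))     # weighted component softmax
    return p

# two illustrative components: an inner-product kernel and a
# negative-squared-distance kernel
lin = lambda E, h: E @ h
nsq = lambda E, h: -np.sum((E - h) ** 2, axis=1)

rng = np.random.default_rng(0)
V, D, K = 100, 16, 2
c = rng.normal(size=D)
E = rng.normal(size=(V, D))
W_omega = rng.normal(size=(K, D))
W_list = [rng.normal(size=(D, D)) for _ in range(K)]

p = generalized_softmax(c, E, W_omega, W_list, [lin, nsq])
print(p.shape, round(float(p.sum()), 6))  # (100,) 1.0
```

Since each component is a properly normalized softmax and the mixture weights sum to one, the result is a valid distribution over the vocabulary regardless of which kernels are mixed.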
3.2 Individual Kernels
In total, we implement and experiment with 9 individual kernels: linear (lin), logarithm (log), power (pow), polynomial (pol), radial basis function (rbf), symmetric spherical Gaussian (ssg) [vilnis2014word], symmetric spherical mixture of Gaussians (mog) [athiwaratkun2017multimodal], nonparametric hyperbolic (hpb) [dhingra2018embedding] and wavelet (wav) [zhang2004wavelet]:

$$\mathcal{K}_{\mathrm{lin}}(e, h) = e^\top h \quad (4)$$

$$\mathcal{K}_{\mathrm{log}}(e, h) = -\log\big(\lVert e - h\rVert^{d} + 1\big) \quad (5)$$

$$\mathcal{K}_{\mathrm{pow}}(e, h) = -\lVert e - h\rVert^{d} \quad (6)$$

$$\mathcal{K}_{\mathrm{pol}}(e, h) = \big(e^\top h + c_0\big)^{d} \quad (7)$$

$$\mathcal{K}_{\mathrm{rbf}}(e, h) = \exp\big(-\gamma \lVert e - h\rVert^{2}\big) \quad (8)$$

$$\mathcal{K}_{\mathrm{ssg}}(e, h) = -\frac{\lVert e - h\rVert^{2}}{2(\sigma_e^{2} + \sigma_h^{2})} - \frac{D}{2}\log\big(\sigma_e^{2} + \sigma_h^{2}\big) \quad (9)$$

$$\mathcal{K}_{\mathrm{mog}}(e, h) = \log \sum_{j=1}^{J} \pi_j \exp\big(\mathcal{K}_{\mathrm{ssg}}(e^{(j)}, h)\big) \quad (10)$$

$$\mathcal{K}_{\mathrm{hpb}}(e, h) = -\operatorname{arcosh}\Big(1 + \frac{2\lVert e - h\rVert^{2}}{(1 - \lVert e\rVert^{2})(1 - \lVert h\rVert^{2})}\Big) \quad (11)$$

$$\mathcal{K}_{\mathrm{wav}}(e, h) = \prod_{i=1}^{D} \cos\Big(1.75\,\frac{e_i - h_i}{a}\Big) \exp\Big(-\frac{(e_i - h_i)^{2}}{2a^{2}}\Big) \quad (12)$$
These individual kernels can all be thought of as energy functions between the context vector $h$ and the word vector $e$. Because of the exponential calculation outside of the logit calculation, these kernels may result in numerically unstable computations. For example, using the rbf kernel results in an exponential-of-exponential operation, which easily blows up when $e$ and $h$ are distant. We nonetheless implement and examine the properties of these kernels.
Additionally, memory consumption may blow up when using certain kernels. This is because the dimension reduction step common to all kernels may not always be immediately executable. In such cases, all pairwise similarities/distances between the context vectors and the word vectors have to be cached. To reduce memory usage, we apply several tricks: (1) use spherical covariance matrices, (2) simplify the wavelet kernel, and (3) rewrite the power of the vector difference norm

$$-\lVert e - h\rVert^{2} = 2\, e^\top h - \lVert e\rVert^{2} - \lVert h\rVert^{2} \quad (13)$$

which also suggests that pow can be thought of as a vector-norm-regularized version of the inner product.
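The rewrite in Equation 13 is what makes distance-based logits computable with plain matrix products instead of materializing all pairwise difference vectors. A numpy sketch of the two equivalent computations (variable names are ours):

```python
import numpy as np

def neg_sq_dist_naive(E, H):
    """-||e_w - h_c||^2 for all (context, word) pairs via explicit differences.
    Materializes a (C, V, D) tensor -- the memory blow-up described above."""
    diff = H[:, None, :] - E[None, :, :]
    return -np.sum(diff ** 2, axis=-1)

def neg_sq_dist_expanded(E, H):
    """Same quantity via 2 h.e - ||e||^2 - ||h||^2 (the rewrite in Eq. 13);
    only (C, V)-, (V,)- and (C,)-sized intermediates are needed."""
    e2 = np.sum(E ** 2, axis=1)          # (V,) squared word-vector norms
    h2 = np.sum(H ** 2, axis=1)          # (C,) squared context-vector norms
    return 2.0 * H @ E.T - e2[None, :] - h2[:, None]

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 64))   # word vectors
H = rng.normal(size=(32, 64))     # context vectors
print(np.allclose(neg_sq_dist_naive(E, H), neg_sq_dist_expanded(E, H)))  # True
```

For a vocabulary of size V, the naive form needs O(C·V·D) memory per batch, while the expanded form stays at O(C·V), which is the same footprint as an ordinary inner-product logit matrix.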
4 Experiments
4.1 Experimental Setup
In this work, two datasets are used: Switchboard (SWB) for LM and IWSLT 2014 German-English (IWSLT) for MT. SWB is a relatively small dataset, with a vocabulary size of 30k and a training token count of 25M. For SWB, we use a standard 2-layer LSTM to generate context vectors, with 512 hidden dimensions and 0.1 dropout on the embedded word vectors. For IWSLT, we follow the setup in [edunov2017classical], using 160k parallel training sentences and 10k joint BPE merge operations. The transformer architecture is used to produce context vectors. We use 512 hidden dimensions in the encoder and decoder stacks, 1024 hidden dimensions in the fully-connected layers, and 4 attention heads. As in Equation 1, the context vectors are compared with the word vectors in the projection matrix. Hyperparameters of the kernels are tuned with grid search to give the best performance on the development set, and we vary the kernel hyperparameters to test various kernel settings. We use the Fairseq toolkit [ott2019fairseq] to conduct the experiments.

4.2 Individual Kernels
The performances of models using individual kernels are summarized in Table 1. References from the literature are included to show the relative strengths of the kernels.
Table 1: Performance of individual kernels.

Method   SWB (PPL)   IWSLT (Bleu)
Ref.     47.6 [irie2018investigation]   35.2 [wu2019pay]
lin      46.8        34.3
log      103.0       0.4
pow      46.8        32.8
pol      47.3        31.7
rbf      284.9       0.0
ssg      49.9        34.6
mog      46.7        34.2
hpb      122.6       0.3
wav      289.7       0.0
Compared to the lin kernel, all other individual kernels have exactly the same number of parameters and comparable run times. The only difference lies in how the logits are calculated. On both datasets, we see consistent behavior. While lin serves as a reasonably good baseline, pow, pol, ssg and mog are on the same level of performance, even slightly outperforming lin in some cases (mog on SWB and ssg on IWSLT). log and hpb are worse, giving much higher perplexities (PPL) and Bleu scores [papineni2002bleu] close to zero. Among all 9 kernels, rbf and wav perform the worst.
4.3 Gradient Properties
As shown in Section 4.2, a wide range of performances is observed across different kernels. In order to understand why some kernels perform better than others, we select four simple kernels (rbf, wav, log and pow) and plot their function graphs in Figure 1.
All kernels have their maximum values at $\lVert e - h\rVert = 0$, i.e. when the context vector $h$ is exactly the same as the word vector $e$. The gradient properties, however, vary across these kernels. Far away from the optimum, pow has a constant nonzero gradient, while rbf, wav and log have near-zero gradients. As $\lVert e - h\rVert$ approaches zero, the absolute gradient of log increases, while non-negligible gradients show up in rbf and wav only when $\lVert e - h\rVert$ is close to zero. We think strong supervision signals in the gradients are helpful for model convergence. Considering these gradient properties, we expect the performance ordering among these kernels to be rbf < wav < log < pow, from worst to best. The results in Table 1 fit our expectations very well. This further suggests that when selecting and designing alternative kernels, the gradient properties across the domain of the parameters should be carefully considered.
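The qualitative argument above can be checked numerically. Using standard one-dimensional profiles of the four kernels as functions of the distance (our assumption; the exact parametrizations and hyperparameters in the experiments may differ), finite-difference gradients far from and close to the optimum show the described pattern:

```python
import numpy as np

# Illustrative kernel profiles as functions of r = ||e - h||,
# with all hyperparameters set to 1 (an assumption for this sketch).
kernels = {
    "pow": lambda r: -r,
    "log": lambda r: -np.log(r + 1.0),
    "rbf": lambda r: np.exp(-r ** 2),
    "wav": lambda r: np.cos(1.75 * r) * np.exp(-r ** 2 / 2.0),
}

def grad(f, r, eps=1e-6):
    """Central finite-difference derivative of f at r."""
    return (f(r + eps) - f(r - eps)) / (2 * eps)

# gradient magnitudes far from the optimum (r = 5) vs. close to it (r = 0.1):
# pow stays constant, log grows as r -> 0, rbf and wav vanish when r is large
for name, f in kernels.items():
    print(f"{name}: |grad| at r=5: {abs(grad(f, 5.0)):.6f}, "
          f"at r=0.1: {abs(grad(f, 0.1)):.6f}")
```

With these profiles, pow keeps a unit-magnitude gradient everywhere, log still provides a usable signal at moderate distances, and rbf and wav are effectively flat away from the optimum, matching the ordering rbf < wav < log < pow observed in the results.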
4.4 Mixtures of Kernels
Inspired by the MoS approach, we train LMs combining the outputs of multiple kernels according to Equation 1 on SWB. Similar to [takase2018direct], we add the variance of the mixture weights, scaled by $\lambda$ and averaged over the data, to the standard cross-entropy loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \cdot \frac{1}{N} \sum_{n=1}^{N} \operatorname{Var}_k\big(\omega_k(c_n)\big) \quad (14)$$
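A minimal sketch of such a variance-regularized loss, assuming the Equation 14 style regularizer described above (function and variable names are ours, and the exact form used in the experiments may differ):

```python
import numpy as np

def regularized_loss(log_probs, targets, omegas, lam):
    """Cross entropy plus lam * batch-averaged variance of mixture weights.

    log_probs : (N, V) log word probabilities
    targets   : (N,) gold word indices
    omegas    : (N, K) mixture weights per position
    lam       : scaling factor of the variance regularizer
    """
    ce = -np.mean(log_probs[np.arange(len(targets)), targets])
    var = np.mean(np.var(omegas, axis=1))   # variance across the K components
    return ce + lam * var

# toy example: 2 positions, 3 words, 2 mixture components (illustrative values)
log_probs = np.log(np.array([[0.5, 0.25, 0.25],
                             [0.25, 0.5, 0.25]]))
targets = np.array([0, 1])
omegas = np.array([[0.9, 0.1],
                   [0.8, 0.2]])
print(round(regularized_loss(log_probs, targets, omegas, lam=0.01), 4))  # 0.6944
```

Penalizing the variance pushes the mixture weights toward equal contributions, which is the stated motivation of the regularizer.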
Table 2: Mixture-weight variance and perplexity for different values of λ on SWB.

λ       Variance   PPL
0.001   4.74       46.8
0.01    4.98       46.6
0.1     3.67       47.2
1       3.81       47.4
Table 3: Mixture settings and perplexities on SWB.

Name   Mixture settings           PPL
mos    9×lin                      47.8
mix1   1 of each kernel           47.1
mix2   lin, log, rbf, hpb, wav    46.6
mix3   3×lin, log                 46.5
mix4   lin, log, pow, pol         47.3
mix5   lin, log, rbf, hpb         47.1
mix6   2×lin, 2×rbf               46.7
Table 4: Example predictions of the lin model and a kernel mixture on SWB.

Model          Prediction
Ground Truth   … books can end up being outdated very quickly
lin            … books can end up being outdated very soon
mix            … books can end up being outdated very quickly
Ground Truth   … if you vote for a republican or vote for a democrat
lin            … if you vote for a republican or vote for a republican
mix            … if you vote for a republican or vote for a democrat
Performances of MoS systems for different values of $\lambda$ are shown in Table 2. Based on these results, we run all mixture experiments with a single fixed $\lambda$ that provides a good compromise between regularization and performance.
The detailed mixture settings and perplexity results are summarized in Table 3. Specifically, we select “mos” to try to reproduce the “Softmax Bottleneck” paper [yang2017breaking] and “mix1” to test a big mixture containing one of each kernel. “mix2” through “mix6” are selected randomly to explore the kernel combination space. We also experiment with further mixture settings, but unfortunately, with tied projection matrices, only those mixtures containing the lin kernel give good performance. Note that the projection matrices are tied and multiple instances of the same kernel may be included in a mixture. In this case, each mixture component is still free to learn its own transformed context vectors.
Compared to the individual kernels, the decoding speed of the mixture models is slowed down by a factor of two on average. The increase in the number of parameters due to the context vector projections is negligible when the projection matrices are tied. As can be seen, all the mixture settings in Table 3 have performances similar to the simple lin setup in Table 1. This is very likely because they all have at least one linear component, and the linear components consistently receive a total weight above 50%. We therefore conclude that mixtures of kernels using a shared projection matrix cannot significantly improve over the baseline. We find no fundamental difference between the open-sourced “Mixture-of-Softmaxes” implementation [yang2017breaking] and ours. Unfortunately, we cannot replicate the results from the original paper. We note, however, that they use different datasets and include many more techniques such as activation regularization and averaged SGD optimization.

4.5 Disambiguation Abilities
In theory, there is a potential drawback of the lin kernel used together with the softmax layer. Consider the case where two words $w_1$ and $w_2$ are close syntactically and/or semantically. It is a common observation that their corresponding word vectors are also close together after successful training [mikolov2013distributed, mikolov2013efficient, le2014distributed]. In this case, for any context vector, the logits of $w_1$ and $w_2$ will be similar as well. Although the alternative kernels studied here also suffer from this problem when the word vectors are close, with nonlinear activations the difference between the logits may be amplified, making it easier to disambiguate the words.
To show the potentially better disambiguation properties of kernel mixtures, we take a more detailed look at the LM task. For the lin model, the projection matrix is extracted and pairwise word similarities are calculated using the inner product. This is then used to extract word clusters in the embedding space. Two of the extracted clusters are: {quickly, slowly, soon, quick, easily} and {republicans, politicians, democrat, republican, democrats}. We suspect that it might be difficult for the lin model to distinguish words in these clusters, as their similarity scores are very close. This is indeed what we observe in the example sentences shown in Table 4. This suggests that even if diversifying the output layer with different kernels does not result in immediate improvements in terms of perplexity, a kernel-mixture-based method may still be superior in other aspects.
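The cluster extraction described here amounts to a nearest-neighbor lookup under inner-product similarity on the rows of the projection matrix. A toy sketch with synthetic embeddings (hypothetical data, not the trained SWB embeddings from the experiments):

```python
import numpy as np

def nearest_words(E, vocab, query, k=3):
    """Top-k neighbors of `query` under inner-product similarity,
    treating the rows of the projection matrix E as word vectors."""
    q = E[vocab.index(query)]
    scores = E @ q                       # inner-product similarity to all words
    order = np.argsort(-scores)          # most similar first
    return [vocab[i] for i in order[:k + 1] if vocab[i] != query][:k]

# synthetic embeddings forming two tight clusters, mimicking the kind of
# clusters extracted from the trained lin model
rng = np.random.default_rng(0)
vocab = ["quickly", "slowly", "soon", "republican", "democrat", "politician"]
centers = np.array([0, 0, 0, 1, 1, 1])
E = rng.normal(scale=0.1, size=(6, 8)) + np.array([[2.0] * 8, [-2.0] * 8])[centers]

print(nearest_words(E, vocab, "quickly", k=2))  # two words from the same cluster
```

Words inside such a tight cluster receive nearly identical logits under the lin kernel for any context, which is exactly the ambiguity that the nonlinear mixture components can help resolve.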
5 Conclusion
Motivated by the similarity between the “Softmax Bottleneck” problem and the lack of expressiveness of a logistic regression model or an SVM with a simple linear kernel, we explore the use of kernel functions in the softmax layer for contextual word classification:

In place of the inner product function, kernels and mixtures of kernels are used in the softmax layer. Our experiments with 9 different individual kernels on LM and MT exhibit a wide range of performances, with lin, pol, pow, ssg and mog being the best-performing ones.

Examining the gradient properties, we give reasons why some kernels perform better than others and argue that the gradient properties of a kernel function across the domain of its parameters are worthy of careful consideration.

In mixture settings consisting of at least one lin kernel, lin consistently receives a large weight.

While not significantly better than the lin kernel, we observe cases where the mixture model is better at disambiguating similar words.
In our mixture experiments, projection matrices are shared due to memory constraints. This greatly limits the expressiveness of the model. The next step is to untie the word embeddings across different kernels and allow for the learning of even more complex decision boundaries.
6 Acknowledgements
This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project “SEQCLAS”) and the Deutsche Forschungsgemeinschaft (DFG; grant agreement NE 572/81, project “CoreTec”). The GPU computing cluster was supported by the DFG under grant INST 222/11681 FUGG.