Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification

by   Yingbo Gao, et al.

Prominently used in support vector machines and logistic regressions, kernel functions (kernels) can implicitly map data points into high dimensional spaces and make it easier to learn complex decision boundaries. In this work, by replacing the inner product function in the softmax layer, we explore the use of kernels for contextual word classification. In order to compare the individual kernels, experiments are conducted on standard language modeling and machine translation tasks. We observe a wide range of performances across different kernel settings. Extending the results, we look at the gradient properties, investigate various mixture strategies and examine the disambiguation abilities.



1 Introduction

With neural networks, tasks such as language modeling (LM) and machine translation (MT) are generally approached by factorizing the target sentence probability into products of target word posterior probabilities [bengio2003neural, sundermeyer2012lstm, sutskever2014sequence]. In order to classify over the target vocabulary, it is necessary to compute a context vector, learn a projection matrix and normalize the similarity scores between the two into probabilities. While various model architectures are proposed to calculate the context vectors [bahdanau2014neural, gehring2017convolutional, vaswani2017attention], most of them use a softmax layer with the inner product function to compute the word posterior probabilities. [yang2017breaking] identify a shortcoming with the formulation above which they call the “Softmax Bottleneck”. The problem lies in the exponential-and-logarithm calculation when using the cross-entropy criterion, which results in a low-rank log word posterior probability matrix. Hypothesizing that natural language is high-rank, the authors argue that the “Softmax Bottleneck” is a limiting factor of the expressiveness of the models. One natural thought on this problem is its similarity to the lack of expressiveness of a logistic regression model or a support vector machine (SVM) with a simple linear kernel.

By implicitly transforming data points into high dimensional feature spaces, kernels can increase the expressiveness of the classifier and allow for more complex decision boundaries [bishop2006prml]. Note that a kernel is deemed valid when it corresponds to a scalar product in some feature space [bishop2006prml], or its corresponding Gram matrix is positive semidefinite [shawe2004kernel]. Yet empirical results [lin2003study, boughorbel2005conditionally] also show that conditionally positive semidefinite kernels can perform well in some applications. In this work, we do not enforce the positive semidefiniteness of kernels.

Motivated to examine the performances of various kernels in LM and MT, we structure this work as follows:

  1. We implement individual kernels in replacement of the inner product function in the softmax layer and test them on LM and MT tasks.

  2. We look at the gradient properties of several kernels and analyze the observed performance differences.

  3. We investigate various mixtures of kernels.

  4. We further examine and compare the disambiguation abilities of the linear kernel and a mixture of kernels.

2 Related Work

The softmax layer with the inner product similarity function has limits in terms of expressiveness: [yang2017breaking] identify the “Softmax Bottleneck”, demonstrating its inability to represent arbitrary target distributions. As a solution, they propose the “Mixture-of-Softmaxes” (MoS) architecture. [kanai2018sigsoftmax] reanalyze the problem and suggest including an extra sigmoid function in the softmax formula. [herold2018improving] develop weight norm initialization and normalization methods on top of MoS. [takase2018direct] extend the architecture and introduce a regularization term to encourage equal contributions of mixture components.

Kernels are generally considered to be a family of energy functions, which can implicitly map data points into high dimensional spaces, allowing for the learning of complex decision boundaries [bishop2006prml]. [scholkopf2002learning] provide detailed and extensive information on the topic of learning with kernels using SVMs. [souza2010kernel] curates an incomplete list of popular kernels. [zhu2002kernel] build on kernel logistic regression and develop a classification algorithm called import vector machine. [cho2009kernel] explore the use of arccosine kernels in a multilayer nonlinear transformation setup. [memisevic2010gated] introduce a vector of binary latent variables and propose to use a bilinear scoring function in the softmax.

In pursuit of more powerful word representations, [vilnis2014word] and [athiwaratkun2017multimodal] propose to embed words into Gaussian distributions to better capture entailment properties and multiple meanings of words. [nickel2017poincare] show that an $n$-dimensional Poincaré ball is a suitable space, in which one can embed words to better represent hierarchies. [dhingra2018embedding] describe a re-parametrization trick to automate the process of renormalizing word vector norms.

3 Methodology

3.1 Generalized Softmax

According to [kanai2018sigsoftmax], because of the “logarithm of exponential” calculation in the “softmax and cross entropy” setup, the non-linearity of the logarithm of the activated logit is a prerequisite to break the “Softmax Bottleneck”. While the paper presents a Sigsoftmax activation function applied on logits calculated with inner products, we explore many nonlinear kernel functions for the logit calculation, including the ones traditionally used in SVMs.

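Specifically, we use a generalized softmax layer

$$p(w_v \mid h) = \sum_{k=1}^{K} \pi_k \, \frac{\exp\big(\mathcal{K}_k(w_v, h_k)\big)}{\sum_{v'=1}^{V} \exp\big(\mathcal{K}_k(w_{v'}, h_k)\big)} \qquad (1)$$

with $w_v$ being the $v$-th column of the projection matrix $W$ and $h_k$ being the $k$-th transformed context vector. $W$ is shared across the $K$ mixture components and each component uses its own kernel $\mathcal{K}_k$ to calculate the logits. Both the mixture weight $\pi_k$ and the transformed context vector $h_k$ depend on the original context vector $h$. In this setup, the projection matrix $W \in \mathbb{R}^{d \times V}$ and the parameters that map $h$ to $\pi_k$ and $h_k$ are all trainable model parameters, where $d$ is the hidden dimension size and $V$ is the vocabulary size.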

There are two main motivations behind this generalized setup: first, by mapping the context vector $h$ to a kernel-specific $h_k$, we hope to transform the context vector into the respective feature space and generate different logit distributions over the vocabulary; second, by explicitly conditioning the mixture weight $\pi_k$ on $h$, we hope the model is able to select which kernel is more appropriate for each context. Note that, in Equation 1, the projection matrix $W$ does not have a subscript $k$, which means we tie the projection matrices across the kernels. This greatly limits the expressiveness of our model, but is a compromise because of memory limitations.
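As a rough illustration of the generalized softmax layer described above, the following pure-Python sketch mixes per-kernel softmax distributions over one shared set of word vectors. The gating form (a softmax over linear scores), the `tanh` transformation of the context vector, the assumed form of the power kernel and all names (`generalized_softmax`, `gate_vecs`, `ctx_mats`) are assumptions made for illustration in the spirit of MoS; they need not match the implementation used in the experiments.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lin_kernel(w, h):
    """Inner product, i.e. the standard softmax logit."""
    return dot(w, h)

def pow_kernel(w, h, p=2.0):
    """Assumed form: negative p-th power of the Euclidean distance."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(w, h)))
    return -(dist ** p)

def generalized_softmax(h, word_vecs, kernels, gate_vecs, ctx_mats):
    """Mixture of kernel softmaxes over a shared list of word vectors.

    h         : context vector of length d
    word_vecs : V word vectors (columns of the shared projection matrix)
    kernels   : K functions kernel(w, h_k) -> float
    gate_vecs : K vectors of length d giving mixture scores (assumed form)
    ctx_mats  : K d-by-d matrices, as lists of rows, transforming h (assumed form)
    """
    pi = softmax([dot(m, h) for m in gate_vecs])
    probs = [0.0] * len(word_vecs)
    for weight, kernel, mat in zip(pi, kernels, ctx_mats):
        h_k = [math.tanh(dot(row, h)) for row in mat]
        component = softmax([kernel(w, h_k) for w in word_vecs])
        for v, p_v in enumerate(component):
            probs[v] += weight * p_v
    return probs

# Toy usage with d = 2, V = 3 and K = 2 (all numbers are made up).
h = [0.5, -1.0]
word_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = generalized_softmax(h, word_vecs, [lin_kernel, pow_kernel],
                            [[0.1, 0.2], [-0.3, 0.4]], [identity, identity])
```

Because `word_vecs` is passed once and reused by every component, the sketch mirrors the tied projection matrix; untying it would amount to passing one list of word vectors per kernel.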

3.2 Individual Kernels

In total, we implement and experiment with 9 individual kernels: linear (lin), logarithm (log), power (pow), polynomial (pol), radial basis function (rbf), symmetric spherical Gaussian (ssg) [vilnis2014word], symmetric spherical mixtures of Gaussian (mog) [athiwaratkun2017multimodal], non-parametric hyperbolic (hpb) [dhingra2018embedding] and wavelet (wav) [zhang2004wavelet].


These individual kernels can all be thought of as energy functions between the context vector $h$ and the word vector $w$. Because of the exponential calculation outside of the logit calculation, these kernels may result in numerically unstable computations. For example, using the rbf kernel results in an exponential-of-exponential operation, which easily blows up when $h$ and $w$ are distant. We nonetheless implement and examine the properties of these kernels.

Additionally, the memory consumption may blow up when using certain kernels. This is because the dimension reduction step common to all kernels may not always be immediately executable. In this case, all pairwise similarities/distances between the context vectors and the word vectors have to be cached. To reduce memory usage, we apply several tricks: (1) use spherical covariance matrices, (2) simplify the wavelet kernel and (3) rewrite the formula of the power of the vector difference norm

$$\|w - h\|^p = \big(\|w\|^2 + \|h\|^2 - 2\, w^\top h\big)^{p/2},$$

which also suggests that pow can be thought of as a vector norm regularized version of the inner product.
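The third trick can be sketched as follows. The naive version materializes a difference vector for every pair of word and context vectors, while the rewritten version only needs the squared norms and the inner products. The function names are hypothetical and the code is a minimal pure-Python illustration, not the tensor implementation used in the experiments.

```python
import math

def dist_pow_naive(w, h, p):
    """||w - h||^p computed from the explicit difference vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(w, h))) ** p

def dist_pow_expanded(w_sq, h_sq, wh, p):
    """||w - h||^p from squared norms and the inner product only."""
    sq = max(w_sq + h_sq - 2.0 * wh, 0.0)  # clamp tiny negative rounding errors
    return sq ** (p / 2.0)

def pairwise_dist_pow(word_vecs, ctx_vecs, p):
    """All V x B powered distances without storing any difference vector."""
    w_sq = [sum(x * x for x in w) for w in word_vecs]
    h_sq = [sum(x * x for x in h) for h in ctx_vecs]
    return [[dist_pow_expanded(w_sq[i], h_sq[j],
                               sum(x * y for x, y in zip(w, h)), p)
             for j, h in enumerate(ctx_vecs)]
            for i, w in enumerate(word_vecs)]

# Toy check with made-up vectors.
word_vecs = [[1.0, 2.0], [0.0, -1.0]]
ctx_vecs = [[3.0, -1.0]]
dists = pairwise_dist_pow(word_vecs, ctx_vecs, 3.0)
```

In a tensor framework the inner-product term becomes a single matrix multiplication, so the reduction over the hidden dimension happens immediately and only a score matrix with one entry per word and context pair has to be kept.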

4 Experiments

4.1 Experimental Setup

In this work, two datasets are used: Switchboard (SWB) for LM and IWSLT 2014 German→English (IWSLT) for MT. SWB is a relatively small dataset, with a vocabulary size of 30k and a training token count of 25M. For SWB, we use a standard 2-layer LSTM to generate context vectors, with 512 hidden dimensions and 0.1 dropout on the embedded word vectors. For IWSLT, we follow the setup in [edunov2017classical], using 160k parallel training sentences and 10k joint BPE merge operations. The transformer architecture is used to produce context vectors. We use 512 hidden dimensions in the encoder and decoder stacks, 1024 hidden dimensions in the fully-connected layers and 4 attention heads. As in Equation 1, the context vectors are compared with the word vectors in the projection matrices. Hyperparameters of the kernels are tuned with grid search to give the best performance on the development set. We vary the number of mixture components $K$ and the kernels $\mathcal{K}_k$ to test various kernel settings. We use the Fairseq toolkit [ott2019fairseq] to conduct the experiments.

4.2 Individual Kernels

The performances of models using individual kernels are summarized in Table 1. References from the literature are included to show the relative strengths of the kernels.

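| Kernel | SWB (PPL) | IWSLT (Bleu) |
|--------|-----------|--------------|
| Ref. | 47.6 [irie2018investigation] | 35.2 [wu2019pay] |
| lin | 46.8 | 34.3 |
| log | 103.0 | 0.4 |
| pow | 46.8 | 32.8 |
| pol | 47.3 | 31.7 |
| rbf | 284.9 | 0.0 |
| ssg | 49.9 | 34.6 |
| mog | 46.7 | 34.2 |
| hpb | 122.6 | 0.3 |
| wav | 289.7 | 0.0 |

Table 1: Performance of individual kernels.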

Compared to the lin kernel, all other individual kernels have the exact same number of parameters and comparable run time. The only difference lies in how the logits are calculated. On both datasets, we see consistent behavior. While lin serves as a reasonably good baseline, pow, pol, ssg and mog are on the same level of performance, even slightly outperforming lin in some cases (mog on SWB and ssg on IWSLT). log and hpb are worse, giving much higher perplexity (PPL) and values close to zero in Bleu [papineni2002bleu]. Among all 9 kernels, rbf and wav perform the worst.

4.3 Gradient Properties

As shown in Section 4.2, a wide range of performances is observed across different kernels. In order to understand why some kernels perform better than others, we select four simple kernels (rbf, wav, log and pow) and plot their function graphs in Figure 1.

Figure 1: Graphs of rbf, wav, log and pow ($p = 1$).

All kernels have their maximum values at zero distance, $\|w - h\| = 0$. In this case, the context vector $h$ is exactly the same as the word vector $w$. The gradient properties, however, vary across these kernels. When far away from the optimum, pow has a constant non-zero gradient. On the other hand, rbf, wav and log have near-zero gradients. As $\|w - h\|$ approaches zero, the absolute gradient of log increases, while non-negligible gradients show up in rbf and wav only when $\|w - h\|$ is close to zero. We think strong supervision signals in the gradients are helpful for model convergence. Considering these gradient properties, we expect performances among these kernels to be: rbf ≈ wav < log < pow. The results in Table 1 fit our expectations very well. This further suggests that when selecting and designing alternative kernels, the gradient properties across the domain of the parameters should be carefully considered.
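The gradient argument can be checked numerically with a small sketch. The one-dimensional kernel profiles below are assumptions chosen to resemble the curves discussed above (a Gaussian for rbf, a cosine-modulated Gaussian for wav, a negative logarithm for log and a negative distance for pow with $p = 1$); they are illustrative stand-ins, not the exact parametrizations used in the experiments.

```python
import math

# Assumed kernel profiles as functions of the distance x >= 0.
KERNELS = {
    "rbf": lambda x: math.exp(-x * x),
    "wav": lambda x: math.cos(1.75 * x) * math.exp(-x * x / 2.0),
    "log": lambda x: -math.log(1.0 + x),
    "pow": lambda x: -x,  # p = 1
}

def abs_grad(f, x, eps=1e-6):
    """Absolute central finite-difference derivative of f at x."""
    return abs(f(x + eps) - f(x - eps)) / (2.0 * eps)

def gradient_table(distances):
    """Absolute gradient of every assumed kernel at each distance."""
    return {name: [abs_grad(f, x) for x in distances]
            for name, f in KERNELS.items()}

table = gradient_table([0.5, 2.0, 8.0])
for name, grads in table.items():
    print(name, ["%.2e" % g for g in grads])
```

Under these assumed profiles, pow keeps the same gradient magnitude at every distance, log decays slowly with distance, and rbf and wav contribute almost nothing once the context vector is far from the word vector.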

4.4 Mixtures of Kernels

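Inspired by the MoS approach, we train LMs combining the outputs of multiple kernels according to Equation 1 on SWB. Similar to [takase2018direct], we add the variance of the mixture weights, scaled by $\lambda$ and averaged over the data, to the standard cross entropy loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \cdot \frac{1}{N} \sum_{n=1}^{N} \mathrm{Var}_k\big[\pi_k(h_n)\big],$$

where $N$ is the number of training positions and $h_n$ is the context vector at position $n$.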

| Scale $\lambda$ | Variance | PPL |
|-----------------|----------|------|
| 0.001 | 4.74 | 46.8 |
| 0.01 | 4.98 | 46.6 |
| 0.1 | 3.67 | 47.2 |
| 1 | 3.81 | 47.4 |

Table 2: Regularization of the mixture weights.

| Name | Mixture settings | PPL |
|------|------------------|------|
| mos | 9×lin | 47.8 |
| mix | 1 of each kernel | 47.1 |
| mix | lin, log, rbf, hpb, wav | 46.6 |
| mix | 3×lin, log | 46.5 |
| mix | lin, log, pow, pol | 47.3 |
| mix | lin, log, rbf, hpb | 47.1 |
| mix | 2×lin, 2×rbf | 46.7 |

Table 3: Performance of mixtures of kernels.

| Model | Prediction |
|-------|------------|
| Ground truth | … books can end up being outdated very quickly |
| lin | … books can end up being outdated very soon |
| mix | … books can end up being outdated very quickly |
| Ground truth | … if you vote for a republican or vote for a democrat |
| lin | … if you vote for a republican or vote for a republican |
| mix | … if you vote for a republican or vote for a democrat |

Table 4: Some examples of the disambiguation abilities of lin vs mix.

Performances of MoS systems for different values of the regularization scale $\lambda$ are shown in Table 2. We run all mixture experiments with one fixed value of $\lambda$, chosen as a compromise between regularization and performance.
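The regularized training criterion can be sketched in a few lines. The use of the population variance, the averaging over positions and the names `regularized_loss` and `scale` are assumptions for illustration; the exact normalization in the experiments may differ.

```python
import math

def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def regularized_loss(target_probs, mixture_weights, scale):
    """Mean cross entropy plus `scale` times the mean mixture-weight variance.

    target_probs    : model probability of the correct word, one per position
    mixture_weights : one list of K mixture weights per position
    scale           : regularization strength (hypothetical name)
    """
    n = len(target_probs)
    cross_entropy = -sum(math.log(p) for p in target_probs) / n
    mean_var = sum(variance(pi) for pi in mixture_weights) / n
    return cross_entropy + scale * mean_var

# Uniform mixture weights incur no penalty, peaked ones do (toy numbers).
uniform = regularized_loss([0.5, 0.25], [[0.5, 0.5], [0.5, 0.5]], 0.1)
peaked = regularized_loss([0.5, 0.25], [[1.0, 0.0], [1.0, 0.0]], 0.1)
```

The penalty is zero when all components receive equal weight and grows as a single component dominates, which is how the term encourages equal contributions of the mixture components.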

The detailed mixture settings and perplexity results are summarized in Table 3. Specifically, we select “mos” to try to reproduce the “Softmax Bottleneck” paper [yang2017breaking] and the first “mix” setting to test a big mixture with one of each kernel. The remaining “mix” settings are selected randomly to explore the kernel combination space. We also experiment with more mixture settings, but unfortunately, with tied projection matrices, only those mixtures with the lin kernel give good performance. Note that the weight matrices are tied and multiple instances of the same kernel may be included in a mixture. In this case, each mixture component is still free to learn its own context vectors.

Compared to the individual kernels, the decoding speed of the mixture models is slowed down by a factor of two on average. The increased number of parameters because of context vector projection is negligible when the projection matrices are tied. As can be seen, all the mixture settings in Table 3 have similar performances to the simple lin setup in Table 1. This is very likely because they all have at least one linear component, and the linear components consistently receive a total weight above 50%. So we conclude that mixtures of kernels using a shared projection matrix cannot significantly improve over the baseline. We find no fundamental difference between the open-sourced “Mixture-of-Softmaxes” implementation [yang2017breaking] and ours. Unfortunately, we cannot replicate the results from the original paper. We do note that they use different datasets and include many more techniques like activation regularization and averaged SGD optimization.

4.5 Disambiguation Abilities

In theory, there is a potential drawback of the lin kernel used together with the softmax layer. Consider when two words $w_i$ and $w_j$ are close syntactically and/or semantically. It is a common observation that their corresponding word vectors are also close together after successful training [mikolov2013distributed, mikolov2013efficient, le2014distributed]. In this case, for any context vector $h$, the logits $w_i^\top h$ and $w_j^\top h$ will be similar as well. The alternative kernels studied here also suffer from this problem when $w_i \approx w_j$, but with non-linear activations the difference between the logits may be amplified, making it easier to disambiguate the words.

To show potentially better disambiguation properties of kernel mixtures, we take a more detailed look at the LM task. For the lin model, the projection matrix is extracted and the pairwise word distances are calculated using the inner product. This is then used to extract word clusters in the embedding space. Two of the extracted clusters are: {quickly, slowly, soon, quick, easily} and {republicans, politicians, democrat, republican, democrats}. We suspect that it might be difficult for the lin model to distinguish words in these clusters, as their similarity scores are very close. This is also what we observe when looking at the example sentences shown in Table 4. This suggests that even if diversifying the output layer with different kernels does not result in immediate improvements in terms of perplexity, a kernel-mixture-based method may still be superior in other aspects.
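The neighbor lookup behind such clusters can be sketched as below. The two-dimensional vectors are toy values made up for illustration, not trained embeddings, and `nearest_words` is a hypothetical helper; the actual cluster extraction may use a different procedure on top of the inner-product scores.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest_words(query, vocab, word_vecs, top_k=3):
    """Rank all other words by their inner product with the query word."""
    q = word_vecs[vocab.index(query)]
    scored = [(dot(q, w), word)
              for word, w in zip(vocab, word_vecs) if word != query]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_k]]

# Toy vocabulary and vectors (made up, not taken from the trained model).
vocab = ["quickly", "soon", "slowly", "republican", "democrat"]
word_vecs = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.0, 1.0], [0.1, 0.9]]
neighbors = nearest_words("quickly", vocab, word_vecs, top_k=2)
```

Words whose vectors score highly against each other in this way also receive similar logits under the lin kernel for any context vector, which is the situation the examples in Table 4 illustrate.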

5 Conclusion

Motivated by the similarity between the “Softmax Bottleneck” problem and the lack of expressiveness of a logistic regression model or an SVM with a simple linear kernel, we explore the use of kernel functions in the softmax layer for contextual word classification:

  1. In replacement of the inner product function, kernels and mixtures of kernels are used in the softmax layer. Our experiments with 9 different individual kernels on LM and MT exhibit a wide range of performances, with lin, pol, pow, ssg and mog being the best-performing ones.

  2. Examining the gradient properties, we give reasons why some kernels perform better than others and argue that the gradient properties of a kernel function across the domain of the parameters are worthy of careful consideration.

  3. In mixture settings consisting of at least one lin kernel, lin consistently receives a large weight.

  4. While the mixture model is not significantly better than the lin kernel in terms of perplexity, we observe cases where it is better at disambiguating similar words.

In our mixture experiments, projection matrices are shared due to memory constraints. This greatly limits the expressiveness of the model. The next step is to untie the word embeddings across different kernels and allow for the learning of even more complex decision boundaries.

6 Acknowledgements

This work has received funding from the European Research Council (ERC) (under the European Union’s Horizon 2020 research and innovation programme, grant agreement No 694537, project “SEQCLAS”) and the Deutsche Forschungsgemeinschaft (DFG; grant agreement NE 572/8-1, project “CoreTec”). The GPU computing cluster was supported by DFG (Deutsche Forschungsgemeinschaft) under grant INST 222/1168-1 FUGG.