1 Introduction
Log-linear models, a general class that includes conditional random fields (CRFs) and generalized linear models (GLMs), offer a flexible yet tractable approach to modeling conditional probability distributions $p(y \mid x)$ [1, 2]. When the set of possible $y$ values is large, however, the cost of computing a normalizing constant for each input $x$ can be prohibitive, involving a summation with many terms, a high-dimensional integral, or an expensive dynamic program.
The machine translation community has recently described several procedures for training “self-normalized” log-linear models [3, 4]. The goal of self-normalization is to choose model parameters that simultaneously yield accurate predictions and produce normalizers clustered around unity. Model scores can then be used as approximate surrogates for probabilities, obviating the normalizer computation.
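In particular, given a model of the form
(1) $p_\eta(y \mid x) = \exp\{\eta^\top T(x, y) - A(x, \eta)\}$
with
(2) $A(x, \eta) = \log \int_{\mathcal{Y}} \exp\{\eta^\top T(x, y)\}\, dy$
we seek a setting of $\eta$ such that $A(x, \eta)$ is close enough to zero (with high probability under $p(x)$) to be ignored.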
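To make the quantity being approximated concrete, the following minimal sketch computes the log-normalizer (the log of the sum of exponentiated scores) and exact log-probabilities for a finite output space. The feature vectors, parameter values, and function names are hypothetical and chosen only for illustration; a continuous output space would replace the sum with an integral.

```python
import math

def log_normalizer(eta, feats):
    """Log of the sum over outputs of exp(eta . T(x, y)).

    feats holds one feature vector T(x, y) per output y for a fixed input x.
    """
    scores = [sum(e * t for e, t in zip(eta, f)) for f in feats]
    m = max(scores)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_prob(eta, feats, y):
    """Exact log-probability: raw score minus the log-normalizer."""
    score = sum(e * t for e, t in zip(eta, feats[y]))
    return score - log_normalizer(eta, feats)

# Hypothetical toy input with three outputs and two features per output.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eta = [0.2, -0.4]
A = log_normalizer(eta, feats)
probs = [math.exp(log_prob(eta, feats, y)) for y in range(3)]
```

Self-normalization aims for parameters at which `A` is near zero, so that the raw score can stand in for `log_prob` and the sum over outputs can be skipped.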
This paper aims to understand the theoretical properties of selfnormalization. Empirical results have already demonstrated the efficacy of this approach—for discrete models with many output classes, it appears that normalizer values can be made nearly constant without sacrificing too much predictive accuracy, providing dramatic efficiency increases at minimal performance cost.
The broad applicability of self-normalization makes it likely to spread to other large-scale applications of log-linear models, including structured prediction (with combinatorially many output classes) and regression (with continuous output spaces). But it is not obvious that we should expect such approaches to be successful: the number of inputs (if finite) can be on the order of millions, the geometry of the resulting input vectors highly complex, and the class of log-normalizer functions $A(x, \eta)$ associated with different inputs quite rich. To find a nontrivial parameter setting with a roughly constant log-normalizer seems challenging enough; to require that the corresponding $\eta$ also lead to good classification results seems too much. And yet for many input distributions that arise in practice, it appears possible to choose $\eta$ to make the log-normalizer nearly constant without having to sacrifice classification accuracy.
Our goal is to bridge the gap between theoretical intuition and practical experience. Previous work [5] bounds the sample complexity of self-normalizing training procedures for a restricted class of models, but leaves open the question of how self-normalization interacts with the predictive power of the learned model. This paper seeks to answer that question. We begin by generalizing the previously studied model to a much more general class of distributions, including distributions with continuous support (Section 3). Next, we provide what we believe to be the first characterization of the interaction between self-normalization and model accuracy (Section 4). This characterization is given from two perspectives:

a bound on the “likelihood gap” between selfnormalized and unconstrained models

a conditional distribution provably hard to represent with a selfnormalized model
In Section 5, we present empirical evidence that these bounds correctly characterize the difficulty of selfnormalization, and in the conclusion we survey a set of open problems that we believe merit further investigation.
2 Problem background
The immediate motivation for this work is a procedure proposed to speed up decoding in a machine translation system with a neural-network language model [3]. The language model used is a standard feed-forward neural network, with a “softmax” output layer that turns the network’s predictions into a distribution over the vocabulary, where each probability is log-proportional to its output activation. It is observed that with a sufficiently large vocabulary, it becomes prohibitive to obtain probabilities from this model (which must be queried millions of times during decoding). To fix this, the language model is trained with the following objective:
$\max_\theta \sum_i \left[ N(y_i \mid x_i; \theta) - \log \sum_{y'} e^{N(y' \mid x_i; \theta)} - \alpha \left( \log \sum_{y'} e^{N(y' \mid x_i; \theta)} \right)^2 \right]$
where $N(y \mid x; \theta)$ is the response of output $y$ in the neural net with weights $\theta$ given an input $x$. From a Lagrangian perspective, the extra penalty term simply confines $\theta$ to the set of “empirically normalizing” parameters, for which all log-normalizers are close (in squared error) to the origin. For a suitable choice of $\alpha$, it is observed that the trained network is simultaneously accurate enough to produce good translations, and close enough to self-normalized that the raw scores can be used in place of log-probabilities without substantial further degradation in quality.
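As an illustration of this kind of penalized training criterion, the sketch below evaluates a log-likelihood with a squared log-normalizer penalty on raw scores. The function names, the penalty weight `alpha`, and the toy scores are assumptions made for this example; it is not the implementation used in [3].

```python
import math

def log_sum_exp(scores):
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def penalized_objective(score_rows, labels, alpha):
    """Sum of log-likelihoods minus alpha times squared log-normalizers.

    score_rows[i][y] is the raw model score of output y on example i.
    """
    total = 0.0
    for scores, y in zip(score_rows, labels):
        log_z = log_sum_exp(scores)
        total += (scores[y] - log_z) - alpha * log_z ** 2
    return total

# Scores that are already log-probabilities have log-normalizer zero.
normalized = [math.log(0.5), math.log(0.3), math.log(0.2)]
# Adding 1 to every score leaves the likelihood unchanged but is penalized.
shifted = [s + 1.0 for s in normalized]

unpenalized = penalized_objective([normalized], [0], alpha=0.1)
penalized = penalized_objective([shifted], [0], alpha=0.1)
```

The two score vectors define the same distribution, so the objective differs only through the penalty, which is what pushes training toward parameters whose raw scores can be read as log-probabilities.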
We seek to understand the observed success of these models in finding accurate, normalizing parameter settings. While it is possible to derive bounds of the kind we are interested in for general neural networks [6], in this paper we work with a simpler linear parameterization that we believe captures the interesting aspects of this problem.¹
¹ It is possible to view a log-linear model as a single-layer network with a softmax output. More usefully, all of the results presented here apply directly to trained neural nets in which only the last layer is retrained to self-normalize [7].
Related work
The approach described at the beginning of this section is closely related to an alternative self-normalization trick based on noise-contrastive estimation (NCE) [8]. NCE is an alternative to direct optimization of likelihood, instead training a classifier to distinguish between true samples from the model and “noise” samples from some other distribution. The structure of the training objective makes it possible to replace explicit computation of each log-normalizer with an estimate. In traditional NCE, these values are treated as part of the parameter space and estimated simultaneously with the model parameters; there exist guarantees that the normalizer estimates will eventually converge to their true values. It is instead possible to fix all of these estimates to one. In this case, empirical evidence suggests that the resulting model will also exhibit self-normalizing behavior [4].
A host of other techniques exist for solving the computational problem posed by the log-normalizer. Many of these involve approximating the associated sum or integral using quadrature [9], herding [10], or Monte Carlo methods [11]. For the special case of discrete, finite output spaces, an alternative approach (the hierarchical softmax) is to replace the large sum in the normalizer with a series of binary decisions [12]. The output classes are arranged in a binary tree, and the probability of generating a particular output is the product of probabilities along the edges leading to it. This reduces the cost of computing the normalizer from $O(k)$ to $O(\log k)$, where $k$ is the number of output classes. While this limits the set of distributions that can be learned, and still requires greater-than-constant time to compute normalizers, it appears to work well in practice. It cannot, however, be applied to problems with continuous output spaces.
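The tree-structured factorization can be sketched as follows, assuming a complete binary tree with heap-style node numbering and a logistic branching probability at each internal node. These layout choices and the node scores are hypothetical, and this is not the construction of [12] in detail.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probability(leaf, depth, node_scores):
    """Probability of a leaf in a complete binary tree with 2**depth leaves.

    Internal nodes are numbered heap-style from 1; node_scores[n] is the
    score for branching right at node n. Only `depth` nodes are visited.
    """
    node, prob = 1, 1.0
    for level in reversed(range(depth)):
        go_right = (leaf >> level) & 1  # next bit of the leaf index
        p_right = sigmoid(node_scores[node])
        prob *= p_right if go_right else 1.0 - p_right
        node = 2 * node + go_right
    return prob

depth = 3
# Hypothetical scores for internal nodes 1..7 (index 0 is unused).
node_scores = [0.0, 0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.5]
probs = [leaf_probability(k, depth, node_scores) for k in range(2 ** depth)]
```

Each leaf probability touches only `depth` internal nodes, yet the leaf probabilities sum to one by construction, so no sum over all classes is ever needed.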
3 Selfnormalizable distributions
We begin by providing a slightly more formal characterization of a general loglinear model:
Definition 1 (Loglinear models).
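Given a space of inputs $\mathcal{X}$, a space of outputs $\mathcal{Y}$, a measure $\mu$ on $\mathcal{Y}$, a nonnegative function $h : \mathcal{Y} \to \mathbb{R}$, and a function $T : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ that is $\mu$-measurable with respect to its second argument, we can define a log-linear model indexed by parameters $\eta \in \mathbb{R}^d$, with the form
(3) $p_\eta(y \mid x) = h(y) \exp\{\eta^\top T(x, y) - A(x, \eta)\}$
where
(4) $A(x, \eta) = \log \int_{\mathcal{Y}} h(y) \exp\{\eta^\top T(x, y)\}\, d\mu(y)$
If $A(x, \eta) < \infty$, then $\int_{\mathcal{Y}} p_\eta(y \mid x)\, d\mu(y) = 1$, and $p_\eta(y \mid x)$ is a probability density over $\mathcal{Y}$.²
² Some readers may be more familiar with generalized linear models, which also describe exponential family distributions with a linear dependence on input. The presentation here is strictly more general, and has a few notational advantages: it makes explicit the dependence of $A$ on $x$ and $\eta$ but not $y$, and lets us avoid tedious bookkeeping involving natural and mean parameterizations [13].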
We next formalize our notion of a selfnormalized model.
Definition 2 (Selfnormalized models).
The log-linear model $p_\eta(y \mid x)$ is self-normalized with respect to a set $S \subseteq \mathcal{X}$ if for all $x \in S$, $A(x, \eta) = 0$. In this case we say that $S$ is self-normalizable, and $\eta$ is self-normalizing w.r.t. $S$.
An example of a normalizable set is shown in (a), and we provide additional examples below:
Example.
Suppose
Then for either $x \in S$, $A(x, \eta) = 0$,
and $\eta$ is self-normalizing with respect to $S$.
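A small numerical check of exact self-normalization can be sketched as follows. The input set, feature function, and parameter values here are illustrative assumptions (a two-point input set, binary outputs, one interaction feature and one constant feature) and are not necessarily those of the example above.

```python
import math

def log_normalizer(x, eta, outputs=(-1, 1)):
    # Assumed features T(x, y) = [x * y, 1]: interaction plus a constant.
    return math.log(sum(math.exp(eta[0] * x * y + eta[1]) for y in outputs))

eta = (1.0, math.log(2.0 / 5.0))
S = (math.log(2.0), -math.log(2.0))

# On the set S the two exponentiated scores are 4/5 and 1/5, summing to one.
on_set = [log_normalizer(x, eta) for x in S]
# Away from S (here x = 0) the log-normalizer is log(4/5), not zero.
off_set = log_normalizer(0.0, eta)
```

The same parameters are therefore self-normalizing with respect to `S` but not with respect to any set that also contains the point 0.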
It is also easy to choose parameters that do not result in a selfnormalized distribution, and in fact to construct a target distribution which cannot be selfnormalized:
Example.
Suppose
Then there is no such that for all , and is constant if and only if .
As previously motivated, downstream uses of these models may be robust to small errors resulting from improper normalization, so it would be useful to generalize this definition of normalizable distributions to distributions that are only approximately normalizable. Exact normalizability of the conditional distribution is a deterministic statement: there either does or does not exist some $x \in \mathcal{X}$ that violates the constraint. In (a), for example, it suffices to have a single $x$ off of the indicated surface to make a set non-normalizable. Approximate normalizability, by contrast, is inherently a probabilistic statement, involving a distribution $p(x)$ over inputs. Note carefully that we are attempting to represent $p(y \mid x)$ but have no representation of (or control over) $p(x)$, and that approximate normalizability depends on $p(x)$ but not $p(y \mid x)$.
Informally, if some input violates the self-normalization constraint by a large margin, but occurs only very infrequently, there is no problem; instead we are concerned with expected deviation. It is also at this stage that the distinction between penalization of the normalizer vs. log-normalizer becomes important. The normalizer is necessarily bounded below by zero (so overestimates might appear much worse than underestimates), while the log-normalizer is unbounded in both directions. For most applications we are concerned with log probabilities and log-odds ratios, for which an expected normalizer close to zero is just as bad as one close to infinity. Thus the log-normalizer is the natural choice of quantity to penalize.
Definition 3 (Approximately selfnormalized models).
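The log-linear distribution $p_\eta(y \mid x)$ is $\delta$-approximately normalized with respect to a distribution $p(x)$ over $\mathcal{X}$ if $\mathbb{E}[A(X, \eta)^2] < \delta^2$. In this case we say that $p(x)$ is $\delta$-approximately self-normalizable, and $\eta$ is $\delta$-approximately self-normalizing.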
The sets of approximately selfnormalizing parameters for a fixed input distribution and feature function are depicted in (b). Unlike selfnormalizable sets of inputs, selfnormalizing and approximately selfnormalizing sets of parameters may have complex geometry.
Throughout this paper, we will assume that vectors of sufficient statistics $T(x, y)$ have bounded $\ell_2$ norm at most $R$, natural parameter vectors $\eta$ have $\ell_2$ norm at most $B$ (that is, they are Ivanov-regularized), and that vectors of both kinds lie in $\mathbb{R}^d$. Finally, we assume that all input vectors have a constant feature; in particular, that $x_0 = 1$ for every $x$ (with corresponding weight $\eta_0$).³
³ It will occasionally be instructive to consider the special case where $\mathcal{X}$ is the Boolean hypercube, and we will explicitly note where this assumption is made. Otherwise all results apply to general distributions, both continuous and discrete.
The first question we must answer is whether the problem of training self-normalized models is feasible at all; that is, whether there exist any exactly self-normalizable data distributions $p(x)$, or at least $\delta$-approximately self-normalizable distributions for small $\delta$. The first example above already gave an exactly normalizable distribution. In fact, there are large classes of both exactly and approximately normalizable distributions.
Observation.
Given some fixed $\eta$, consider the set $S_\eta = \{x \in \mathcal{X} : A(x, \eta) = 0\}$. Any distribution $p(x)$ supported on $S_\eta$ is normalizable. Additionally, every self-normalizable distribution is characterized by at least one such $\eta$.
This definition provides a simple geometric characterization of self-normalizable distributions. An example solution set is shown in (a). More generally, if $y$ is discrete and $T(x, y)$ consists of $|\mathcal{Y}|$ repetitions of a fixed feature function $t(x)$ (as in (a)), then we can write
(5) $A(x, \eta) = \log \sum_{y \in \mathcal{Y}} \exp\{\eta_y^\top t(x)\}$
Provided $\eta_y^\top t(x)$ is convex in $x$ for each $\eta_y$, the level sets of $A$ as a function of $x$ form the boundaries of convex sets. In particular, exactly normalizable sets are always the boundaries of convex regions, as in the simple example (a).
We do not, in general, expect real-world datasets to be supported on the precise class of self-normalizable surfaces. Nevertheless, it is very often observed that data of practical interest lie on other low-dimensional manifolds within their embedding feature spaces. Thus we can ask whether it is sufficient for a target distribution to be well-approximated by a self-normalizing one. We begin by constructing an appropriate measurement of the quality of this approximation.
Definition 4 (Closeness).
An input distribution is close to a set if
(6) 
In other words, is close to if a random sample from is no more than a distance from in expectation. Now we can relate the quality of this approximation to the level of selfnormalization achieved. Generalizing a result from [5], we have:
Proposition .
Suppose is close to . Then is approximately selfnormalizable (recalling that ).
(Proofs for this section may be found in Appendix A.)
The intuition here is that data distributions that place most of their mass in feature space close to normalizable sets are approximately normalizable on the same scale.
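This intuition can be checked numerically with a minimal sketch that estimates the mean squared log-normalizer by sampling. The one-dimensional model, the parameter values, and the two input distributions below are hypothetical; the exactly normalizable inputs of this toy model are the two points at plus and minus log 2.

```python
import math
import random

def log_normalizer(x, eta):
    # Assumed toy model: outputs y in {-1, 1}, features T(x, y) = [x * y, 1].
    return math.log(sum(math.exp(eta[0] * x * y + eta[1]) for y in (-1, 1)))

def mean_squared_log_normalizer(samples, eta):
    """Monte Carlo estimate of the expected squared log-normalizer."""
    return sum(log_normalizer(x, eta) ** 2 for x in samples) / len(samples)

rng = random.Random(0)
eta = (1.0, math.log(0.4))

# Inputs concentrated near the two exactly normalizable points.
near = [rng.choice((-1, 1)) * math.log(2.0) + rng.gauss(0.0, 0.05)
        for _ in range(2000)]
# Inputs spread widely, far from the normalizable set on average.
far = [rng.gauss(0.0, 2.0) for _ in range(2000)]

near_error = mean_squared_log_normalizer(near, eta)
far_error = mean_squared_log_normalizer(far, eta)
```

Under these assumptions the first distribution yields a much smaller mean squared log-normalizer than the second, in line with the proposition.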
4 Normalization and model accuracy
So far our discussion has concerned the problem of finding conditional distributions that self-normalize, without any concern for how well they actually perform at modeling the data. Here the relationship between the approximately self-normalized distribution $p_\eta(y \mid x)$ and the true distribution $p(y \mid x)$ (which we have so far ignored) is essential. Indeed, if we are not concerned with making a good model it is always trivial to make a normalized one: simply take $\eta = 0$ and then scale the constant-feature weight $\eta_0$ appropriately. We ultimately desire both good self-normalization and good data likelihood, and in this section we characterize the tradeoff between maximizing data likelihood and satisfying a self-normalization constraint.
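We achieve this characterization by measuring the likelihood gap between the classical maximum likelihood estimator and the MLE subject to a self-normalization constraint. Specifically, given pairs $((x_1, y_1), \ldots, (x_n, y_n))$, let $\ell(\eta \mid x, y) = \sum_i \log p_\eta(y_i \mid x_i)$. Then define
(7) $\hat\eta = \arg\max_\eta\, \ell(\eta \mid x, y)$
(8) $\hat\eta_\delta = \arg\max_{\eta : V(\eta) \le \delta^2} \ell(\eta \mid x, y)$
(where $V(\eta) = \frac{1}{n} \sum_i A(x_i, \eta)^2$).
We would like to obtain a bound on the likelihood gap, which we define as the quantity
(9) $\Delta_\ell(\hat\eta, \hat\eta_\delta) = \frac{1}{n} \left( \ell(\hat\eta \mid x, y) - \ell(\hat\eta_\delta \mid x, y) \right)$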
We claim:
Theorem 1.
Suppose has finite measure. Then asymptotically as
(10) 
(Proofs for this section may be found in Appendix B.)
This result lower-bounds the likelihood at $\hat\eta_\delta$ by explicitly constructing a scaled version of $\hat\eta$ that satisfies the self-normalization constraint. Specifically, if $\mu$ is chosen so that normalizers are penalized for distance from $\log \mu(\mathcal{Y})$ (e.g. the logarithm of the number of classes in the finite case), then any increase in $\eta$ along the span of the data is guaranteed to increase the penalty. From here it is possible to choose an $\alpha \in (0, 1)$ such that $\alpha\hat\eta$ satisfies the constraint. The likelihood at $\alpha\hat\eta$ is necessarily less than the likelihood at $\hat\eta_\delta$, and can be used to obtain the desired lower bound.
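The scaling construction can be illustrated with a minimal sketch for a finite output space. It assumes feature vectors of norm at most `R` and shrinks a parameter vector until its norm is at most `delta / R`; by the Cauchy-Schwarz inequality every score then lies in `[-delta, delta]`, so each log-normalizer is within `delta` of the log of the number of classes. All names and values are hypothetical.

```python
import math
import random

def log_normalizer(eta, feats):
    scores = [sum(e * t for e, t in zip(eta, f)) for f in feats]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def norm(v):
    return math.sqrt(sum(a * a for a in v))

rng = random.Random(0)
k, d, R, delta = 5, 4, 1.0, 0.1

def random_feature():
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = norm(v)
    return [R * a / n for a in v]  # rescaled to have norm exactly R

# 50 hypothetical inputs, each with one feature vector per output class.
inputs = [[random_feature() for _ in range(k)] for _ in range(50)]
eta_hat = [rng.gauss(0.0, 3.0) for _ in range(d)]  # stand-in for an MLE

# Shrink toward zero until the norm bound delta / R holds.
alpha = min(1.0, delta / (R * norm(eta_hat)))
eta_scaled = [alpha * e for e in eta_hat]

worst = max(abs(log_normalizer(eta_scaled, f) - math.log(k)) for f in inputs)
```

Shrinking along the direction of the unconstrained estimate keeps the ranking of outputs intact while trading likelihood for normalizer stability, which is the tradeoff the theorem quantifies.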
Thus at one extreme, distributions close to uniform can be selfnormalized with little loss of likelihood. What about the other extreme—distributions “as far from uniform as possible”? With suitable assumptions about the form of , we can use the same construction of a selfnormalizing parameter to achieve an alternative characterization for distributions that are close to deterministic:
Proposition .
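Suppose that $\mathcal{X}$ is a subset of the Boolean hypercube, $\mathcal{Y}$ is finite, and $T(x, y)$ is the conjunction of each element of $x$ with an indicator on the output class. Suppose additionally that in every input $x$, the MLE model $p_{\hat\eta}$ makes a unique best prediction; that is, for each $x \in \mathcal{X}$, there exists a unique $y^* \in \mathcal{Y}$ such that whenever $y \ne y^*$, $\hat\eta^\top T(x, y^*) > \hat\eta^\top T(x, y)$. Then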
(11) 
for distributiondependent constants and .
This result is obtained by representing the constrained likelihood with a second-order Taylor expansion about the true MLE. All terms in the likelihood gap vanish except for the remainder; this can be upper-bounded by the squared norm of the parameter displacement times the largest eigenvalue of the feature covariance matrix, which in turn is bounded using Gershgorin’s theorem (Appendix B.2).
The favorable rate we obtain for this case indicates that “all-nonuniform” distributions are also an easy class for self-normalization. Together with Theorem 1, this suggests that hard distributions must have some mixture of uniform and nonuniform predictions for different inputs. This is supported by the results in Section 5.
The next question is whether there is a corresponding lower bound; that is, whether there exist any conditional distributions for which all nearby distributions are provably hard to selfnormalize. The existence of a direct analog of Theorem 1 remains an open problem, but we make progress by developing a general framework for analyzing normalizer variance.
One key issue is that while likelihoods are invariant to certain changes in the natural parameters, the log-normalizers (and therefore their variance) are far from invariant. We therefore focus on equivalence classes of natural parameters, as defined below. Throughout, we will assume a fixed distribution $p(x)$ on the inputs $x$.
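The following sketch illustrates this non-invariance for a multiclass logistic model, assuming one weight vector per class. Adding the same vector to every class's weights leaves all predicted probabilities unchanged but shifts the log-normalizer by an input-dependent amount. The weights, the shift, and the input are hypothetical.

```python
import math

def class_scores(weights, x):
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]

def log_normalizer(weights, x):
    s = class_scores(weights, x)
    m = max(s)
    return m + math.log(sum(math.exp(v - m) for v in s))

def probabilities(weights, x):
    a = log_normalizer(weights, x)
    return [math.exp(v - a) for v in class_scores(weights, x)]

weights = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.7]]  # one row per class
shift = [2.0, -1.0]
shifted = [[w_i + v_i for w_i, v_i in zip(w, shift)] for w in weights]
x = [1.0, 3.0]

p_before = probabilities(weights, x)
p_after = probabilities(shifted, x)
# The log-normalizer moves by shift . x = 2.0 * 1.0 - 1.0 * 3.0 = -1.0.
normalizer_change = log_normalizer(shifted, x) - log_normalizer(weights, x)
```

Because the change in the log-normalizer depends on `x`, two parameterizations of the same conditional distribution can have very different normalizer variance, which is why variance is studied over equivalence classes.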
Definition 5 (Equivalence of parameterizations).
Two natural parameter values and are said to be equivalent (with respect to an input distribution ), denoted if
We can then define the optimal log normalizer variance for the distribution associated with a natural parameter value.
Definition 6 (Optimal variance).
We define the optimal log normalizer variance of the loglinear model associated with a natural parameter value by
We now specialize to the case where is finite with and where satisfies
This is an important special case that arises, for example, in multiway logistic regression. In this setting, we can show that despite the fundamental non-identifiability of the model, the variance can still be shown to be high under any parameterization of the distribution.
Theorem 2.
Let and let the input distribution be uniform on . There exists an such that for , ,
5 Experiments
The high-level intuition behind the results in the preceding section can be summarized as follows: 1) for predictive distributions that are in expectation high-entropy or low-entropy, self-normalization results in a relatively small likelihood gap; 2) for mixtures of high- and low-entropy distributions, self-normalization may result in a large likelihood gap. More generally, we expect that an increased tolerance for normalizer variance will be associated with a decreased likelihood gap.
In this section we provide experimental confirmation of these predictions. We begin by generating a set of random sparse feature vectors and an initial weight vector. In order to produce a sequence of label distributions that smoothly interpolate between low-entropy and high-entropy, we introduce a temperature parameter, and for various settings of the temperature draw labels from the correspondingly scaled model. We then fit a self-normalized model to these training pairs. In addition to the synthetic data, we compare our results to empirical data [3] from a self-normalized language model.
(a) plots the tradeoff between the likelihood gap and the error in the normalizer, under various distributions (characterized by their KL from uniform). Here the tradeoff between self-normalization and model accuracy can be seen: as the normalization constraint is relaxed, the likelihood gap decreases.
(b) shows how the likelihood gap varies as a function of the quantity . As predicted, it can be seen that both extremes of this quantity result in small likelihood gaps, while intermediate values result in large likelihood gaps.
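The data-generation step can be sketched as follows. The dimensions, the sparsity level, and the convention of dividing scores by the temperature are assumptions made for illustration and are not taken from the experimental setup itself.

```python
import math
import random

rng = random.Random(0)
k, d, n = 4, 20, 200  # classes, features, examples (all hypothetical)

def sparse_vector():
    # Each binary feature is active with probability 0.1.
    return [1.0 if rng.random() < 0.1 else 0.0 for _ in range(d)]

class_weights = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
inputs = [sparse_vector() for _ in range(n)]

def label_distribution(x, temperature):
    s = [sum(w_i * x_i for w_i, x_i in zip(w, x)) / temperature
         for w in class_weights]
    m = max(s)
    z = sum(math.exp(v - m) for v in s)
    return [math.exp(v - m) / z for v in s]

def sample_label(x, temperature):
    return rng.choices(range(k), weights=label_distribution(x, temperature))[0]

def mean_entropy(temperature):
    total = 0.0
    for x in inputs:
        p = label_distribution(x, temperature)
        total -= sum(q * math.log(q) for q in p if q > 0.0)
    return total / n

labels = [sample_label(x, 1.0) for x in inputs]
entropies = [mean_entropy(t) for t in (0.1, 1.0, 10.0)]
```

Under this convention, low temperatures give nearly deterministic labels and high temperatures give nearly uniform ones, so sweeping the temperature traces out the range between the two easy extremes.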
6 Conclusions
Motivated by the empirical success of selfnormalizing parameter estimation procedures for loglinear models, we have attempted to establish a theoretical basis for the understanding of such procedures. We have characterized both selfnormalizable distributions, by constructing provably easy examples, and selfnormalizing training procedures, by bounding the loss of likelihood associated with selfnormalization.
While we have addressed many of the important firstline theoretical questions around selfnormalization, this study of the problem is by no means complete. We hope this family of problems will attract further study in the larger machine learning community; toward that end, we provide the following list of open questions:

How else can the approximately self-normalizable distributions be characterized? The class of approximately normalizable distributions we have described is unlikely to correspond perfectly to real-world data. We expect that Section 3 can be generalized to other parametric classes, and relaxed to accommodate spectral or sparsity conditions.

Do corresponding lower bounds exist? While it is easy to construct exactly self-normalizable distributions (which suffer no loss of likelihood), we have empirical evidence that hard distributions also exist. It would be useful to lower-bound the loss of likelihood in terms of some simple property of the target distribution.

Is the hard distribution in Theorem 2 stable? This is related to the previous question. The existence of high-variance distributions is less worrisome if such distributions are comparatively rare. If the variance lower bound falls off quickly as the given construction is perturbed, then the associated distribution may still be approximately self-normalizable with a good rate.
We have already seen that new theoretical insights in this domain can translate directly into practical applications. Thus, in addition to their inherent theoretical interest, answers to each of these questions might be applied directly to the training of approximately selfnormalized models in practice. We expect that selfnormalization will find increasingly many applications, and we hope the results in this paper provide a first step toward a complete theoretical and empirical understanding of selfnormalization in loglinear models.
References
[1] Lafferty, J. D.; McCallum, A.; Pereira, F. C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML. 2001; pp 282–289.
[2] McCullagh, P.; Nelder, J. A. Generalized linear models; Chapman and Hall, 1989.
[3] Devlin, J.; Zbib, R.; Huang, Z.; Lamar, T.; Schwartz, R.; Makhoul, J. Fast and robust neural network joint models for statistical machine translation. Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2014.
[4] Vaswani, A.; Zhao, Y.; Fossum, V.; Chiang, D. Decoding with large-scale neural language models improves translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2013.
[5] Andreas, J.; Klein, D. When and why are log-linear models self-normalizing? Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics. 2014.
[6] Bartlett, P. L. IEEE Transactions on Information Theory 1998, 44, 525–536.
[7] Anthony, M.; Bartlett, P. Neural network learning: theoretical foundations; Cambridge University Press, 2009.
[8] Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. Proceedings of the International Conference on Artificial Intelligence and Statistics. 2010; pp 297–304.
[9] O’Hagan, A. Journal of Statistical Planning and Inference 1991, 29, 245–260.
[10] Chen, Y.; Welling, M.; Smola, A. Proceedings of the Conference on Uncertainty in Artificial Intelligence 2010, 109–116.
[11] Doucet, A.; De Freitas, N.; Gordon, N. An introduction to sequential Monte Carlo methods; Springer, 2001.
[12] Morin, F.; Bengio, Y. Proceedings of the International Conference on Artificial Intelligence and Statistics 2005, 246.
[13] Yang, E.; Allen, G.; Liu, Z.; Ravikumar, P. K. Graphical models via generalized linear models. Advances in Neural Information Processing Systems. 2012; pp 1358–1366.
Appendix A Normalizable distributions
Proof of Section 3 (distributions close to normalizable sets are approximately normalizable).
Let , where .
Then,
for ,  
Appendix B Normalization and likelihood
B.1 General bound
Lemma .
If , then is approximately normalized about .
Proof.
If ,
The case where is analogous, instead replacing with . The variance result follows from the fact that every logpartition is within of the mean. ∎
Proof of Theorem 1 (loss of likelihood is bounded in terms of distance from uniform).
Consider the likelihood evaluated at , where . We know that (if , then the MLE already satisfying the normalizing constraint). Additionally, is approximately normalized. (Both follow from subsection B.1.)
Then,
Because is convex in ,  
Thus,  
B.2 All-nonuniform bound
We make the following assumptions:

Labels are discrete. That is, for some .

. That is, each is a indicator vector drawn from the Boolean hypercube in dimensions.

Joint feature vectors are just the features of conjoined with the label . Then it is possible to think of as a sequence of vectors, one per class, and we can write .

As in the body text, let all MLE predictions be nonuniform, and in particular let each for .
Lemma .
For a fixed , the maximum covariance between any two features and under the model evaluated at some in the direction of the MLE:
(12) 
Proof.
If either or is not associated with the class , or associated with a zero element of , then the associated feature (and thus the covariance at ) is identically zero. Thus we assume that and are both associated with and correspond to nonzero elements of .
Suppose is the majority class. Then,  
Now suppose is not in the majority class. Then,  
Thus the covariance  
∎
Lemma .
Suppose for some . Then for a sequence of observations , under the model evaluated at , the largest eigenvalue of the feature covariance matrix
(13) 
is at most
(14) 
Proof.
From subsection B.2, each entry in the covariance matrix is at most . At most features are nonzero in any row of the matrix. Thus by Gershgorin’s theorem, the maximum eigenvalue of each term in Equation 13 is at most the largest absolute row sum, which is , and this is also an upper bound on the sum. ∎
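The Gershgorin step can be checked numerically with a minimal sketch: for a symmetric matrix, the largest eigenvalue is at most the largest value of a diagonal entry plus the absolute off-diagonal entries in its row. The random positive semidefinite matrix and the power-iteration helper below are illustrative assumptions, not part of the proof.

```python
import random

def gershgorin_upper_bound(matrix):
    """Largest Gershgorin disc endpoint: max_i (a_ii + sum_{j != i} |a_ij|)."""
    return max(
        row[i] + sum(abs(v) for j, v in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    )

def rayleigh_power_iteration(matrix, iterations=500):
    """Estimate the top eigenvalue of a symmetric PSD matrix."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        scale = sum(a * a for a in w) ** 0.5
        v = [a / scale for a in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n))  # Rayleigh quotient

rng = random.Random(0)
dim = 5
vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3)]
# Sum of outer products: symmetric and positive semidefinite by construction.
matrix = [[sum(u[i] * u[j] for u in vectors) for j in range(dim)]
          for i in range(dim)]

top = rayleigh_power_iteration(matrix)
bound = gershgorin_upper_bound(matrix)
```

When each row has only a few nonzero entries of bounded size, as in the lemma, the row-sum bound is small, which is what makes the eigenvalue bound useful.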
Proof of Section 4 (loss of likelihood goes as ).
As before, let us choose , with . We have already seen that this choice of parameter is normalizing.
Taking a secondorder Taylor expansion about , we have
where the firstorder term vanishes because is the MLE. It is a standard result for exponential families that the Hessian in the secondorder term is just Equation 13. Thus we can write  
The proposition follows. ∎
Appendix C Variance lower bound
Let
Lemma .
If , then equivalence of natural parameters is characterized by
Proof.
For , denote by the distribution over . Now, suppose that and fix . By the definition of equivalence, we have
which immediately implies
whence
Since this holds for all and , we get
That is, if we define
we get
and , as required.
Conversely, if , choose an appropriate . We then get
It follows that
so that
and the claim follows. ∎
The key tool we use to prove the theorem reinterprets as the norm of an orthogonal projection. We believe this may be of independent interest. To set it up, let be the Hilbert space of squareintegrable functions with respect to the input distribution , define
and
We then have
Lemma .
Let . Then
The second key observation, which we again believe is of independent interest, is that under certain circumstances, we can completely replace the normalizer by . For this, we define
and correspondingly let .
Proof.
By Lemma C, we have
But now, we observe that this can be rewritten with the aid of the isomorphism defined by the identity
to read
as required. ∎
Lemma .
Suppose for each , there is a unique such that and such that for , for some . Then
Proof.
Denote by the centered version of . Using the identity , we immediately see that
It follows that
We thus have
The claim follows. ∎
If we let
Corollary .
For , we have
Proof.
For this, observe first that if , then