1 Introduction
Neural autoregressive models are explicit density estimators that achieve state-of-the-art likelihoods for generative modeling [3, 4, 5, 7]. The D-dimensional data distribution is factorized into an autoregressive product of one-dimensional conditional distributions according to the chain rule. Each conditional distribution is parametrized by a shared neural network.
Data completion is a more involved task than data generation: the model must infer missing variables for any partially observed input vector. Previous work [5] introduced an order-agnostic training procedure for data completion with autoregressive models. It maximizes the average likelihood of the model over all orderings of the data dimensions. As a result, all possible one-dimensional conditionals p(x_{σ_d} | x_{σ_{<d}}) are trained, for any position d and any ordering σ of the D dimensions. Thus, missing variables in any partially observed input vector can be imputed efficiently by choosing an ordering where observed dimensions precede unobserved ones and by computing the autoregressive product in this order. This training procedure can be made efficient: [6] estimates the order-agnostic loss with an unbiased estimator that reuses most computations.
In this paper, we provide evidence that the order-agnostic (OA) training procedure is suboptimal for data completion. We propose an alternative procedure (OA++) that reaches better performance in fewer computations. It can handle all data completion queries while training fewer one-dimensional conditional distributions than the OA procedure. In addition, these one-dimensional conditional distributions are trained proportionally to their expected usage at inference time, reducing overfitting. Finally, our OA++ procedure can exploit prior knowledge about the distribution of inference completion queries, as opposed to OA. We support these claims with quantitative experiments on standard datasets used to evaluate autoregressive generative models.
2 Improving the order-agnostic loss for data completion
The OA procedure trains autoregressive models for data completion by optimizing the loss in Equation 1. In practice, the exact loss has too many terms to be computationally tractable. It is estimated by sampling a training vector x uniformly at random, a number d ∼ U{1, …, D} and a set of d − 1 conditioned variables σ_{<d} uniformly at random, and by computing \hat{\mathcal{L}} as in Equation 2. The sum in \hat{\mathcal{L}} is computed over all possible choices of the next variable σ_d, and the neural network computations involved can be reused across the terms of the sum.
(1)  \mathcal{L}_{\text{OA}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \, \mathbb{E}_{\sigma \sim \mathcal{U}(D!)} \left[ - \sum_{d=1}^{D} \log p(x_{\sigma_d} \mid x_{\sigma_{<d}}; \theta) \right]

(2)  \hat{\mathcal{L}}_{\text{OA}}(\theta) = \frac{D}{D - d + 1} \sum_{i \notin \sigma_{<d}} - \log p(x_i \mid x_{\sigma_{<d}}; \theta)
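A minimal numpy sketch of one draw of this estimator follows. The model interface `log_cond(i, mask, x)`, which returns log p(x_i | x restricted to the masked dimensions), is a hypothetical stand-in: in an actual NADE, a single shared forward pass scores all candidate next dimensions at once.

```python
import numpy as np

rng = np.random.default_rng(0)

def oa_loss_estimate(x, log_cond, D):
    """One-sample estimate of the order-agnostic loss (Equation 2 style).
    log_cond(i, mask, x) is a hypothetical model call returning
    log p(x_i | x[mask]); one shared forward pass can serve every i."""
    d = rng.integers(1, D + 1)                         # position d ~ U{1, ..., D}
    prefix = rng.choice(D, size=d - 1, replace=False)  # d-1 conditioned dimensions
    mask = np.zeros(D, dtype=bool)
    mask[prefix] = True
    # sum the NLL of every possible next dimension, then rescale by D/(D-d+1)
    terms = [-log_cond(i, mask, x) for i in range(D) if not mask[i]]
    return D / (D - d + 1) * float(np.sum(terms))
```

With a constant log-probability the rescaling makes the estimate independent of the sampled d, which is a quick sanity check of the D/(D − d + 1) factor.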
In this section, we present and motivate step-by-step the modifications we propose to the OA procedure (Equation 1), leading to the OA++ procedure (Equation 5).
2.1 Equal use of 1D conditional distributions at training and inference
Ideally, the usage of each 1D conditional distribution at inference time should be proportional to its usage at training time. Otherwise, some 1D conditional distributions frequently used at inference time might be undertrained (undermining performance), and some 1D conditional distributions rarely used at inference time might be overtrained (undermining efficiency). For 0 ≤ d < D, we refer to 1D conditional distributions with d conditioned variables, p(x_i | x_S) with |S| = d and i ∉ S, as 1D conditional distributions of size d.
Result: Under the OA procedure, at any training iteration, each 1D conditional distribution of size d has a fixed probability 1 / ((D − d) \binom{D}{d}) of being trained.

Proof: At any training iteration, exactly one 1D conditional distribution of size d is trained. It is chosen uniformly at random among the (D − d) \binom{D}{d} 1D conditional distributions of size d.
Result: Under the OA procedure, each 1D conditional distribution of size d has probability 1 / ((D − d) \binom{D}{d}) of being involved in any generation query.

Proof: Exactly one 1D conditional distribution of size d is involved in any generation query, as each of the D variables must be sampled in sequence. There are (D − d) \binom{D}{d} such distributions. The OA procedure chooses a variable ordering uniformly at random. Thus, any 1D conditional distribution of size d has an equal probability of being used in any generation query.
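Both counts can be checked by brute force on a toy dimensionality (a sketch only; enumerating all D! orderings is feasible just for tiny D):

```python
from itertools import permutations
from math import comb
from collections import Counter

D, d = 5, 2  # toy example; a size-d conditional has d conditioned variables

counts = Counter()
for sigma in permutations(range(D)):        # every ordering, each equally likely
    # step d+1 of ordering sigma uses p(x_{sigma[d]} | x_{sigma[:d]}),
    # i.e. one size-d conditional (target dimension, conditioning set)
    counts[(sigma[d], frozenset(sigma[:d]))] += 1

n_conditionals = (D - d) * comb(D, d)       # number of size-d conditionals
assert len(counts) == n_conditionals        # all of them appear
assert len(set(counts.values())) == 1       # each is used equally often
```

The two assertions mirror the two Results: there are (D − d) C(D, d) size-d conditionals, and a uniform random ordering hits each of them with equal probability.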
Therefore, under the OA procedure, the expected usage of any 1D conditional distribution at inference time is proportional to its expected usage at training time for generation queries. We are interested in data completion queries, however. In a completion query, the values of some variables are known, and thus conditional distributions for those variables do not need to be computed. This changes the expected usage patterns for each size of conditional distribution. Using the OA procedure leads to a discrepancy between the usage of 1D conditional distributions during training and inference. Further, if we know something in advance about the distribution of inference queries (e.g. which variables will be known, or how many will be known), the OA procedure has no way to exploit such prior knowledge.
The OA++ procedure we propose does not suffer from these limitations. It assumes that inference queries follow a distribution Q, and it trains all 1D conditional distributions proportionally to their expected usage during inference under this distribution. If we have prior knowledge about the expected structure of inference queries, it can be encoded in Q. If we have no such prior knowledge, OA++ sets Q to be a uniform distribution over inference queries, i.e., the set of observed variables in a completion query is drawn uniformly at random.
Instead of optimizing over all D! orderings of the D data dimensions as in the OA procedure, OA++ samples a completion query (o, u) from the expected distribution of inference queries Q, samples an ordering σ of the unobserved input u uniformly at random, and optimizes −log p(x_u | x_o; σ). Since |u| ≤ D, there are |u|! ≤ D! possible orderings of the unobserved input. The corresponding loss is expressed in Equation 3:
(3)  \mathcal{L}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \, \mathbb{E}_{(o, u) \sim Q} \, \mathbb{E}_{\sigma \sim \mathcal{U}(|u|!)} \left[ - \sum_{d=1}^{|u|} \log p(x_{\sigma_d} \mid x_o, x_{\sigma_{<d}}; \theta) \right]
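One term of this loss can be sketched as follows. As before, `log_cond(i, mask, x)` is a hypothetical model call returning log p(x_i | x restricted to the masked dimensions); with a uniform Q, the `observed` mask itself would also be drawn at random, e.g. `rng.random(D) < 0.5`.

```python
import numpy as np

def completion_nll(x, log_cond, observed, rng):
    """Negative log-likelihood of one completion query (an Equation 3 term).
    observed is a boolean mask of the conditioned dimensions;
    log_cond(i, mask, x) is a hypothetical model call for log p(x_i | x[mask])."""
    unobs = np.flatnonzero(~observed)
    sigma = rng.permutation(unobs)   # uniform ordering of the unobserved dims
    mask = observed.copy()
    nll = 0.0
    for i in sigma:                  # autoregressive product over u only
        nll -= log_cond(i, mask, x)
        mask[i] = True               # i is conditioned on by later dimensions
    return nll
```

Note that only |u| conditionals are evaluated per query: the observed dimensions never contribute terms, which is exactly the discrepancy with the OA loss discussed above.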
2.2 Training fewer 1D conditional distributions
The OA procedure maximizes the average log-likelihood of the model over all orderings: in Equation 1, the ordering σ is treated as a uniform random variable over its D! values. Consequently, all 1D conditional distributions are trained. However, these conditional distributions will not all be used. To handle any given query, the model must either fix an order of the unobserved variables, or use an ensemble of K orderings of the unobserved variables as in [5]. In most settings, K ≪ D!, thus far fewer 1D conditionals will be used than were trained. Since the parameters of all conditionals are determined by a single shared neural network, the model is wasting its representational capacity on 1D conditionals that will not be used at inference time.

The OA++ procedure we propose instead trains only the conditional distributions compatible with one of K fixed orderings, at most K(2^D − 1) of them, much fewer than the D 2^{D−1} total 1D conditional distributions, and it can still handle any completion query by using an ensemble of K orderings. To do so, we fix in advance K orderings σ^1, …, σ^K of the D dimensions. For any completion query (o, u), an ordering σ^k is sampled uniformly at random from {σ^1, …, σ^K}. The autoregressive sum is computed over the unobserved input u (data dimensions are ordered according to σ^k). Fundamentally, OA++ treats the ordering as a uniform random variable over K values, instead of D! values for OA.
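The ordering machinery OA++ fixes before training can be sketched in a few lines (variable names are illustrative; the K orderings are drawn once and then frozen):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 6, 4
# K full orderings of the D dimensions, fixed once before training
orderings = [rng.permutation(D).tolist() for _ in range(K)]

def restrict(order, observed):
    """Relative order that a full ordering induces on the unobserved dims."""
    return [i for i in order if i not in observed]

observed = {0, 3}                    # observed dimensions of one completion query
k = int(rng.integers(K))             # one of only K choices, not |u|! of them
completion_order = restrict(orderings[k], observed)
```

Because any query reuses the relative order of one of the K frozen orderings, the set of conditionals the network must represent stays small regardless of which variables turn out to be observed.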
The OA++ loss for data completion is expressed in Equation 5. Q is the distribution of inference queries if we have prior knowledge of it; otherwise Q is the uniform distribution over completion queries. K is the maximum number of orderings that we expect to average over by ensembling at inference time. For a given completion query (o, u), the data dimensions of u are reordered according to ordering σ^k, yielding \tilde{u} = σ^k(u).
(4)  \ell(x, o, u, k; \theta) = - \sum_{d=1}^{|u|} \log p(x_{\tilde{u}_d} \mid x_o, x_{\tilde{u}_{<d}}; \theta), \qquad \tilde{u} = \sigma^k(u)

(5)  \mathcal{L}_{\text{OA++}}(\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \, \mathbb{E}_{(o, u) \sim Q} \, \mathbb{E}_{k \sim \mathcal{U}(K)} \left[ \ell(x, o, u, k; \theta) \right]
Unbiased Estimator The OA++ loss in Equation 5 has a large number of terms. It can be estimated by sampling a training example x uniformly at random, a completion query (o, u) ∼ Q, an ordering index k ∼ U(K), a number d ∼ U{1, …, |u|}, and by computing \hat{\mathcal{L}}_{\text{OA++}} as in Equation 6.
(6)  \hat{\mathcal{L}}_{\text{OA++}}(\theta) = - |u| \log p(x_{\tilde{u}_d} \mid x_o, x_{\tilde{u}_{<d}}; \theta), \qquad \tilde{u} = \sigma^k(u)
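A sketch of this single-term estimate, again with the hypothetical model call `log_cond(i, mask, x)` standing in for log p(x_i | x restricted to the masked dimensions):

```python
import numpy as np

def oa_pp_single_term(x, log_cond, observed, order_k, rng):
    """Single-term unbiased estimate of the OA++ loss (Equation 6 style):
    score one uniformly chosen completion step, rescaled by |u|."""
    u = [i for i in order_k if not observed[i]]  # unobserved dims in sigma^k order
    d = int(rng.integers(len(u)))                # uniform step within the query
    mask = observed.copy()
    for i in u[:d]:
        mask[i] = True                           # condition on the sigma^k prefix
    return -len(u) * log_cond(u[d], mask, x)
```

Averaging over d recovers the full sum of Equation 4, since each of the |u| steps is chosen with probability 1/|u| and the result is rescaled by |u|.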
It is possible to provide an unbiased estimator of the loss that consists of a sum with computations shared among its different terms, as in Equation 2. However, when K is small, the sum would likely consist of a single term, effectively reducing to Equation 6.
Unifying data generation and data completion
By choosing Q concentrated on the generation query (o = ∅, u = {1, …, D}) and K = 1, OA++ reduces to the original training procedure for NADEs introduced in [2].
By choosing Q concentrated on the generation query (o = ∅) and K = D!, OA++ reduces to the order-agnostic training procedure for data generation introduced in [6]. In other words, OA++ unifies the training procedures of data generation and data completion under a single framework.
3 Results
In order to compare OA and OA++, we train the same autoregressive model (a two-layer NADE) with both procedures. We conduct experiments on eight multivariate binary datasets commonly used in previous work on autoregressive models [1, 2, 5]. Table 1 reports the performance of all models on two test sets of inference queries: one consists of uniform random completion queries; the other consists of completion queries of a fixed size, picked at random. On the second test set, models were provided prior knowledge of the distribution of inference queries. Results are computed for K = 1 (no ensemble learning). The comparison of OA and OA++ for K > 1 is left as future work.
Performance at convergence OA++ outperforms OA on all experiments and does especially well when given prior knowledge about the distribution of inference queries.
Table 1: Negative log-likelihoods on the two test sets of inference queries (lower is better).

Model | Adult      | DNA         | Mushrooms | NIPS-0-12     | Connect-4  | OCR-letters | RCV1        | Web
OA    | 9.8 / 13.6 | 87.2 / 90.7 | 5.2 / 9.5 | 277.0 / 280.7 | 9.4 / 14.7 | 31.7 / 37.8 | 47.0 / 48.1 | 28.8 / 30.1
OA++  | 7.8 / 11.9 | 78.2 / 83.3 | 4.2 / 7.8 | 272.0 / 276.4 | 4.7 / 9.5  | 23.0 / 27.7 | 46.0 / 47.2 | 27.9 / 29.1
Speed of convergence OA++ converges in fewer computations than OA. OA++ also does not suffer from overfitting, while OA sometimes does. Figure 1 reports the evolution of the training and validation loglikelihoods with the number of computations, for some experiments.
References

[1] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 881–889, 2015.
[2] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, 2011.
[3] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[4] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[5] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. Journal of Machine Learning Research, 17(205):1–37, 2016.
[6] Benigno Uria, Iain Murray, and Hugo Larochelle. A deep and tractable density estimator. In International Conference on Machine Learning, pages 467–475, 2014.
[7] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.