1 Introduction
The work presented in this paper is motivated by the prediction problem in the context of autonomous driving. Prediction methods transform the history of perception data up to the current time step into a representation of how the environment will evolve over a short time horizon. The process is fraught with challenges due to dynamic interactions and the myriad possible events that ensue.
Due to their predictive modeling abilities, deep learning models have emerged that leverage the power of RNNs for sequence prediction and CNNs to encode raw sensor data and prior scene features [19, 2, 25, 16, 22, 23, 8, 3, 17]. In this regard, advanced prediction models exhibit the following characteristics:

probabilistic: probability distributions reflecting future state uncertainties,

multimodal: reproducing the rich diversity of states,

context driven: support for interactive, multi-actor and contextual reasoning,

efficient: results are produced to match the sensing rates, and

general: capable of reasoning about novel inputs in a stable and consistent way.
However, current prediction methods are limited in how they express uncertainty, multimodality, and interaction, and/or they rely on expensive sampling steps. We address those constraints using a novel method called Hyper-Conditioned Neural Autoregressive Flow (HCNAF). HCNAF performs exact likelihood inference by precisely computing the probability of arbitrarily complex target distributions, with or without conditions. As an alternative to traditional trajectory predictions, we apply HCNAF to produce probabilistic occupancy maps (POMs) conditioned on the scene contexts (see Figure 1). By directly obtaining exact probabilities on the POM, HCNAF removes the need to sample trajectories from distributions.
We present results on self-driving scenarios, but we first report results from density estimation tasks to investigate HCNAF’s generalization capability over diverse conditions.
2 Background
Flow, or normalizing flow, is a type of deep generative model that aims to learn the data distribution via the principle of maximum likelihood [7], so as to generate new data and/or estimate the likelihood of a target distribution.
Flow-based models construct an invertible function $f(x) = z$ between a latent variable $z$ and a random variable $x$, which allows the computation of the exact likelihood of an unknown data distribution $p_X(x)$ using a known pdf $p_Z(z)$ (e.g. a normal distribution), via the change of variables theorem:

$$p_X(x) = p_Z\big(f(x)\big) \left| \det \frac{\partial f(x)}{\partial x} \right| \tag{1}$$

In addition, flow offers data generation capability by sampling latent variables $z \sim p_Z(z)$ and passing them through $f^{-1}$. As the accuracy of the approximation increases, the modeled pdf converges to the true $p_X(x)$ and the quality of the generated samples also improves.
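As an illustration of the change of variables theorem, the following minimal sketch evaluates an exact log-likelihood and draws samples with a toy flow. It assumes an elementwise affine map as the invertible function and a standard normal base density; the names `affine_flow`, `flow_logpdf`, and `flow_sample` are hypothetical and are not part of HCNAF.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of a standard normal, summed over the last axis."""
    return -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=-1)

def affine_flow(x, scale, shift):
    """Toy invertible map f(x) = (x - shift) / scale, applied elementwise.

    Returns z = f(x) and log|det df/dx|, which is -sum(log|scale|)
    because the Jacobian of an elementwise map is diagonal.
    """
    z = (x - shift) / scale
    log_det = -np.sum(np.log(np.abs(scale))) * np.ones(x.shape[:-1])
    return z, log_det

def flow_logpdf(x, scale, shift):
    """Exact log p_X(x) = log p_Z(f(x)) + log|det df/dx|."""
    z, log_det = affine_flow(x, scale, shift)
    return standard_normal_logpdf(z) + log_det

def flow_sample(n, scale, shift, rng):
    """Sample z from the base density and push it through the inverse map."""
    z = rng.standard_normal((n, scale.shape[0]))
    return z * scale + shift
```

With this toy map the modeled density is a Gaussian with mean `shift` and standard deviation `scale`, so the returned log-likelihood can be checked against the closed-form Gaussian log-density.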
In contrast to other classes of deep generative models (namely VAE[13] and GAN[6]), flow is an explicit density model and offers unique properties:

Computation of an exact probability, which is essential in the POM forecasting task. VAE infers the data likelihood using a computable term, the Evidence Lower BOund (ELBO). However, it is unclear how the ELBO can be used for tasks that require an exact probability computation, and uncertain how well the ELBO actually approximates the likelihood, as its upper bound is unknown. While GAN has proved its power in generating high-quality samples for image generation and translation tasks [12, 4], it is unclear how a likelihood estimate and/or a probability for the generated samples can be obtained.

The expressivity of flow-based models, which allows the models to capture complex data distributions. A recently published autoregressive flow (AF) model called Neural Autoregressive Flow (NAF) [11] unified earlier AF models including [14, 21] by generalizing their affine transformations to arbitrarily complex non-linear monotonic transformations. Conversely, the default VAE uses unimodal Gaussians for the prior and the posterior distributions. In order to increase the expressivity of VAE, some have introduced more expressive priors [26] and posteriors [15, 1] that leverage flow.
The class of invertible neural-net based autoregressive flows, including NAF and BNAF [5], is capable of approximating rich families of distributions, since it was shown to be a universal approximator for continuous pdfs.
However, NAF and BNAF do not handle external conditions (e.g. classes or categories in the context of GAN vs cGAN [20]). That is, those models are designed to compute $p(x_d \mid x_{1:d-1})$ conditioned on previous inputs $x_{1:d-1}$ autoregressively to formulate $p(x_{1:D})$. This formulation is not suitable for taking arbitrary conditions other than the autoregressive ones. This limits the extension of NAF to applications that work with conditional probabilities $p(X \mid C)$, such as the POM forecasting problem.

$$x_d = \mu(x_{1:d-1}, C) + \sigma(x_{1:d-1}, C)\, z_d \tag{2}$$

[21] proposed MAF, which models affine flow transformations, and cMAF, which additionally takes external conditions. As shown in Equation 2, the transformation between $z_d$ and $x_d$ is affine, and the influence of $C$ over the transformation relies on $\mu$, $\sigma$, and stacking multiple flows. These may limit the contribution of $C$ to the transformation. This explains the need for a conditional autoregressive flow that does not have such an expressivity bottleneck.
3 Hyper-Conditioned Neural Autoregressive Flow (HCNAF)
We propose Hyper-Conditioned Neural Autoregressive Flow (HCNAF), a novel autoregressive flow where a transformation between $X = [x_1, \dots, x_D]$ and $Z = [z_1, \dots, z_D]$ is modeled using a non-linear neural network $f(X; \theta) = Z$ whose parameters $\theta$ are determined by arbitrarily complex conditions $C$ in non-autoregressive fashion, via a separate neural network $f_H(C) = \theta$. $f_H$ is designed to compute the parameters for $f$, and is thus classified as a hypernetwork [9]. HCNAF models a conditional joint distribution $p(x_1, \dots, x_D \mid C)$ autoregressively on $x_{1:D}$, by factorizing it over $D$ conditional distributions $\prod_{d=1}^{D} p(x_d \mid x_{1:d-1}, C)$.

NAF [11] and HCNAF both use neural networks, but they differ in probability modeling, conditioner network structure, and the flow transformation function, as elaborated below:

$$p(x_1, \dots, x_D) = \prod_{d=1}^{D} p(x_d \mid x_{1:d-1}), \qquad z_d = f(x_d;\, \theta_d), \qquad \theta_d = f_c(x_{1:d-1}) \tag{3}$$

$$p(x_1, \dots, x_D \mid C) = \prod_{d=1}^{D} p(x_d \mid x_{1:d-1}, C), \qquad z_d = f(x_d;\, x_{1:d-1}, \theta), \qquad \theta = f_H(C) \tag{4}$$

In Equation 3, NAF uses a conditioner network $f_c$ to obtain the parameters $\theta_d$ for the transformation between $x_d$ and $z_d$, which is parameterized by the autoregressive conditions $x_{1:d-1}$. In contrast, in Equation 4, HCNAF models the transformation to be parameterized on both $x_{1:d-1}$ and arbitrarily large external conditions $C$ in non-autoregressive fashion via the hypernetwork $f_H$. For probability modeling, the difference between the two is analogous to the difference between VAE [13] and conditional VAE [24], and that between GAN [6] and conditional GAN [20].
As illustrated in Figure 1, HCNAF consists of two main modules: 1) a neural-net based conditional autoregressive flow, and 2) a hypernetwork which computes the parameters (i.e. weights and biases) of 1). The modules are detailed in the following subsections.
3.1 NN-based Conditional Autoregressive Flow
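The proposed conditional AF is a bijective neural network $f(X; W, B): X \rightarrow Z$, which models the transformation between random variables $X$ and latent variables $Z$. The network parameters $W, B := f_H(C)$ are determined by the hypernetwork $f_H$. The main difference between regular feed-forward neural networks and flow models is the invertibility of $f$, whereas regular networks are not typically invertible.

The conditional AF is shown in Figure 2. In each dimension $d$ of the flow, the bijective transformation between $x_d$ and $z_d$ is modeled with a multi-layer perceptron (MLP) with $n$ hidden layers as follows:

$$x_d \leftrightarrow h_d^{l_1} \leftrightarrow h_d^{l_2} \leftrightarrow \dots \leftrightarrow h_d^{l_n} \leftrightarrow z_d \tag{5}$$

The connection between two adjacent hidden layers $h_d^{l_{k-1}}$ and $h_d^{l_k}$ is defined as:

$$h_d^{l_k} = \phi\Big( W_{dd}^{l_k} h_d^{l_{k-1}} + \sum_{r=1}^{d-1} W_{dr}^{l_k} h_r^{l_{k-1}} + B_d^{l_k} \Big) \tag{6}$$

where the subscript and the superscript denote the flow number and the layer number, respectively. Specifically, $h_d^{l_k}$ is the hidden layer $l_k$ of the $d$-th flow. $W_{dr}^{l_k}$ and $B_d^{l_k}$ denote the weight matrix which defines the contributions to the hidden layer $l_k$ of the $d$-th flow from the hidden layer $l_{k-1}$ of the $r$-th flow, and the bias matrix which defines the contributions to the hidden layer $l_k$ of the $d$-th flow. Finally, $\phi$ is an activation function.
The connections between $x_d$ and the first hidden layer, and between the last hidden layer and $z_d$, are defined as:

$$h_d^{l_1} = \phi\Big( W_{dd}^{l_1} x_d + \sum_{r=1}^{d-1} W_{dr}^{l_1} x_r + B_d^{l_1} \Big), \qquad z_d = W_{dd}^{l_{n+1}} h_d^{l_n} + \sum_{r=1}^{d-1} W_{dr}^{l_{n+1}} h_r^{l_n} + B_d^{l_{n+1}} \tag{7}$$

$h^{l_k}$ are the hidden units at the hidden layer $l_k$ across all flow dimensions and are expressed as:

$$h^{l_k} = \phi\big( W^{l_k} h^{l_{k-1}} + B^{l_k} \big) \tag{8}$$

where $W^{l_k}$ and $B^{l_k}$ are the weight and bias matrices at the hidden layer $l_k$ across all flow dimensions:

$$W^{l_k} = \begin{bmatrix} W_{11}^{l_k} & 0 & \cdots & 0 \\ W_{21}^{l_k} & W_{22}^{l_k} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ W_{D1}^{l_k} & W_{D2}^{l_k} & \cdots & W_{DD}^{l_k} \end{bmatrix}, \qquad B^{l_k} = \begin{bmatrix} B_1^{l_k} \\ B_2^{l_k} \\ \vdots \\ B_D^{l_k} \end{bmatrix} \tag{9}$$

Likewise, W and B denote the weight and bias matrices for all flow dimensions across all the layers. Specifically, $W := \{W^{l_1}, \dots, W^{l_{n+1}}\}$ and $B := \{B^{l_1}, \dots, B^{l_{n+1}}\}$.
Finally, $Z$ is obtained by applying Equations 7 and 8 through all the network layers, from the first to the last layer.
We designed HCNAF so that the hidden layer units $h_d^{l_k}$ are connected to the hidden units of the previous layer $h_{1:d}^{l_{k-1}}$, inspired by BNAF, as opposed to taking $x_{1:d-1}$ as inputs to a separate hypernetwork to produce the parameters over $d = 1, \dots, D$, such as presented in NAF. This approach avoids running the hypernetwork $D$ times, an expensive operation for large hypernetworks. By designing the hypernetwork to output W and B all at once, we reduce the computation load, while allowing the hidden states across all layers and all dimensions to contribute to the flow transformation, as $z_d$ is conditioned not only on $x_{1:d}$, but also on all the hidden layers.
All flow models must satisfy the following two properties: 1) monotonicity of $f$ to ensure its invertibility, and 2) tractable computation of the Jacobian matrix determinant $\left| \det \frac{dZ}{dX} \right|$.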
3.1.1 Invertibility of the Autoregressive Flow
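The monotonicity requirement is equivalent to having $\frac{\partial z_d}{\partial x_d} > 0$ in every flow dimension $d$, which is further factorized over the hidden layers $h_d^{l_1}, \dots, h_d^{l_n}$ of the $d$-th flow as:

$$\frac{\partial z_d}{\partial x_d} = \frac{\partial z_d}{\partial h_d^{l_n}} \frac{\partial h_d^{l_n}}{\partial h_d^{l_{n-1}}} \cdots \frac{\partial h_d^{l_2}}{\partial h_d^{l_1}} \frac{\partial h_d^{l_1}}{\partial x_d} \tag{10}$$

where $\frac{\partial h_d^{l_k}}{\partial h_d^{l_{k-1}}}$ is expressed as:

$$\frac{\partial h_d^{l_k}}{\partial h_d^{l_{k-1}}} = \frac{\partial \phi(A_d^{l_k})}{\partial A_d^{l_k}} \frac{\partial A_d^{l_k}}{\partial h_d^{l_{k-1}}} = \mathrm{diag}\big(\phi'(A_d^{l_k})\big)\, W_{dd}^{l_k} \tag{11}$$

$A_d^{l_k}$ denotes the pre-activation of $h_d^{l_k}$. Invertibility is satisfied by choosing a strictly increasing activation function $\phi$ (e.g. tanh or sigmoid) and strictly positive diagonal weight blocks $W_{dd}^{l_k}$. $W_{dd}^{l_k}$ is made strictly positive by applying an element-wise exponential to all of its entries at the end of the hypernetwork, inspired by [5]. Note that the operation is omitted for the non-diagonal blocks $W_{dr}^{l_k}$, $r \neq d$.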
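To make the construction concrete, the following is a minimal NumPy sketch of a conditional AF whose weights are generated from a condition vector. It is an illustration under stated assumptions rather than the implementation used in this work: the hypernetwork is replaced by a single random linear layer (`make_toy_hypernetwork`, a hypothetical helper), tanh is used as the activation, the last layer is left without activation, and all function names are illustrative.

```python
import numpy as np

def block_masks(D, out_per_dim, in_per_dim):
    """Boolean masks over a (D*out_per_dim, D*in_per_dim) weight matrix.

    lower keeps blocks (d, r) with r <= d; diag keeps blocks with r == d.
    """
    row_dim = np.repeat(np.arange(D), out_per_dim)[:, None]
    col_dim = np.repeat(np.arange(D), in_per_dim)[None, :]
    return col_dim <= row_dim, col_dim == row_dim

def make_toy_hypernetwork(cond_dim, D, n_hidden, H, rng):
    """Hypothetical stand-in for the hypernetwork: one random linear map per layer."""
    widths = [1] + [H] * n_hidden + [1]  # units per flow dimension in each layer
    layers = []
    for w_in, w_out in zip(widths[:-1], widths[1:]):
        shape_W = (D * w_out, D * w_in)
        layers.append({
            "shape_W": shape_W,
            "A_W": 0.1 * rng.standard_normal((shape_W[0] * shape_W[1], cond_dim)),
            "A_B": 0.1 * rng.standard_normal((shape_W[0], cond_dim)),
            "masks": block_masks(D, w_out, w_in),
        })
    return layers

def conditional_af(x, C, layers):
    """Map x (shape (D,)) to z (shape (D,)) using weights generated from C."""
    h = x
    for i, layer in enumerate(layers):
        raw_W = (layer["A_W"] @ C).reshape(layer["shape_W"])
        lower, diag = layer["masks"]
        # exp on the diagonal blocks keeps dz_d/dx_d > 0; zeroing the blocks
        # above the diagonal makes z_d depend only on x_1..x_d.
        W = np.where(diag, np.exp(raw_W), raw_W) * lower
        B = layer["A_B"] @ C
        pre = W @ h + B
        h = pre if i == len(layers) - 1 else np.tanh(pre)
    return h
```

In this sketch the condition `C` is processed once to produce every layer's weights and biases, and the autoregressive structure is enforced purely by the block lower-triangular masks.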
3.1.2 Tractable Computation of Jacobian Determinant
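The second requirement for flow models is to efficiently compute the Jacobian matrix determinant $\left| \det \frac{dZ}{dX} \right|$, where:

$$\frac{dZ}{dX} = \frac{dZ}{dh^{l_n}} \frac{dh^{l_n}}{dh^{l_{n-1}}} \cdots \frac{dh^{l_2}}{dh^{l_1}} \frac{dh^{l_1}}{dX} \tag{12}$$

Since we designed the weight matrix of every layer to be block lower-triangular, the product of lower-triangular matrices, $\frac{dZ}{dX}$, is also lower-triangular, and its determinant is then simply the product of the diagonal entries. The log determinant is therefore $\log \left| \det \frac{dZ}{dX} \right| = \log \prod_{d=1}^{D} \frac{\partial z_d}{\partial x_d}$, as our formulation ensures $\frac{\partial z_d}{\partial x_d} > 0$. Finally, $\log \left| \det \frac{dZ}{dX} \right|$ is expressed via Equations 10 and 11.

$$\log \left| \det \frac{dZ}{dX} \right| = \sum_{d=1}^{D} \log \Big( W_{dd}^{l_{n+1}}\, \mathrm{diag}\big(\phi'(A_d^{l_n})\big)\, W_{dd}^{l_n} \cdots \mathrm{diag}\big(\phi'(A_d^{l_1})\big)\, W_{dd}^{l_1} \Big) \tag{13}$$

Equation 13 involves the multiplication of matrices of different sizes, and thus cannot be broken down into a regular sum of logs. To resolve this issue, we utilize log-sum-exp operations, as is common in the flow community (e.g. NAF [11] and BNAF [5]), for numerical stability and efficiency of the computation. This approach to computing the Jacobian determinant is similar to the one presented in BNAF, as our conditional AF resembles its flow model.
As HCNAF is a member of the monotonic neural-net based autoregressive flow family like NAF and BNAF, we rely on the proofs presented in NAF and BNAF to claim that HCNAF is also a universal distribution approximator.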
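The log-sum-exp computation of one diagonal Jacobian term can be sketched as follows. This is a minimal illustration, assuming strictly positive diagonal weight blocks and a strictly increasing activation so that every factor has a well-defined logarithm; the helper names `log_matmul` and `log_diag_jacobian` are hypothetical.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def log_matmul(log_A, log_B):
    """log(A @ B) for entrywise-positive A (p, q) and B (q, r), given their logs."""
    return logsumexp(log_A[:, :, None] + log_B[None, :, :], axis=1)

def log_diag_jacobian(log_W_blocks, log_dphi):
    """log(dz_d/dx_d) for one flow dimension, computed entirely in log space.

    log_W_blocks: logs of the positive diagonal blocks, first layer first,
                  with shapes (H, 1), (H, H), ..., (1, H).
    log_dphi:     logs of the activation derivatives of each hidden layer,
                  each of shape (H,).
    """
    # First layer: log(diag(phi') @ W) = log(phi')[:, None] + log(W).
    acc = log_W_blocks[0] + log_dphi[0][:, None]
    for log_W, log_d in zip(log_W_blocks[1:-1], log_dphi[1:]):
        acc = log_matmul(log_W + log_d[:, None], acc)
    # Last layer has no activation in this sketch.
    return log_matmul(log_W_blocks[-1], acc)[0, 0]
```

Summing `log_diag_jacobian` over the flow dimensions gives the log determinant of the full lower-triangular Jacobian, without ever forming the product of small positive numbers directly.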
3.2 Hyper-conditioning and Training
The key point from Equations 5 to 13 and Figure 2 is that HCNAF is constraint-free when it comes to the design of the hypernetwork. The flow requirements from Sections 3.1.1 and 3.1.2 do not apply to the hypernetwork. This enables the hypernetwork to grow arbitrarily large and thus to scale up with respect to the size of the conditions. The hypernetwork can therefore be an arbitrarily complex neural network with respect to the conditions $C$.
We seek to learn the target distribution $p(X \mid C)$ using HCNAF by minimizing the negative log-likelihood (NLL) of the model distribution $p_{model}(X \mid C)$, i.e. the cross entropy between the two distributions, as in:

$$L := -\mathbb{E}_{X \sim p(X \mid C)} \big[ \log p_{model}(X \mid C) \big] = H(p, p_{model}) \tag{14}$$

Note that minimizing the NLL is equivalent to minimizing the (forward) KL divergence between the data and the model distributions $D_{KL}(p \,\|\, p_{model})$, as $H(p, p_{model}) = H(p) + D_{KL}(p \,\|\, p_{model})$ where $H(p)$ is bounded.
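The relation between the NLL, the cross entropy, and the KL divergence can be checked numerically on a toy case. The sketch below is an illustration only: it assumes isotropic Gaussians for both the target and the model (neither is an HCNAF model), and the function names are hypothetical.

```python
import numpy as np

def gaussian_logpdf(x, mean, sigma):
    """Log-density of an isotropic Gaussian N(mean, sigma^2 I); x has shape (n, dim)."""
    dim = x.shape[-1]
    sq = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * sq / sigma**2 - 0.5 * dim * np.log(2.0 * np.pi * sigma**2)

def monte_carlo_nll(samples, model_logpdf):
    """Estimate the cross entropy H(p, p_model) from samples drawn from p."""
    return -np.mean(model_logpdf(samples))

def gaussian_entropy(sigma, dim):
    """Differential entropy (nats) of N(mean, sigma^2 I)."""
    return 0.5 * dim * np.log(2.0 * np.pi * np.e * sigma**2)

def gaussian_kl_same_sigma(mean_p, mean_q, sigma):
    """KL(p || q) for two isotropic Gaussians sharing the same sigma."""
    diff = np.asarray(mean_p) - np.asarray(mean_q)
    return float(np.sum(diff**2) / (2.0 * sigma**2))
```

Drawing samples from the target and evaluating `monte_carlo_nll` under the shifted model recovers `gaussian_entropy + gaussian_kl_same_sigma` up to Monte Carlo error, which is why driving the NLL down drives the KL divergence down.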
4 Probabilistic Occupancy Map Forecasting
In Section 3, we showed that HCNAF can accommodate high-dimensional condition inputs for conditional probability density estimation problems. We leverage this capability to tackle probabilistic occupancy map (POM) forecasting for actors in self-driving tasks. This problem operates on over one million dimensions, as spatio-temporal multi-actor images are part of the conditions. This section describes the design of HCNAF to support POM forecasting. We formulate the problem as follows:
$$p(X_t \mid C), \qquad C := \big\{ X^{REF}_{-\tau:0},\; X^{A_{1:N}}_{-\tau:0},\; \Omega \big\} \tag{15}$$

where $X^{REF}_{-\tau:0}$ denotes the past states of the reference car, each with the dimension of the observed state, over a time span $\tau$. $X^{A_{1:N}}_{-\tau:0}$ denotes the past states of all $N$ neighboring actors over the same time span. $\Omega$ encodes contextual static and dynamic scene information extracted from map priors (e.g. lanes and stop signs) and/or perception modules (e.g. bounding boxes for actors) onto a rasterized multi-channel image. However comprehensive, the list of conditions in $C$ is not meant to be exhaustive; as additional cues are introduced to better define actors or enhance context, those are appended to the conditions. We denote $X_t := [x_t, y_t]$ as the location of an actor over the 2D bev map at time $t$, by adapting our conditional AF to operate on 2 dimensions. As a result, the joint probability $p(x_t, y_t \mid C)$ is obtained via the autoregressive factorization $p(x_t \mid C)\, p(y_t \mid x_t, C)$.
It is possible to compute $p(X_{1:T} \mid C)$, a joint probability over multiple time steps, via Equation 4, but we instead chose to compute $p(X_t \mid C)$ (i.e. a marginal probability distribution over a single time step) for the following reasons:

Computing $p(X_{1:T} \mid C)$ implies the computation of $p(X_t \mid X_{1:t-1}, C)$ autoregressively. While this formulation reasons about the temporal dependencies between the history and the future, it is forced to make predictions on $X_t$ dependent on the unobserved variables $X_{1:t-1}$. The uncertainties of the unobserved variables have the potential to push the forecast in the wrong direction.

The computation of $p(X_t \mid C)$ from the joint probability is intractable in nature, since it requires a marginalization over all variables $X_{1:t-1}$, which is practically impossible to integrate over.
Note that we instead obtain POMs over all time steps by incorporating a time variable $t$ as part of the conditions.
In addition to POMs, HCNAF can be used to sample trajectories using the inverse transformation $f^{-1}: Z \rightarrow X$. The exact probabilities of the generated trajectories can be computed via Equation 1. Even though HCNAF has the capability of producing trajectory samples, we focused on POMs throughout the paper.
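Once a density over actor locations is available, a POM can be rasterized without sampling trajectories. The sketch below is a minimal illustration under stated assumptions: `toy_log_density` is a hypothetical stand-in for the conditional log-density of a trained model (an isotropic 2D Gaussian here), and cell probabilities are approximated with the midpoint rule.

```python
import numpy as np

def occupancy_map(log_density, x_edges, y_edges):
    """Approximate per-cell occupancy probabilities on a 2D grid.

    The density is evaluated at the cell centres and multiplied by the
    cell area (midpoint rule). Returns an array of shape (ny, nx).
    """
    xc = 0.5 * (x_edges[:-1] + x_edges[1:])
    yc = 0.5 * (y_edges[:-1] + y_edges[1:])
    gx, gy = np.meshgrid(xc, yc, indexing="xy")
    pts = np.stack([gx.ravel(), gy.ravel()], axis=-1)
    area = np.outer(np.diff(y_edges), np.diff(x_edges))
    return np.exp(log_density(pts)).reshape(gx.shape) * area

def toy_log_density(pts, mean=(1.0, -0.5), sigma=0.5):
    """Hypothetical stand-in for a learned log-density over (x, y) locations."""
    sq = np.sum((pts - np.asarray(mean)) ** 2, axis=-1)
    return -sq / (2.0 * sigma**2) - np.log(2.0 * np.pi * sigma**2)
```

For a forecast at a different time step, the same grid would be evaluated again with the time variable changed in the conditions; each map sums to approximately one when the grid covers the region where the density has mass.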
5 Experiments
In this section, we demonstrate the effectiveness of HCNAF on density estimation tasks for Toy Gaussians. We then verify the scalability of HCNAF by tackling more challenging POM forecasting problems for autonomous driving in simulated urban scenarios using two datasets: 1) Virtual Simulator: a dataset from simulated driving environments with diverse road geometries, including multiple road actors designed to mimic human drivers. The scenarios are based on real driving logs collected over North American cities. 2) PRECOG-Carla: a dataset created using the open-source Carla simulator for autonomous driving research. It was made publicly available in [22].
5.1 Toy Gaussians
We conducted two experiments to demonstrate the performance of HCNAF for density estimation. The first was also used in the NAF paper [11], and aims to show the model’s learning ability for three distinct probability distributions over a 2D grid map. The three nonlinear distributions are distinct groups of Gaussians over the grid. In the second test, we demonstrate how HCNAF can generalize its outputs to previously unseen conditions.
5.1.1 Toy Gaussians: Experiment 1
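Table 1: NLL for the three toy Gaussian target distributions.

| Target | AAF | NAF | HCNAF (ours) |
|---|---|---|---|
| 2 by 2 | 6.056 | 3.775 | 3.896 |
| 5 by 5 | 5.289 | 3.865 | 3.966 |
| 10 by 10 | 5.087 | 4.176 | 4.278 |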
Results from Figure 3 and Table 1 show that HCNAF is able to reproduce the three nonlinear target distributions and to achieve results comparable to those of NAF, albeit with a small increase in NLL. We emphasise that HCNAF uses a single model (with a 1-dimensional condition variable) to produce the three distinct pdfs, whereas AAF (Affine AF) and NAF used three distinctly trained models. The autoregressive conditioning applied in HCNAF is the same as for the other two models. The hypernetwork of HCNAF uses a condition whose values each represent one class among the 2-by-2, 5-by-5, and 10-by-10 Gaussians.
5.1.2 Toy Gaussians: Experiment 2
From the density estimation experiment shown in Figure 4, we observed that HCNAF is capable of generalization over unseen conditions, i.e. values in the condition terms that were intentionally omitted during training. The experiment was designed to verify that the model would interpolate and/or extrapolate probability distributions beyond the set of conditions it was trained with, and to show how effective HCNAF is at reproducing the target distribution for both the training conditions and the unseen conditions. As before, we trained a single HCNAF model to learn 5 distinct pdfs, where each pdf represents a Gaussian distribution with its mean (the center of the 2D Gaussian) used as the condition and with an isotropic standard deviation of 0.5.

Table 2: Differential entropy of the target, cross entropy, and KL divergence for the toy Gaussian experiment 2.

| | Training conditions | Unseen conditions |
|---|---|---|
| $H(p)$ | 1.452 | 1.452 |
| $H(p, p_{model})$ | 1.489 | 1.552 |
| $D_{KL}(p \,\|\, p_{model})$ | 0.037 | 0.100 |
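For this task, the objective function is the maximization of the log-likelihood, which is equivalent to the minimization of the KL divergence $D_{KL}(p \,\|\, p_{model})$, where the condition is uniformly sampled from the set of training conditions. Table 2 provides quantitative results for the cross entropy $H(p, p_{model})$ and the KL divergence $D_{KL}(p \,\|\, p_{model})$. Note that $H(p, p_{model})$ is lower-bounded by $H(p)$, since $H(p, p_{model}) = H(p) + D_{KL}(p \,\|\, p_{model})$ and $D_{KL}(p \,\|\, p_{model}) \geq 0$. The differential entropy of an isotropic bivariate Gaussian distribution with standard deviation $\sigma$ is computed using $H(p) = \frac{1}{2} \ln\big( (2 \pi e)^2 \sigma^4 \big) = \ln(2 \pi e \sigma^2)$, which gives 1.452 for $\sigma = 0.5$. The results show that HCNAF is able to generalize its predictions to unseen conditions, as shown by the small deviation of $H(p, p_{model})$ from its lower bound $H(p)$.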
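The arithmetic behind Table 2 can be reproduced in a few lines. This is a small worked check rather than part of the experiment: it uses the closed-form entropy of an isotropic Gaussian and the two cross-entropy values reported in the table, and the variable names are illustrative.

```python
import numpy as np

def isotropic_gaussian_entropy(sigma, dim=2):
    """Differential entropy (nats) of N(mu, sigma^2 I) in `dim` dimensions."""
    return 0.5 * dim * np.log(2.0 * np.pi * np.e * sigma**2)

# Entropy of the target: isotropic bivariate Gaussian with sigma = 0.5.
H_p = isotropic_gaussian_entropy(0.5)

# The two cross-entropy values reported in Table 2.
cross_entropies = [1.489, 1.552]

# KL divergence = cross entropy - entropy of the target.
kl_divergences = [ce - H_p for ce in cross_entropies]
```

Rounded to three decimals, `H_p` is 1.452 and the two KL divergences are 0.037 and 0.100, matching the table.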
5.2 Forecasting POM for Autonomous Driving
In this section, we show how HCNAF can be scaled up to tackle the POM forecasting problems for autonomous driving. The condition $C$ for the POM prediction problems is significantly larger than that of the experiments in Section 5.1, as shown in Equation 15. $C$ now includes information extracted from various sensors (lidar, camera), maps (lanes, stop signs), and perception detections (expressed as bounding boxes for actors), whose dimensions typically sum up to millions. As per its design, HCNAF is unaffected by the increase in condition dimensions, as the hypernetwork can easily be extended to support any new parameters (memory allowing).
Figure 5 depicts the customized hypernetwork for POM forecasting for autonomous vehicles. The hypernetwork takes perception inputs as the condition $C$, and outputs a set of network parameters W and B for HCNAF’s subsequent conditional AF. The inputs come from various sensors (lidar or camera) through a perception module and also from prior map information. Specifically, $C$ is formed with 1) the bev images, which include lanes, stop signs, and actors depicted as bounding boxes in a 2D grid map (see Figure 6), and 2) the pose of actors in actor-centric pixel coordinates. The perception module used reflects other standard approaches for processing multi-sensor data, such as [10]. The hypernetwork consists of three main components: 1) LSTM modules, 2) an encoder module, and 3) a time module. The outputs of the three modules are concatenated and fed into an MLP, which outputs W and B, as shown in Figure 5.
The LSTM module takes the past states of an actor in the scene to encode temporal dependencies and trends among the state parameters. One LSTM module is used per actor, including the reference car for which we produce the POM forecasts. The resulting outputs are later concatenated with those of the other modules.
The encoder module takes in the bev images, denoted as $\Omega$. The role of this module is to transform the scene contexts into a one-dimensional tensor that is concatenated with the other inputs used to produce the parameters of our conditional AF. We use residual connections to enhance the performance of our encoder, as in [10]. Since our hypernetwork works with both Cartesian (x, y) space and pixel (image) space, we use coordinate convolution (coordconv) layers as in [18] to strengthen the association between the two. Overall, the encoder network consists of 4 encoder blocks, and each encoder block consists of 5 coordconv layers with residual connections, max-pooling layers, and batch-normalization layers. The resulting output is a one-dimensional feature vector.
Lastly, the time layer adds the forecasting time, i.e. the time span of the future away from the reference (or present) time. In order to increase the contribution of the time condition, we apply an MLP which outputs a hidden variable for the time condition.
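The coordinate channels that a coordconv layer appends before its convolution can be sketched as follows. This is a minimal NumPy illustration of the idea in [18], assuming coordinates normalised to [-1, 1] and a channels-first layout; `add_coord_channels` is a hypothetical helper, and the convolution itself is omitted.

```python
import numpy as np

def add_coord_channels(images):
    """Append normalised x and y coordinate channels to a batch of images.

    images: (batch, channels, height, width)
    returns: (batch, channels + 2, height, width), where the two extra
             channels hold the x and y position of every pixel in [-1, 1].
    """
    b, _, h, w = images.shape
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    coords = np.broadcast_to(np.stack([xx, yy])[None], (b, 2, h, w))
    return np.concatenate([images, coords.astype(images.dtype)], axis=1)
```

A regular convolution applied to the augmented tensor can then relate image features to their pixel locations, which is the association between image space and Cartesian space that the encoder relies on.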
Forecasting POM with a Virtual Simulator Dataset
HCNAF with the custom hypernetwork was trained on an internal dataset that we call Virtual Simulator. The dataset comprises multi-channel bev images, whose channels may include all, or a subset, of the following: stop signs, street lanes, reference car locations, and a number of actors. We also add the history of actor states in pixel coordinates, as discussed in the previous subsection. For each of the vehicles/actors, we apply a coordinate transformation to obtain actor-centric labels and images for training. The vehicle dataset includes parked vehicles and non-compliant road actors to introduce common and rare events (e.g. sudden lane changes or sudden stopping in the middle of the road). We produce POMs for all visible vehicles, including parked vehicles and non-compliant actors, even if those are not labeled as such. Note that the dataset was created out of several million examples, cut into snippets of 5 seconds in duration. Figure 6 depicts POM forecasts for three scenarios sampled from the test set. An ablation study on the Virtual Simulator dataset shows the impact of different hypernetwork inputs on the POM forecasting accuracy and is presented in the supplementary materials.
As discussed in Section 4, HCNAF produces not only POMs, but also trajectory samples via the inverse transformation of the conditional AF. As we advocate the POM approach, we do not elaborate further on the trajectory-based approach using HCNAF.
Forecasting POM with PRECOG-Carla Dataset
We trained HCNAF on the PRECOG-Carla Town01-train dataset and validated the progress over the publicly available Town01-val dataset [22]. The hypernetwork used for this experiment was identical to the one used for the Virtual Simulator dataset, except that we substituted the bev images with two of the raw overhead lidar channels: the above-ground and ground-level inputs. The encoder module input layer was updated to process the lidar image size (200x200) of the PRECOG-Carla dataset. In summary, the conditions included the lidar data and the history of the reference car and other actors.
To evaluate the performance of the trained models, [22] used the extra nats metric $\hat{e}$ for the likelihood estimation instead of NLL. $\hat{e}$ is a normalized, bounded likelihood metric defined as $\hat{e} := \big[ H(p', p_{model}) - H(\eta) \big] / (T \cdot A \cdot D) \geq 0$, where $H(p', p_{model})$, $T$, $A$, and $D$ represent, respectively, the cross-entropy between $p'$ (the data distribution perturbed with an isotropic Gaussian noise $\eta$) and $p_{model}$ [22], the prediction horizon, the number of actors, and the dimension of the actor position. We used the same $\eta$ as cited, whose differential entropy is obtained analytically. We computed $\hat{e}$ over all time-steps available in the dataset. The results are presented in Table 3.
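Table 3: Extra nats on the PRECOG-Carla dataset.

| Method | Test extra nats (lower is better) |
|---|---|
| PRECOG-ESP, no lidar | 0.699 |
| PRECOG-ESP | 0.634 |
| HCNAF, no lidar (ours) | 0.184 |
| HCNAF (ours) | 0.114 (5+ times lower) |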
We believe that HCNAF performed better than PRECOG-ESP, which is a state-of-the-art prediction model in autonomous driving, by taking advantage of the nonlinear flow transformation and by having condition terms affect the hidden states of all layers of HCNAF’s neural-net based AF. Note that PRECOG utilizes bijective transformations that are rooted in affine autoregressive flow, similar to cMAF (see Equation 2). We also believe that HCNAF’s generalization capability is a contributing factor that explains how HCNAF is able to estimate probability densities conditioned on previously unseen contexts.
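The extra nats metric can be sketched as below. This is a hedged illustration of the normalisation described in the text, not the evaluation code of [22]: the noise standard deviation, the prediction horizon, the number of actors, and the position dimension are all passed in as parameters, and the function names are hypothetical.

```python
import numpy as np

def isotropic_gaussian_entropy(sigma, dim):
    """Differential entropy (nats) of N(0, sigma^2 I) in `dim` dimensions."""
    return 0.5 * dim * np.log(2.0 * np.pi * np.e * sigma**2)

def extra_nats(cross_entropy, noise_sigma, T, A, D):
    """Normalised gap between the cross entropy and the noise entropy.

    cross_entropy: H(p', p_model) in nats, measured on noise-perturbed data
    noise_sigma:   standard deviation of the isotropic perturbation noise
    T, A, D:       prediction horizon, number of actors, position dimension
    """
    n = T * A * D
    return (cross_entropy - isotropic_gaussian_entropy(noise_sigma, n)) / n
```

A model whose cross entropy equals the entropy of the perturbation noise scores zero extra nats, which is the lower bound of the metric.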
6 Conclusion
We present HCNAF, a novel universal distribution approximator tailored to model conditional probability density functions. HCNAF extends neural autoregressive flow [11] to take arbitrarily large conditions, not limited to autoregressive conditions, via a hypernetwork which determines the network parameters of HCNAF’s AF. By keeping the hypernetwork constraint-free, HCNAF enables it to grow arbitrarily large and thus to scale up with respect to the size of the non-autoregressive conditions. We demonstrate its effectiveness and capability to generalize over unseen conditions on density estimation tasks. We also scaled HCNAF’s hypernetwork to handle larger conditional terms as part of a prediction problem in autonomous driving.
References
[1] (2018) Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649.
[2] (2018) IntentNet: learning to predict intention from raw sensor data. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.), Proceedings of Machine Learning Research, Vol. 87, pp. 947–956.
[3] (2019) Traphic: trajectory prediction in dense and heterogeneous traffic using weighted interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8483–8492.
[4] (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797.
[5] (2019) Block neural autoregressive flow. arXiv preprint arXiv:1904.04676.
[6] (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
[7] (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160.
[8] (2018) Social gan: socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2255–2264.
[9] (2016) Hypernetworks. arXiv preprint arXiv:1609.09106.
[10] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
[11] (2018) Neural autoregressive flows. In International Conference on Machine Learning, pp. 2083–2092.
[12] (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.
[13] (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[14] (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751.
[15] (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751.
[16] (2017) Desire: distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 336–345.
[17] (2019) Interaction-aware multi-agent tracking and probabilistic behavior prediction via adversarial learning. arXiv preprint arXiv:1904.02390.
[18] (2018) An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems, pp. 9605–9616.
[19] (2018) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[20] (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[21] (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347.
[22] (2019) PRECOG: prediction conditioned on goals in visual multi-agent settings. arXiv preprint arXiv:1905.01296.
[23] (2019) Sophie: an attentive gan for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1349–1358.
[24] (2015) Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491.
[25] (2019) Multiple futures prediction. arXiv preprint arXiv:1911.00997.
[26] (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315.
A Ablation Study on the Virtual Simulator Dataset
In this section, we present the results of an ablation study conducted on the test set (10% of the dataset) of the Virtual Simulator Dataset to investigate the impact of different hypernetwork inputs on the POM forecasting performance. As mentioned in Sections 4 and 5.2, each 5-second snippet is divided into a 1-second long history and a 4-second long prediction horizon. All the inputs forming the conditions used in the ablation study are extracted from the history portion, and are listed below.

historical states of the reference car in pixel coordinates,

historical states of the actors excluding the reference car in pixel coordinates, for up to the 3 closest actors,

stop-sign locations in pixel coordinates, and

bev images, which may include all, or a subset, of the following channels: stop signs, street lanes, and reference car & actor images over all time-steps $t \in [-1, 0]$.
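Table S1: Ablation study on the Virtual Simulator Dataset. The evaluation metric is negative log-likelihood. Lower values are better.

| NLL | Model 1 | Model 2 | Model 3 (best model) | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| t = 2s | 8.519 | 8.015 | 7.905 | 8.238 | 8.943 | 8.507 |
| t = 4s | 6.493 | 6.299 | 6.076 | 6.432 | 7.075 | 6.839 |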
As presented in Table S1, we trained 6 distinct models to output for . The six models can be grouped into two different sets depending on the that was used. The first group is the models that do not utilize any bev map information, therefore . The second group leverages all bev images . Each group can be divided further, depending on whether a model uses and . The HCNAF model which takes all bev images from the perception model as the conditions (see Table 1) excluding the historical states of the actors and stopsigns is denoted by the term best model, as it reported the lowest NLL. We use to represent a model that takes as the conditions.
Note that the hypernetwork depicted in Figure 5 is used for the training and evaluation, but the components of the hypernetwork change depending on the conditions. We also stress that the two modules of HCNAF (the hypernetwork and the conditional AF) were trained jointly. Since the hypernetwork is a regular neural network, its parameters are updated via backpropagation on the loss function.
As shown in Table S1, the second group () performs better than the first group (). Interestingly, we observe that the model performs better than . We suspect that this is due to using imperfect perception information. That is, not all the actors in the scene were detected and some actors are only partially detected; they appeared and disappeared over the time span of 1second long history. The presence of noncompliant, or abnormal actors may also be a contributing factor. When comparing and we see that the historical information of the surrounding actors did not improve performance. In fact, the model that only utilizes at time performs better than the one using across all timesteps. Finally, having the stopsign locations as part of the conditions is helping, as many snippets covered intersection cases. When comparing and , we observe that adding the states of actors and stopsigns in pixel coordinates to the conditions did not improve the performance of the network. We suspect that it is mainly due to the same reason that performs better than .
B Implementation Details on Toy Gaussian Experiments
For toy Gaussian experiment 1, we used the same number of hidden layers (2), hidden units per hidden layer (64), and batch size (64) across all autoregressive flow models: AAF, NAF, and HCNAF. For NAF, we utilized the conditioner (transformer) with 1 hidden layer and 16 sigmoid units, as suggested in [11]. For HCNAF, we modeled the hypernetwork with two multilayer perceptrons (MLPs), each taking a condition as input, which output W and B. Each MLP consists of 1 hidden layer with a ReLU activation function. All other parameters were set identically, including those for the Adam optimizer (the learning rate decays by a factor of 0.5 every 2,000 iterations with no improvement in validation samples). The NLL values in Table 1 were computed using 10,000 samples.
For toy Gaussian experiment 2, we used 3 hidden layers, 200 hidden units per hidden layer, and a batch size of 4. We modeled the hypernetwork the same way as in toy Gaussian experiment 1. The NLL values in Table 2 were computed using 10,000 test samples from the target conditional distributions.
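The learning-rate rule described for the Adam optimizer can be sketched as follows. This is a minimal illustration, not the authors' training code: the class name, the `initial_lr` argument, and the assumption that the validation loss is checked once per iteration are all hypothetical, and the initial learning rate is left to the caller.

```python
class PlateauHalvingSchedule:
    """Multiply the learning rate by `factor` after `patience` consecutive
    iterations without an improvement in validation loss (an assumed reading
    of the rule stated in the text)."""

    def __init__(self, initial_lr, factor=0.5, patience=2000):
        self.lr = initial_lr            # hypothetical starting value, set by the caller
        self.factor = factor            # decay factor (0.5 in the text)
        self.patience = patience        # iterations without improvement (2,000 in the text)
        self.best_loss = float("inf")
        self.stale_iterations = 0

    def step(self, validation_loss):
        """Record one validation loss and return the learning rate to use next."""
        if validation_loss < self.best_loss:
            # Improvement: remember it and reset the counter.
            self.best_loss = validation_loss
            self.stale_iterations = 0
        else:
            self.stale_iterations += 1
            if self.stale_iterations >= self.patience:
                # Plateau reached: decay and start counting again.
                self.lr *= self.factor
                self.stale_iterations = 0
        return self.lr
```

In a training loop, `step` would be called after each validation check and its return value written into the optimizer's learning rate.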
C Number of Parameters in HCNAF
In this section we discuss the computational costs of HCNAF with different model choices. We denote and as the flow dimension (the number of autoregressive inputs) and the number of hidden layers in a conditional AF. In case of , there exists only 1 hidden layer between and . We denote as the number of hidden units in each layer per flow dimension of the conditional AF. Note that the outputs of the hypernetwork are W and B. The number of parameters for W of the conditional AF is and that for B is .
The number of parameters in HCNAF’s hypernetwork is largely dependent on the scale of the hypernetwork’s neural network and is independent of the conditional AF except for the last layer of the hypernetwork as it is connected to W and B. The term represents the total number of parameters in the hypernetwork up to its th layer, where denotes the number of layers in the hypernetwork. is the number of hidden units in the th (the last) layer of the hypernetwork. Finally, the number of parameters for the hypernetwork is given by .
The total number of parameters in HCNAF is therefore a summation of , , and . The dimension grows quadratically with the dimension of flow , as well as for . The key to minimizing the number of parameters is to keep the dimension of the last layer of the hypernetwork low. That way, the layers of the hypernetwork other than the last one are decoupled from the size of the conditional AF. This allows the hypernetwork to be large, as shown in the POM forecasting problem, where the hypernetwork takes conditions with a few million dimensions.
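The scaling argument above can be illustrated with a rough count. The sketch below is not the paper's formula: it assumes a dense conditional AF whose hidden layers each have (flow dimension × hidden units per dimension) units, ignores any autoregressive masking, and assumes the hypernetwork's last layer is a dense layer (optionally with biases) that emits every entry of W and B. All function and argument names are hypothetical.

```python
def conditional_af_param_counts(flow_dim, num_hidden_layers, hidden_per_dim):
    """Return (n_w, n_b): dense weight and bias counts of the conditional AF.

    Assumed layer widths: flow_dim -> flow_dim * hidden_per_dim (repeated) -> flow_dim.
    """
    hidden_width = flow_dim * hidden_per_dim
    widths = [flow_dim] + [hidden_width] * num_hidden_layers + [flow_dim]
    n_w = sum(a * b for a, b in zip(widths[:-1], widths[1:]))  # one matrix per layer
    n_b = sum(widths[1:])                                      # one bias per output unit
    return n_w, n_b


def hypernetwork_last_layer_params(last_hidden_units, n_w, n_b, with_bias=True):
    """Parameters of a dense last layer that outputs all of W and B (assumption)."""
    outputs = n_w + n_b
    return last_hidden_units * outputs + (outputs if with_bias else 0)
```

Under these assumptions, a 2-dimensional flow with 1 hidden layer and 64 hidden units per dimension needs 512 weights and 130 biases, and doubling the flow dimension to 4 quadruples the weight count to 2,048. The last hypernetwork layer scales linearly with its own width, which is why keeping that width small keeps the total manageable.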
D Conditional Density Estimation on MNIST
The primary use of HCNAF is to model conditional probability distributions when the dimension of the conditions (i.e., the inputs to the hypernetwork of HCNAF) is large. For example, the POM forecasting task operates on high-dimensional conditions, with dimension on the order of 1 million, and works with small autoregressive inputs of dimension 2. Since the number of parameters of HCNAF's conditional AF grows quickly as the flow dimension increases (see Section C), and since the conditions greatly influence the hyperparameters of the conditional AF module (Equation 4), HCNAF is ill-suited for density estimation tasks in which the autoregressive dimension far exceeds the condition dimension. Nonetheless, we decided to run this experiment to verify that HCNAF would compare well with other recent models. Table S2 shows that HCNAF achieves state-of-the-art performance for conditional density estimation.
MNIST is an example where the dimension of the autoregressive variables (784) is large and much bigger than that of the conditions (1). MNIST images (size 28 by 28) belong to one of the 10 numeral digit classes. While the unconditional density estimation task on MNIST has been widely studied and reported for generative models, the conditional density estimation task has rarely been studied. One exception is the study of conditional density estimation tasks presented in [21]. In order to compare the performance of HCNAF on MNIST, we followed the experiment setup from [21]. It includes the dequantization of pixel values and the translation of pixel values to logit space. The objective is to maximize the joint probability over the 784 pixel variables conditioned on the digit class, as given in Equation (16).
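The preprocessing mentioned above (dequantization followed by a logit transform) can be sketched as below. This is an illustrative version of the commonly used recipe, not code from the paper or from [21]: the function name is hypothetical, and the default `lam = 1e-6` is an assumption, since the text does not state the value.

```python
import math
import random


def dequantize_to_logit(pixels, lam=1e-6, rng=random):
    """Map integer pixel values in {0, ..., 255} to logit space.

    `lam` keeps the argument of the logit away from 0 and 1 (assumed value).
    """
    logits = []
    for value in pixels:
        z = (value + rng.random()) / 256.0           # uniform dequantization into [0, 1)
        p = lam + (1.0 - 2.0 * lam) * z              # squeeze into [lam, 1 - lam)
        logits.append(math.log(p) - math.log1p(-p))  # logit(p) = log(p) - log(1 - p)
    return logits
```

Applied to a flattened 28 by 28 image, this yields the 784 real-valued inputs on which the flow is trained.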
Models           Conditional NLL   Bits Per Pixel
Gaussian         1344.7            1.97
MADE             1361.9            2.00
MADE MoG         1030.3            1.39
Real NVP (5)     1326.3            1.94
Real NVP (10)    1371.3            2.02
MAF (5)          1302.9            1.89
MAF (10)         1316.8            1.92
MAF MoG (5)      1092.3            1.51
HCNAF (ours)     975.9             1.29
For the evaluation, we computed the test log-likelihood on the joint probability, as suggested in [21]. That is, the joint probability is the class-conditional probability multiplied by the class prior, which is uniform (0.1) over the 10 distinct labels. Accordingly, the LL in logit space was converted to bits per pixel, as elaborated in [21].
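The two evaluation steps can be sketched as follows. The conversion is the standard change of variables for a logit transform of pixels scaled to [0, 256); it is a reconstruction under that assumption, not code from the paper or from [21]. The function names and the default `lam = 1e-6` are hypothetical.

```python
import math


def joint_log_likelihood(conditional_ll, num_classes=10):
    """log p(x, class) = log p(x | class) + log(1 / num_classes) for a uniform prior."""
    return conditional_ll + math.log(1.0 / num_classes)


def bits_per_pixel(log_lik, logits, lam=1e-6):
    """Convert a logit-space log-likelihood (nats) to bits per pixel in image space.

    `logits` are the logit-space pixel values of the same image.
    """
    dim = len(logits)
    jacobian_bits = 0.0
    for x in logits:
        # Numerically stable log(sigmoid(x)) and log(1 - sigmoid(x)).
        if x >= 0:
            log_sig = -math.log1p(math.exp(-x))
        else:
            log_sig = x - math.log1p(math.exp(x))
        log_one_minus_sig = log_sig - x
        jacobian_bits += (log_sig + log_one_minus_sig) / math.log(2)
    return (-log_lik / (dim * math.log(2))
            - math.log2(1.0 - 2.0 * lam)
            + 8.0
            + jacobian_bits / dim)
```

Because the Jacobian term depends only on the data, the gap between two models depends only on their log-likelihood gap. For the Gaussian and HCNAF rows, (1344.7 - 975.9) / (784 ln 2) is about 0.68 bits, which matches the difference between the reported 1.97 and 1.29.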
For the HCNAF presented in Table S2, we used and for the conditional AF module. For the hypernetwork, we used , for W, for B, and a 1-dimensional label as the condition .