HCNAF: Hyper-Conditioned Neural Autoregressive Flow and its Application for Probabilistic Occupancy Map Forecasting

12/17/2019 ∙ by Jean-Sebastien Valois, et al. ∙ University of Michigan ∙ Uber

We introduce Hyper-Conditioned Neural Autoregressive Flow (HCNAF), a powerful universal distribution approximator designed to model arbitrarily complex conditional probability density functions. HCNAF consists of a neural-net based conditional autoregressive flow (AF) and a hyper-network that can take large conditions in a non-autoregressive fashion and outputs the network parameters of the AF. Like other flow models, HCNAF performs exact likelihood inference. We demonstrate the effectiveness and attributes of HCNAF, including its generalization capability over unseen conditions, and show that HCNAF outperforms recent AF models in a conditional density estimation task for MNIST. We also show that HCNAF scales up to complex high-dimensional prediction problems of the magnitude of self-driving and that HCNAF yields state-of-the-art performance on a public self-driving dataset.




1 Introduction

The work presented in this paper is motivated by the prediction problem in the context of autonomous driving. Prediction methods transform the history of perception data up to the current time step into a representation of how the environment will evolve over a short time horizon. The process is fraught with challenges due to dynamic interactions and the myriad possible events that ensue.

Figure 1: HCNAF used for probabilistic occupancy map (POM) forecasting, demonstrating the network’s use of high-dimensional conditions ($C$). a) Inputs (conditions) are the spatio-temporal scene data. b) HCNAF consists of two neural-net based modules: a hyper-network $f_H$ and a conditional AF $f$. $f_H$ can take arbitrarily large inputs and produces the network parameters for $f$, which computes the conditional probability $p(X|C)$ exactly. c) Resulting POMs for agent vehicle centers at t=2 and t=4 secs.

Due to their predictive modeling abilities, deep learning models have emerged that leverage the power of RNNs for sequence predictions and CNNs to encode raw sensor data and prior scene features [19, 2, 25, 16, 22, 23, 8, 3, 17]. In this regard, advanced prediction models exhibit the following characteristics:

  1. probabilistic: probability distributions reflecting future state uncertainties,

  2. multimodal: reproducing the rich diversity of states,

  3. context driven: support for interactive, multi-actor and contextual reasoning,

  4. efficient: results are produced fast enough to match the sensing rates, and

  5. general: capable of reasoning about novel inputs in a stable and consistent way.

However, current prediction methods are limited in their expression of uncertainty, multimodality, and interaction, and/or rely on expensive sampling steps. We address those constraints using a novel method called Hyper-Conditioned Neural Autoregressive Flow (HCNAF). HCNAF performs exact likelihood inference by precisely computing the probability of arbitrarily complex target distributions $p(X|C)$, with or without conditions $C$. As an alternative to traditional trajectory predictions, we apply HCNAF to produce probabilistic occupancy maps (POMs) conditioned on the scene contexts (see Figure 1). By directly obtaining exact probabilities on the POM, HCNAF removes the need to sample trajectories from distributions.

We present results on self-driving scenarios, but we first report results from density estimation tasks to investigate HCNAF’s generalization capability over diverse conditions.

2 Background

Flow, or normalizing flow, is a type of deep generative model that aims to learn a data distribution via the principle of maximum likelihood [7], so as to generate new data and/or estimate the likelihood of a target distribution.

Flow-based models construct an invertible function $f(X) = Z$ between a latent variable $Z$ and a random variable $X$, which allows the computation of the exact likelihood of an unknown data distribution $p(X)$ using a known pdf $\pi(Z)$ (e.g. normal distribution), via the change of variable theorem:

$$p(X) = \pi\big(f(X)\big) \left| \det \frac{d f(X)}{d X} \right| \tag{1}$$


In addition, flow offers data generation capability by sampling latent variables $Z \sim \pi(Z)$ and passing them through $f^{-1}$. As the accuracy of the approximation increases, the modeled pdf $p_{model}(X)$ converges to the true $p(X)$ and the quality of the generated samples also improves.
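The change of variable rule can be illustrated with a minimal one-dimensional sketch. It assumes a hypothetical affine map with fixed `scale` and `shift` (not a learned flow and not the model of this paper) and a standard normal latent pdf; the function names are chosen here for illustration only.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of the known latent pdf, here N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_forward(x, scale=2.0, shift=1.0):
    """Hypothetical invertible map f: x -> z, returned with log|df/dx|."""
    z = (x - shift) / scale
    log_abs_det = -math.log(scale)
    return z, log_abs_det

def flow_inverse(z, scale=2.0, shift=1.0):
    """Inverse map f^{-1}: z -> x, used to generate data from latent samples."""
    return z * scale + shift

def flow_logpdf(x, scale=2.0, shift=1.0):
    """Exact log p(x) = log pi(f(x)) + log|df/dx| (change of variables)."""
    z, log_abs_det = flow_forward(x, scale, shift)
    return standard_normal_logpdf(z) + log_abs_det

def integrate_density(lo=-30.0, hi=30.0, n=6000):
    """Midpoint-rule check that the modeled density integrates to about 1."""
    dx = (hi - lo) / n
    return sum(math.exp(flow_logpdf(lo + (i + 0.5) * dx)) * dx for i in range(n))
```

Because the Jacobian term is included, the modeled density stays normalized, which is what makes the likelihood exact rather than a bound.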

In contrast to other classes of deep generative models (namely VAE[13] and GAN[6]), flow is an explicit density model and offers unique properties:

  1. Computation of an exact probability, which is essential in the POM forecasting task. VAE infers $p(X)$ using a computable term, the Evidence Lower BOund (ELBO). However, it is unclear how the ELBO can be used for tasks that require an exact probability computation for $p(X)$, and uncertain how well the ELBO actually approximates $p(X)$, as its upper bound is unknown. While GAN has proved its power in generating high-quality samples for image generation and translation tasks [12, 4], it is unclear how a likelihood estimate and/or probability for the generated samples can be obtained.

  2. The expressivity of flow-based models, which allows the models to capture complex data distributions. A recently published AF model called Neural Autoregressive Flow (NAF)[11] unified earlier AF models including [14, 21] by generalizing their affine transformations to arbitrarily complex non-linear monotonic transformations. Conversely, the default VAE uses unimodal Gaussians for the prior and the posterior distributions. In order to increase the expressivity of VAE, some have introduced more expressive priors [26] and posteriors [15, 1] that leveraged flow.

The class of invertible neural-net based autoregressive flows, including NAF and BNAF[5], is capable of approximating rich families of distributions, since it was shown to be a universal approximator for continuous pdfs.

However, NAF and BNAF do not handle external conditions (e.g. classes or categories in the context of GAN vs cGAN[20]). That is, those models are designed to compute $p(x_d | x_{1:d-1})$, conditioned on previous inputs $x_{1:d-1}$ autoregressively, to formulate $p(X) = \prod_{d} p(x_d | x_{1:d-1})$. This formulation is not suitable for taking arbitrary conditions other than the autoregressive ones. This limits the extension of NAF to applications that work with conditional probabilities $p(X|C)$, such as the POM forecasting problem.


[21] proposed MAF, which models affine flow transformations, and cMAF models, which additionally take external conditions $C$. As shown in Equation 2, the transformation between $z_d$ and $x_d$ is affine and the influence of $C$ over the transformation relies on $\mu_d$, $\sigma_d$, and stacking multiple flows. These may limit the contributions of $C$ to the transformation. This explains the need for a conditional autoregressive flow that does not have such an expressivity bottleneck.
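For reference, an affine conditional autoregressive transformation of the kind described above can be written in the following standard form. This is a sketch in notation assumed here ($f_{\mu_d}$, $f_{\sigma_d}$ denote the conditioner outputs for dimension $d$), and it may differ in detail from the exact expression used by the authors:

```latex
\mu_d = f_{\mu_d}(x_{1:d-1}, C), \qquad
\sigma_d = f_{\sigma_d}(x_{1:d-1}, C) > 0, \qquad
z_d = \frac{x_d - \mu_d}{\sigma_d}, \qquad
\log\left|\frac{d z_d}{d x_d}\right| = -\log \sigma_d
```

In this form the condition $C$ can only shift and rescale $x_d$ within a single flow, which is the bottleneck the paragraph above refers to.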

Figure 2: HCNAF’s conditional AF model $f$ is a neural-net whose parameters are determined by a hyper-network $f_H(C)$. The dashed lines refer to connections from $f_H$ to the parameters of $f$. Red lines between adjacent hidden layers $h_d^{l_{k-1}}$, $h_d^{l_k}$ of the same flow indicate invertible connections (i.e. strictly positive $W_{dd}^{l_k}$). Green connections between adjacent flows have no such constraint.

3 Hyper-Conditioned Neural Autoregressive Flow (HCNAF)

We propose Hyper-Conditioned Neural Autoregressive Flow (HCNAF), a novel autoregressive flow where the transformation between $X$ and $Z$ is modeled using a non-linear neural network $f$ whose parameters are determined by arbitrarily complex conditions $C$ in a non-autoregressive fashion, via a separate neural network $f_H$. $f_H$ is designed to compute the parameters for $f$, and is thus classified as a hyper-network [9]. HCNAF models a conditional joint distribution $p(x_1, \dots, x_D | C)$ autoregressively on $x_{1:D}$, by factorizing it over $D$ conditional distributions $\prod_{d=1}^{D} p(x_d | x_{1:d-1}, C)$.

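NAF[11] and HCNAF both use neural networks, but they differ in probability modeling, conditioner network structure, and the flow transformation function, as elaborated below:

$$p(x_1, \dots, x_D) = \prod_{d=1}^{D} p(x_d \mid x_{1:d-1}), \qquad \theta_d = f_c(x_{1:d-1}), \qquad z_d = f(x_d;\ \theta_d) \tag{3}$$

$$p(x_1, \dots, x_D \mid C) = \prod_{d=1}^{D} p(x_d \mid x_{1:d-1}, C), \qquad \theta = \{W, B\} = f_H(C), \qquad z_d = f(x_d;\ x_{1:d-1}, \theta) \tag{4}$$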


In Equation 3, NAF uses a conditioner network $f_c$ to obtain the parameters $\theta_d$ for the transformation between $x_d$ and $z_d$, which is parameterized by autoregressive conditions $x_{1:d-1}$. In contrast, in Equation 4, HCNAF models the transformation to be parameterized on both $x_{1:d-1}$ and arbitrarily large external conditions $C$ in a non-autoregressive fashion via the hyper-network $f_H$. For probability modeling, the difference between the two is analogous to the difference between VAE[13] and conditional VAE[24], and that between GAN[6] and conditional GAN[20].

As illustrated in Figure 1, HCNAF consists of two main modules: 1) a neural-net based conditional autoregressive flow, and 2) a hyper-network which computes the parameters (i.e. weights and biases) of the flow in 1). The modules are detailed in the following sub-sections.

3.1 NN-based Conditional Autoregressive Flow

The proposed conditional AF is a bijective neural-network $f$, which models the transformation between random variables $X$ and latent variables $Z$. The network parameters $W$ and $B$ are determined by the hyper-network $f_H(C)$. The main difference between regular feed-forward neural networks and flow models is invertibility: $f$ must be invertible, whereas regular networks typically are not.

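The conditional AF is shown in Figure 2. In each dimension $d$ of the flow, the bijective transformation between $x_d$ and $z_d$ is modeled with a multi-layer perceptron (MLP) with $n$ hidden layers as follows:

$$x_d \leftrightarrow h_d^{l_1} \leftrightarrow h_d^{l_2} \leftrightarrow \dots \leftrightarrow h_d^{l_n} \leftrightarrow z_d \tag{5}$$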


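The connection between two adjacent hidden layers $h_d^{l_{k-1}}$ and $h_d^{l_k}$ is defined as:

$$h_d^{l_k} = \phi\Big( W_{dd}^{l_k}\, h_d^{l_{k-1}} + \sum_{r=1}^{d-1} W_{dr}^{l_k}\, h_r^{l_{k-1}} + B_d^{l_k} \Big) \tag{6}$$

where the subscript and superscript denote the flow number and layer number, respectively. Specifically, $h_d^{l_k}$ is hidden layer $l_k$ of the $d$-th flow. $W_{dr}^{l_k}$ and $B_d^{l_k}$ denote the weight matrix which defines the contributions to hidden layer $l_k$ of the $d$-th flow from hidden layer $l_{k-1}$ of the $r$-th flow, and the bias matrix which defines the contributions to hidden layer $l_k$ of the $d$-th flow. Finally, $\phi$ is an activation function.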

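The connections between $x_d$ and the first hidden layer, and between the last hidden layer and $z_d$ (with output-layer parameters indexed by $l_{n+1}$), are defined as:

$$h_d^{l_1} = \phi\Big( W_{dd}^{l_1}\, x_d + \sum_{r=1}^{d-1} W_{dr}^{l_1}\, x_r + B_d^{l_1} \Big), \qquad z_d = W_{dd}^{l_{n+1}}\, h_d^{l_n} + \sum_{r=1}^{d-1} W_{dr}^{l_{n+1}}\, h_r^{l_n} + B_d^{l_{n+1}} \tag{7}$$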


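$h^{l_k}$ are the hidden units at hidden layer $l_k$ across all flow dimensions $d = 1, \dots, D$ and are expressed as:

$$h^{l_k} = \phi\big( W^{l_k}\, h^{l_{k-1}} + B^{l_k} \big) \tag{8}$$

where $W^{l_k}$ and $B^{l_k}$ are the weight and bias matrices at hidden layer $l_k$ across all flow dimensions:

$$W^{l_k} = \begin{bmatrix} W_{11}^{l_k} & 0 & \cdots & 0 \\ W_{21}^{l_k} & W_{22}^{l_k} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ W_{D1}^{l_k} & W_{D2}^{l_k} & \cdots & W_{DD}^{l_k} \end{bmatrix}, \qquad B^{l_k} = \begin{bmatrix} B_1^{l_k} \\ B_2^{l_k} \\ \vdots \\ B_D^{l_k} \end{bmatrix} \tag{9}$$

Likewise, W and B denote the weight and bias matrices for all flow dimensions across all the layers. Specifically, $W := \{W^{l_1}, \dots, W^{l_{n+1}}\}$ and $B := \{B^{l_1}, \dots, B^{l_{n+1}}\}$.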

Finally, $Z = f(X)$ is obtained by computing the terms from Equation 8 for all the network layers, from the first to the last layer.

We designed HCNAF so that the hidden layer units $h_d^{l_k}$ are connected to the hidden units of previous layers $h_{1:d}^{l_{k-1}}$, inspired by BNAF, as opposed to taking $x_{1:d-1}$ as inputs to a separate hyper-network to produce the parameters $\theta_d$ over $d = 1, \dots, D$, as presented in NAF. This approach avoids running the hyper-network $D$ times, an expensive operation for large hyper-networks. By designing the hyper-network to output all of W and B at once, we reduce the computation load, while allowing the hidden states across all layers and all dimensions to contribute to the flow transformation, as $z_d$ is conditioned not only on $x_{1:d}$, but also on all the hidden layers of flows $1, \dots, d$.

All flow models must satisfy the following two properties: 1) monotonicity of $f$ to ensure its invertibility, and 2) tractable computation of the Jacobian matrix determinant $\left| \det \frac{dZ}{dX} \right|$.

3.1.1 Invertibility of the Autoregressive Flow

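The monotonicity requirement is equivalent to having $\frac{dz_d}{dx_d} > 0$, which is further factorized as:

$$\frac{dz_d}{dx_d} = \frac{dz_d}{dh_d^{l_n}} \cdot \frac{dh_d^{l_n}}{dh_d^{l_{n-1}}} \cdots \frac{dh_d^{l_2}}{dh_d^{l_1}} \cdot \frac{dh_d^{l_1}}{dx_d} \tag{10}$$

where $\frac{dh_d^{l_k}}{dh_d^{l_{k-1}}}$ is expressed as:

$$\frac{dh_d^{l_k}}{dh_d^{l_{k-1}}} = \frac{d\phi(A_d^{l_k})}{dA_d^{l_k}}\, W_{dd}^{l_k} \tag{11}$$

$A_d^{l_k}$ denotes the pre-activation of $h_d^{l_k}$. The invertibility is satisfied by choosing a strictly increasing activation function $\phi$ (e.g. tanh or sigmoid) and a strictly positive $W_{dd}^{l_k}$. $W_{dd}^{l_k}$ is made strictly positive by applying an element-wise exponential to all entries in $W_{dd}^{l_k}$ at the end of the hyper-network, inspired by [5]. Note that the operation is omitted for the non-diagonal blocks of $W^{l_k}$, i.e. $W_{dr}^{l_k}$ with $r \neq d$.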
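The invertibility argument can be illustrated with a minimal one-dimensional sketch. Everything in it is an assumption for illustration: `hyper_network` is a toy deterministic stand-in for a trained hyper-network, the hidden width of 4 is arbitrary, and the inverse is found by bisection, which is not a procedure described in this paper.

```python
import math

HIDDEN = 4  # hypothetical hidden width of the single-layer flow

def hyper_network(c):
    """Toy stand-in for f_H: maps a scalar condition c to flow parameters."""
    raw_w1 = [math.sin(c + j) for j in range(HIDDEN)]
    b1 = [math.cos(2.0 * c + j) for j in range(HIDDEN)]
    raw_w2 = [math.sin(3.0 * c - j) for j in range(HIDDEN)]
    b2 = 0.5 * c
    # Element-wise exponential makes the weights on the x -> z path
    # strictly positive; biases are left unconstrained.
    w1 = [math.exp(w) for w in raw_w1]
    w2 = [math.exp(w) for w in raw_w2]
    return w1, b1, w2, b2

def flow(x, params):
    """z = sum_j w2_j * tanh(w1_j * x + b1_j) + b2, strictly increasing in x."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def invert(z, params, lo=-50.0, hi=50.0, iters=200):
    """Bisection inverse; valid only because flow() is monotonic in x."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if flow(mid, params) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Positive weights combined with a strictly increasing activation give a strictly increasing map for every value of the condition, so the hyper-network itself needs no constraint.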

3.1.2 Tractable Computation of Jacobian Determinant

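The second requirement for flow models is to efficiently compute the Jacobian matrix determinant $\left| \det \frac{dZ}{dX} \right|$, where:

$$\frac{dZ}{dX} = \frac{dZ}{dh^{l_n}} \cdot \frac{dh^{l_n}}{dh^{l_{n-1}}} \cdots \frac{dh^{l_2}}{dh^{l_1}} \cdot \frac{dh^{l_1}}{dX} \tag{12}$$

Since we designed $W^{l_k}$ to be lower-triangular, the product of lower-triangular matrices, $\frac{dZ}{dX}$, is also lower-triangular. Its determinant is the product of its diagonal entries, so its log determinant is simply the sum of the logs of the diagonal entries: $\log \left| \det \frac{dZ}{dX} \right| = \sum_{d=1}^{D} \log \frac{dz_d}{dx_d}$, as our formulation states $\frac{dz_d}{dx_d} > 0$. Finally, $\frac{dz_d}{dx_d}$ is expressed via Equations 10 and 11:

$$\log \left| \det \frac{dZ}{dX} \right| = \sum_{d=1}^{D} \log \left( \frac{dz_d}{dh_d^{l_n}} \cdot \frac{d\phi(A_d^{l_n})}{dA_d^{l_n}} W_{dd}^{l_n} \cdots \frac{d\phi(A_d^{l_1})}{dA_d^{l_1}} W_{dd}^{l_1} \right) \tag{13}$$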


Equation 13 involves the multiplication of matrices of different sizes, and thus cannot be broken down into a regular sum of logs. To resolve this issue, we use log-sum-exp operations, as is common in the flow community (e.g. NAF[11] and BNAF[5]), for numerical stability and efficiency of the computation. This approach to computing the Jacobian determinant is similar to the one presented in BNAF, as our conditional AF resembles its flow model.
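As a rough illustration of why log-sum-exp is needed, consider a single flow dimension with one hidden layer and tanh activation. The derivative is then a sum over hidden units, so its log is a log-sum-exp of per-unit log terms. The sketch below is an assumption-laden toy (one hidden layer, scalar input, parameter names invented here), not the implementation used in the paper.

```python
import math

def log_sum_exp(values):
    """Stable log(sum(exp(v))) by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softplus(t):
    """Stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def log_tanh_grad(a):
    """log(1 - tanh(a)^2), computed stably as 2*(log 2 - a - softplus(-2a))."""
    return 2.0 * (math.log(2.0) - a - softplus(-2.0 * a))

def z_of_x(x, log_w1, b1, log_w2):
    """Toy flow: z = sum_j exp(log_w2_j) * tanh(exp(log_w1_j) * x + b1_j)."""
    return sum(math.exp(lw2) * math.tanh(math.exp(lw1) * x + b)
               for lw1, b, lw2 in zip(log_w1, b1, log_w2))

def log_dz_dx(x, log_w1, b1, log_w2):
    """log(dz/dx) accumulated entirely in log space."""
    terms = []
    for lw1, b, lw2 in zip(log_w1, b1, log_w2):
        a = math.exp(lw1) * x + b  # pre-activation of hidden unit j
        terms.append(lw2 + log_tanh_grad(a) + lw1)
    return log_sum_exp(terms)
```

For a $D$-dimensional flow the log determinant would be the sum of such per-dimension terms. Working in log space keeps the result finite even when tanh saturates and the naive product underflows to zero.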

As HCNAF is a member of the monotonic neural-net based autoregressive flow family like NAF and BNAF, we rely on the proofs presented in NAF and BNAF to claim that HCNAF is also a universal distribution approximator.

3.2 Hyper-conditioning and Training

The key point from Equations 5-13 and Figure 2 is that HCNAF is constraint-free when it comes to the design of the hyper-network. The flow requirements from Sections 3.1.1 and 3.1.2 do not apply to the hyper-network. This enables the hyper-network to grow arbitrarily large and thus to scale up with respect to the size of the conditions. The hyper-network can therefore be an arbitrarily complex neural network with respect to the conditions $C$.

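We seek to learn the target distribution $p(X|C)$ using HCNAF by minimizing the negative log-likelihood (NLL) of $p_{model}(X|C)$, i.e. the cross entropy between the two distributions, as in:

$$L := -\mathbb{E}_{X \sim p(X|C)} \big[ \log p_{model}(X|C) \big] = H(p, p_{model}) \tag{14}$$

Note that minimizing the NLL is equivalent to minimizing the (forward) KL divergence between the data and the model distributions, $D_{KL}(p \,\|\, p_{model})$, as $H(p, p_{model}) = H(p) + D_{KL}(p \,\|\, p_{model})$ where $H(p)$ is bounded.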
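The identity relating cross entropy, entropy and KL divergence can be checked numerically on a small discrete example. The two distributions below are arbitrary values chosen for illustration; the paper works with continuous densities, for which the same identity holds with differential entropies.

```python
import math

def entropy(p):
    """H(p) = -sum p log p, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def cross_entropy(p, q):
    """H(p, q) = -sum p log q, the expected NLL of model q under data p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_divergence(p, q):
    """Forward KL divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

data_dist = [0.5, 0.3, 0.2]   # hypothetical data distribution p
model_dist = [0.4, 0.4, 0.2]  # hypothetical model distribution q

gap = cross_entropy(data_dist, model_dist) - entropy(data_dist)
```

Since `entropy(data_dist)` does not depend on the model, lowering the NLL can only lower the KL term, and the NLL can never fall below the entropy of the data.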

4 Probabilistic Occupancy Map Forecasting

In Section 3, we showed that HCNAF can accommodate high-dimensional condition inputs for conditional probability density estimation problems. We leverage this capability to tackle probabilistic occupancy map (POM) forecasting for actors in self-driving tasks. This problem operates on over one million dimensions, as spatio-temporal multi-actor images are part of the conditions. This section describes the design of HCNAF to support POM forecasting. We formulate the problem as follows:

$$p(X_t^{A_i} \mid C), \qquad C := \{ X_{-\tau:0}^{A_i},\ X_{-\tau:0}^{A_{\forall j \neq i}},\ \Omega \} \tag{15}$$

where $X_{-\tau:0}^{A_i} \in \mathbb{R}^{\tau \times d}$ is the past states of actor $A_i$, with $d$ as the dimension of the observed state, over a time span $\tau$. $X_{-\tau:0}^{A_{\forall j \neq i}}$ denotes the past states for all neighboring actors over the same time span. $\Omega$ encodes contextual static and dynamic scene information extracted from map priors (e.g. lanes and stop signs) and/or perception modules (e.g. bounding boxes for actors) onto a rasterized multi-channel image. The list of conditions in $C$, however comprehensive, is not meant to be exhaustive; as additional cues are introduced to better define actors or enhance context, those are appended to the conditions. We denote $X_t := [x_t, y_t]$ as the location of an actor over the 2D bev map at time $t$, by adapting our conditional AF to operate on 2 dimensions. As a result, the joint probability $p(x_t, y_t \mid C)$ is obtained via autoregressive factorization given by $p(x_t \mid C)\, p(y_t \mid x_t, C)$.

It is possible to compute $p(X_{1:T} \mid C)$, a joint probability over multiple time steps, via Equation 4, but we instead chose to compute $p(X_t \mid C)$ (i.e. a marginal probability distribution over a single time step) for the following reasons:

  1. Computing $p(X_{1:T} \mid C)$ implies the computation of $p(X_t \mid X_{1:t-1}, C)$ autoregressively. While this formulation reasons about the temporal dependencies between the history and the future, it is forced to make predictions on $X_t$ dependent on the unobserved variables $X_{1:t-1}$. The uncertainties of the unobserved variables have the potential to push the forecast in the wrong direction.

  2. The computation of $p(X_t \mid C)$ from the joint is intractable in nature, since it requires a marginalization over all variables $X_{1:t-1}$, which are practically impossible to integrate over.

Note that we instead obtain POMs over all time by incorporating a time variable $t$ as part of the conditions $C$.

In addition to POMs, HCNAF can be used to sample trajectories using the inverse transformation $f^{-1}$. The exact probabilities of the generated trajectories can be computed via Equation 1. Even though HCNAF has the capability of producing trajectory samples, we focused on POMs throughout the paper.
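The idea of producing a POM by querying a conditional density directly on a grid, with the forecast time treated as one more condition, can be sketched as follows. The density used here is a hypothetical stand-in (an isotropic Gaussian that drifts along x and widens with time), not HCNAF; the grid size, extent, speed and spread are likewise arbitrary.

```python
import math

def toy_conditional_density(x, y, t, speed=1.0, sigma0=0.3):
    """Hypothetical stand-in for p(x, y | C, t); a trained flow would go here."""
    sigma = sigma0 * (1.0 + t)
    mean_x, mean_y = speed * t, 0.0
    norm = 1.0 / (2.0 * math.pi * sigma * sigma)
    sq_dist = (x - mean_x) ** 2 + (y - mean_y) ** 2
    return norm * math.exp(-sq_dist / (2.0 * sigma * sigma))

def occupancy_map(t, size=64, extent=10.0):
    """Evaluate the density at every cell center and scale by the cell area.

    Each entry approximates the probability that the actor center lies in
    that cell at forecast time t. No trajectory sampling is involved.
    """
    cell = 2.0 * extent / size
    grid = []
    for i in range(size):
        y = -extent + (i + 0.5) * cell
        row = []
        for j in range(size):
            x = -extent + (j + 0.5) * cell
            row.append(toy_conditional_density(x, y, t) * cell * cell)
        grid.append(row)
    return grid

pom_2s = occupancy_map(2.0)  # POM at t = 2 s
pom_4s = occupancy_map(4.0)  # POM at t = 4 s, obtained by changing one condition
```

Changing `t` is the only difference between the two maps, which mirrors the choice of conditioning on time rather than rolling the model forward step by step.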

5 Experiments

In this section, we demonstrate the effectiveness of HCNAF on density estimation tasks for Toy Gaussians. We then verify the scalability of HCNAF by tackling more challenging POM forecasting problems for autonomous driving in simulated urban scenarios using two datasets: 1) Virtual Simulator: a dataset from simulated driving environments with diverse road geometries, including multiple road actors designed to mimic human drivers. The scenarios are based on real driving logs collected over North-American cities. 2) PRECOG-Carla: a dataset created using the open-source Carla simulator for autonomous driving research. It was made publicly available in [22].


5.1 Toy Gaussians

We conducted two experiments to demonstrate the performance of HCNAF for density estimation. The first was also used in the NAF paper [11], and aims to show the model’s learning ability for three distinct probability distributions over a 2D grid map. The three non-linear distributions are distinct groups of Gaussians over the grid. In the second test, we demonstrate how HCNAF can generalize its outputs for previously unseen conditions.

5.1.1 Toy Gaussians: Experiment 1

Figure 3: Density estimation tasks using 3 Gaussian distributions. In order to reproduce the probability distributions, HCNAF uses a single model and 3 conditions, whereas NAF requires 3 different models, i.e. trained separately. In the figure, M: model and C: condition.

Distribution    AAF      NAF      HCNAF (ours)
2 by 2          6.056    3.775    3.896
5 by 5          5.289    3.865    3.966
10 by 10        5.087    4.176    4.278
Table 1: NLL for the experiment depicted in Figure 3. Lower values are better.

Results from Figure 3 and Table 1 show that HCNAF is able to reproduce the three nonlinear target distributions, and to achieve results comparable to those of NAF, albeit with a small increase in NLL. We emphasise that HCNAF uses a single model (with a 1-dimensional condition variable) to produce the three distinct pdfs, whereas AAF (Affine AF) and NAF used three distinctly trained models. The autoregressive conditioning applied in HCNAF is the same as for the other two models. The hyper-network of HCNAF takes a condition $C$ with three possible values, each representing one class: 2-by-2, 5-by-5, and 10-by-10 Gaussians.

5.1.2 Toy Gaussians: Experiment 2

From the density estimation experiment shown in Figure 4, we observed that HCNAF is capable of generalization over unseen conditions, i.e. values in the condition terms that were intentionally omitted during training. The experiment was designed to verify that the model would interpolate and/or extrapolate probability distributions beyond the set of conditions it was trained with, and to show how effective HCNAF is at reproducing the target distribution for both seen and unseen conditions. As before, we trained a single HCNAF model to learn 5 distinct pdfs, where each pdf represents a Gaussian distribution with its mean (center of the 2D Gaussian) used as the condition $C$ and with an isotropic standard deviation $\sigma$ of 0.5.

Figure 4: HCNAF model trained with 5 different discrete conditions, where each condition represents the mean of an isotropic bivariate Gaussian pdf. a) target distributions, b) HCNAF predictions for the training conditions, c) predictions on previously unseen conditions. (Footnote 2: The source code for the toy Gaussian experiments will be available online soon.)

                        a) Target    b) Seen conditions    c) Unseen conditions
Differential entropy    1.452        -                     -
Cross entropy           -            1.489                 1.552
KL divergence           -            0.037                 0.100
Table 2: Differences between the target and predicted distributions in terms of cross entropy and KL divergence for Figure 4.

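For this task, the objective function is the maximization of the log-likelihood, which is equivalent to the minimization of the KL divergence $D_{KL}(p \,\|\, p_{model})$, where the condition $C$ is uniformly sampled from the set of training conditions. Table 2 provides quantitative results in terms of the cross entropy $H(p, p_{model})$ and the KL divergence $D_{KL}(p \,\|\, p_{model})$. Note that $H(p, p_{model})$ is lower-bounded by $H(p)$ since $H(p, p_{model}) = H(p) + D_{KL}(p \,\|\, p_{model})$. The differential entropy $H(p)$ of an isotropic bivariate Gaussian distribution with standard deviation $\sigma$ is computed using $H(p) = \frac{1}{2} \ln\big( (2 \pi e)^2 \sigma^4 \big)$. The results show that HCNAF is able to generalize its predictions for unseen conditions, as shown by the small deviation of $H(p, p_{model})$ from its lower bound $H(p)$.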
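The entries of Table 2 can be cross-checked from the closed-form entropy of an isotropic Gaussian. The sketch below uses only the standard deviation of 0.5 stated in the text and the two cross-entropy values reported in the table; the function name is chosen here for illustration.

```python
import math

def isotropic_gaussian_entropy(sigma, dim=2):
    """Differential entropy (nats) of N(mu, sigma^2 I) in `dim` dimensions."""
    return 0.5 * dim * math.log(2.0 * math.pi * math.e * sigma ** 2)

target_entropy = isotropic_gaussian_entropy(0.5)

# Cross entropies reported in Table 2 (seen and unseen conditions).
kl_seen = 1.489 - target_entropy
kl_unseen = 1.552 - target_entropy
```

Rounded to three decimals, the entropy reproduces the 1.452 lower bound in the table, and the two differences reproduce the reported KL divergences of 0.037 and 0.100.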

5.2 Forecasting POM for Autonomous Driving

In this section, we show how HCNAF can be scaled up to tackle the POM forecasting problems for autonomous driving. The condition $C$ for the POM prediction problems is significantly larger than that of the experiments in Section 5.1, as shown in Equation 15. $C$ now includes information extracted from various sensors (lidar, camera), maps (lanes, stop-signs), and perception detections (expressed as bounding boxes for actors), whose total dimension typically reaches the millions. By design, the flow itself is unaffected by the increase in condition dimensions, as the hyper-network can easily be extended to support any new inputs, memory allowing.

Figure 5: Design for the Hyper-network of HCNAF used in the POM forecasting problem.

Figure 5 depicts the customized hyper-network for POM forecasting for autonomous vehicles. The hyper-network takes perception inputs as the condition $C$, and outputs a set of network parameters W and B for HCNAF’s subsequent conditional AF $f$. The inputs come from various sensors (lidar or camera) through a perception module and also from prior map information. Specifically, $C$ is formed with 1) the bev images which include lanes, stop-signs, and actors depicted as bounding boxes in a 2D grid map (see Figure 6), and 2) the pose of actors in actor-centric pixel coordinates. The perception module used reflects other standard approaches for processing multi-sensor data, such as [10]. The hyper-network consists of three main components: 1) LSTM modules, 2) an encoder module, and 3) a time module. The outputs of the three modules are concatenated and fed into an MLP, which outputs W and B, as shown in Figure 5.

The LSTM module takes the historical states of an actor in the scene to encode temporal dependencies and trends among the state parameters. Separate LSTM modules are used to model the actors and the reference car for which we produce the POM forecasts. The resulting outputs are the encoded histories of the reference car and of the other actors.

The encoder module takes in the bev images, denoted as $\Omega$. The role of this module is to transform the scene contexts into a one-dimensional tensor that is concatenated with other parameters of our conditional AF flow module. We use residual connections to enhance the performance of our encoder as in [10]. Since our hyper-network works with both Cartesian (x,y) space and pixel (image) space, we use coordinate convolution (coordconv) layers as in [18] to strengthen the association between the two. Overall, the encoder network consists of 4 encoder blocks, and each encoder block consists of 5 coordconv layers with residual connections, max-pooling layers, and batch-normalization layers. The resulting output is a one-dimensional encoding of the scene.


Lastly, the time layer adds the forecasting time $t$, i.e. the time span into the future measured from the reference (or present) time. In order to increase the contribution of the time condition, we apply an MLP which outputs a hidden variable for the time condition.

Forecasting POM with a Virtual Simulator Dataset

Figure 6: HCNAF for forecasting POMs on our Virtual Simulator dataset. Left: one-second history of actors (green) and ref. car (blue), with each actor labeled individually. Center and right: occupancy prediction for the actor centers at t = 2 and 4 secs., with actor ground truth overlaid. Note that actors may enter and exit the scene. In example 1, our forecasts captured the speed variations of one actor, the stop line deceleration and the multi-modal movements (left/right turns, straight) of a second, and the stop line pausing of a third. In example 2, HCNAF predicts actors coming to a stop, exiting the intersection in turn, and yielding to one another. Finally, example 3 shows that HCNAF predicted the speed variations of an actor along a stretch of road.

HCNAF with the custom hyper-network was trained on an internal dataset that we call Virtual Simulator. The dataset is comprised of multi-channel bev images, which may include all, or a subset, of the following channels: stop signs, street lanes, reference car locations, and a number of actors. We also add the history of actor states in pixel coordinates, as discussed in the previous sub-section. For each of the vehicles/actors, we apply a coordinate transformation to obtain actor-centric labels and images for training. The vehicle dataset includes parked vehicles and non-compliant road actors to introduce common and rare events (e.g. sudden lane changes or sudden stopping in the middle of the road). We produce POMs for all visible vehicles, including parked vehicles and non-compliant actors, even if those are not labeled as such. Note that the dataset was created out of several million examples, cut into snippets of 5 seconds in duration. Figure 6 depicts POM forecasts for three scenarios sampled from the test set. An ablation study on the Virtual Simulator dataset shows the impact of different hyper-network inputs on the POM forecasting accuracy and is presented in the supplementary materials.

As discussed in Section 4, HCNAF produces not only POMs, but also trajectory samples via the inverse transformation of the conditional AF, $f^{-1}$. As we advocate the POM approach, we do not elaborate further on the trajectory based approach using HCNAF.

Forecasting POM with PRECOG-Carla Dataset

We trained HCNAF on the PRECOG-Carla Town01-train dataset and validated the progress on the publicly available Town01-val dataset [22]. The hyper-network used for this experiment was identical to the one used for the Virtual Simulator dataset, except that we substituted the bev images with two of the raw overhead lidar channels: the above ground and ground level inputs. The encoder module input layer was updated to process the lidar image size (200x200) of the PRECOG-Carla dataset. In summary, $C$ included the lidar data, and the history of the reference car and other actors.

To evaluate the performance of the trained models, [22] used the extra nats metric $\hat{e}$ for the likelihood estimation instead of NLL. $\hat{e}$ is a normalized, bounded likelihood metric defined as $\hat{e} := \big[ H(p', p_{model}) - H(\eta) \big] / (T \cdot A \cdot D) \geq 0$, where $H(p', p_{model})$, $T$, $A$, and $D$ represent, respectively, the cross-entropy between $p'$ (perturbed with an isotropic Gaussian noise $\eta$) and $p_{model}$ [22], the prediction horizon, the number of actors, and the dimension of the actor position. We used the same $\eta$ as cited, whose differential entropy is analytically obtained using $H(\eta) = \frac{1}{2} \ln\big( (2 \pi e)^{TAD}\, |\Sigma| \big)$. We computed $\hat{e}$ over all time-steps available in the dataset. The results are presented in Table 3.

Method                    Test ($\hat{e}$): lower is better
PRECOG-ESP, no lidar      0.699
HCNAF, no lidar (ours)    0.184
HCNAF (ours)              0.114 (5+ times lower)
Table 3: PRECOG-CARLA Town01 Test, 1 agent, mean $\hat{e}$
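The extra nats metric of [22] described above can be sketched as a small helper. This is an illustrative reading of the definition, assuming the perturbation is a zero-mean isotropic Gaussian with standard deviation `sigma`; the function names and any numeric values passed in are hypothetical and are not the settings used in the experiments.

```python
import math

def gaussian_noise_entropy(sigma, n_dims):
    """Differential entropy (nats) of N(0, sigma^2 I) in n_dims dimensions."""
    return 0.5 * n_dims * math.log(2.0 * math.pi * math.e * sigma ** 2)

def extra_nats(cross_entropy, sigma, horizon, n_actors, pos_dim):
    """(H(p', p_model) - H(eta)) / (T * A * D).

    The noise entropy is the lowest cross entropy a model could reach on
    the perturbed data, so the result is non-negative for a valid model
    and normalized per predicted coordinate.
    """
    n_dims = horizon * n_actors * pos_dim
    return (cross_entropy - gaussian_noise_entropy(sigma, n_dims)) / n_dims
```

Subtracting the noise entropy and dividing by $T \cdot A \cdot D$ is what makes scores comparable across horizons and actor counts, unlike a raw NLL.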

We believe that HCNAF performed better than PRECOG-ESP, which is a state-of-the-art prediction model in autonomous driving, by taking advantage of the non-linear flow transformation and by having the condition terms affect the hidden states of all layers of HCNAF’s neural-net based AF. Note that PRECOG utilizes bijective transformations that are rooted in affine autoregressive flow, similar to cMAF (see Equation 2). We also believe that HCNAF’s generalization capability is a contributing factor that explains how HCNAF is able to estimate probability densities conditioned on previously unseen contexts.

6 Conclusion

We present HCNAF, a novel universal distribution approximator tailored to model conditional probability density functions. HCNAF extends neural autoregressive flow [11] to take arbitrarily large conditions, not limited to autoregressive conditions, via a hyper-network which determines the network parameters of HCNAF’s AF. By modeling the hyper-network constraint-free, HCNAF enables it to grow arbitrarily large and thus to scale up with respect to the size of non-autoregressive conditions. We demonstrate its effectiveness and capability to generalize over unseen conditions on density estimation tasks. We also scaled HCNAF’s hyper-network to handle larger conditional terms as part of a prediction problem in autonomous driving.


  • [1] R. v. d. Berg, L. Hasenclever, J. M. Tomczak, and M. Welling (2018) Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649.
  • [2] S. Casas, W. Luo, and R. Urtasun (2018) IntentNet: learning to predict intention from raw sensor data. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.), Proceedings of Machine Learning Research, Vol. 87, pp. 947–956.
  • [3] R. Chandra, U. Bhattacharya, A. Bera, and D. Manocha (2019) Traphic: trajectory prediction in dense and heterogeneous traffic using weighted interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8483–8492.
  • [4] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797.
  • [5] N. De Cao, I. Titov, and W. Aziz (2019) Block neural autoregressive flow. arXiv preprint arXiv:1904.04676.
  • [6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [7] I. Goodfellow (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160.
  • [8] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi (2018) Social gan: socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2255–2264.
  • [9] D. Ha, A. Dai, and Q. V. Le (2016) Hypernetworks. arXiv preprint arXiv:1609.09106.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [11] C. Huang, D. Krueger, A. Lacoste, and A. Courville (2018) Neural autoregressive flows. In International Conference on Machine Learning, pp. 2083–2092.
  • [12] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
  • [13] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • [14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
  • [15] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
  • [16] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker (2017) Desire: distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 336–345.
  • [17] J. Li, H. Ma, and M. Tomizuka (2019) Interaction-aware multi-agent tracking and probabilistic behavior prediction via adversarial learning. arXiv preprint arXiv:1904.02390.
  • [18] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski (2018) An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems, pp. 9605–9616.
  • [19] W. Luo, B. Yang, and R. Urtasun (2018) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [20] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • [21] G. Papamakarios, T. Pavlakou, and I. Murray (2017) Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347.
  • [22] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine (2019) PRECOG: prediction conditioned on goals in visual multi-agent settings. arXiv preprint arXiv:1905.01296.
  • [23] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese (2019) Sophie: an attentive gan for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1349–1358.
  • [24] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
  • [25] Y. C. Tang and R. Salakhutdinov (2019) Multiple futures prediction. arXiv preprint arXiv:1911.00997.
  • [26] A. van den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315.

A Ablation Study on the Virtual Simulator Dataset

In this section, we present the results of an ablation study conducted on the test set (10% of the dataset) of the Virtual Simulator Dataset to investigate the impact of different hyper-network inputs on the POM forecasting performance. As mentioned in Sections 4 and 5.1, each 5-second snippet is divided into a 1-second-long history and a 4-second-long prediction horizon. All the inputs forming the conditions used in the ablation study are extracted from the history portion and are listed below.

  1. : historical states of the reference car in pixel coordinates,

  2. : historical states of the actors excluding the reference car in pixel coordinates and up to 3 closest actors,

  3. : stop-sign locations in pixel coordinates, and

  4. : bev images of size . may include all, or a subset of the following channels: stop signs at , street lanes at , reference car & actors images over all time-steps t=[-1,0].

NLL, t = 2s    -8.519   -8.015   -7.905   -8.238   -8.943   -8.507
NLL, t = 4s    -6.493   -6.299   -6.076   -6.432   -7.075   -6.839
Table S1: Ablation study on the Virtual Simulator Dataset. Each column is one of the six trained models; the fifth column is the best model. The evaluation metric is negative log-likelihood (NLL). Lower values are better.

As presented in Table S1, we trained 6 distinct models to output for . The six models can be grouped into two sets depending on the that was used. The first group consists of the models that do not utilize any bev map information, therefore . The second group leverages all bev images . Each group can be divided further, depending on whether a model uses and . The HCNAF model that takes all bev images from the perception model as the conditions (see Table 1), excluding the historical states of the actors and the stop-signs, is denoted by the term best model, as it reported the lowest NLL. We use to represent a model that takes as the conditions.

Note that the hyper-network depicted in Figure 5 is used for the training and evaluation, but the components of the hyper-network change depending on the conditions. We also stress that the two modules of HCNAF (the hyper-network and the conditional AF) were trained jointly. Since the hyper-network is a regular neural network, its parameters are updated via back-propagation of the loss.

As shown in Table S1, the second group () performs better than the first group (). Interestingly, we observe that the model performs better than . We suspect that this is due to the use of imperfect perception information. That is, not all the actors in the scene were detected, and some actors were only partially detected; they appeared and disappeared over the 1-second-long history. The presence of non-compliant or abnormal actors may also be a contributing factor. When comparing and we see that the historical information of the surrounding actors did not improve performance. In fact, the model that only utilizes at time performs better than the one using across all time-steps. Finally, having the stop-sign locations as part of the conditions helps, as many snippets covered intersection cases. When comparing and , we observe that adding the states of actors and stop-signs in pixel coordinates to the conditions did not improve the performance of the network. We suspect that this is mainly due to the same reason that performs better than .

B Implementation Details on Toy Gaussian Experiments

For the toy Gaussian experiment 1, we used the same number of hidden layers (2), hidden units per hidden layer (64), and batch size (64) across all autoregressive flow models: AAF, NAF, and HCNAF. For NAF, we utilized the conditioner (transformer) with 1 hidden layer and 16 sigmoid units, as suggested in [11]. For HCNAF, we modeled the hyper-network with two multi-layer perceptrons (MLPs), each taking a condition and outputting W and B. Each MLP consists of 1 hidden layer with a ReLU activation function. All the other parameters were set identically, including those for the Adam optimizer (the learning rate decays by a factor of 0.5 every 2,000 iterations with no improvement in validation samples). The NLL values in Table 1 were computed using 10,000 samples.
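The learning-rate rule described above (halve the rate after 2,000 iterations without validation improvement) can be sketched as follows. This is a minimal illustration, not the authors' code: the class name `PlateauHalving`, the default initial rate, and the reading of "no improvement" as "the validation loss has not dropped below its best value so far" are all assumptions.

```python
class PlateauHalving:
    """Halve the learning rate after `patience` iterations without improvement.

    Hypothetical helper: the name, the default initial rate, and the strict
    'lower than the best so far' improvement test are assumptions.
    """

    def __init__(self, lr=1e-3, factor=0.5, patience=2000):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")  # best validation loss seen so far
        self.stale = 0            # iterations since the last improvement

    def step(self, val_loss):
        """Record one iteration's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0    # start counting again after a decay
        return self.lr
```

In a training loop, `step` would be called once per iteration with the current validation loss, and the returned value would be written into the optimizer's learning rate.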

For the toy Gaussian experiment 2, we used 3 hidden layers, 200 hidden units per hidden layer, and a batch size of 4. We modeled the hyper-network the same way as for the toy Gaussian experiment 1. The NLL values in Table 2 were computed using 10,000 test samples from the target conditional distributions.

C Number of Parameters in HCNAF

In this section we discuss the computational costs of HCNAF with different model choices. We denote and as the flow dimension (the number of autoregressive inputs) and the number of hidden layers in a conditional AF. In case of , there exists only 1 hidden layer between and . We denote as the number of hidden units in each layer per flow dimension of the conditional AF. Note that the outputs of the hyper-network are W and B. The number of parameters for W of the conditional AF is and that for B is .

The number of parameters in HCNAF’s hyper-network is largely dependent on the scale of the hyper-network’s neural network and is independent of the conditional AF except for the last layer of the hyper-network as it is connected to W and B. The term represents the total number of parameters in the hyper-network up to its th layer, where denotes the number of layers in the hyper-network. is the number of hidden units in the th (the last) layer of the hyper-network. Finally, the number of parameters for the hyper-network is given by .

The total number of parameters in HCNAF is therefore a summation of , , and . The dimension grows quadratically with the dimension of flow , as well as for . The key to minimizing the number of parameters is to keep the dimension of the last layer of the hyper-network low. That way, the layers in the hyper-network, except the last layer, are decoupled from the size of the conditional AF. This allows the hyper-network to become large, as shown in the POM forecasting problem, where the hyper-network takes conditions with a few million dimensions.
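To make the scaling argument concrete, the sketch below counts parameters for a conditional AF built from dense layers. It is an illustration under stated assumptions rather than the paper's exact formulas: the names `D`, `L`, and `H` (flow dimension, number of hidden layers, hidden units per layer per flow dimension) and the layer shapes (`D -> D*H -> ... -> D*H -> D`, with full weight matrices and one bias per unit) are inferred from the definitions above, and the hyper-network's last layer is treated as a bias-free dense layer that emits every entry of W and B.

```python
def af_param_counts(D, L, H):
    """Return (n_W, n_B) for a dense network D -> D*H -> ... -> D*H -> D.

    Assumed layout: L hidden layers of width D*H, full (unmasked) weight
    matrices, and one bias per hidden and output unit.
    """
    width = D * H
    n_w = D * width                 # input layer
    n_w += (L - 1) * width * width  # hidden-to-hidden layers
    n_w += width * D                # output layer
    n_b = L * width + D             # hidden biases plus output biases
    return n_w, n_b


def hypernet_param_count(n_body, h_last, D, L, H):
    """Hyper-network size: earlier layers plus a last layer emitting W and B.

    n_body: parameters in all hyper-network layers before the last one
            (taken as given; it does not depend on the conditional AF).
    h_last: number of hidden units feeding the last layer.
    """
    n_w, n_b = af_param_counts(D, L, H)
    return n_body + h_last * (n_w + n_b)
```

Under these assumptions the weight count equals `D**2 * H * (2 + (L - 1) * H)`, so doubling `D` quadruples it, while `n_body` can grow freely as long as `h_last` stays small.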

D Conditional Density Estimation on MNIST

The primary use of HCNAF is to model conditional probability distributions when the dimension of the conditions (i.e., inputs to the hyper-network of HCNAF) is large. For example, the POM forecasting task operates on large-dimensional conditions with 1 million and works with small autoregressive inputs = 2. Since the number of parameters of HCNAF’s conditional AF grows quickly as increases (see Section C), and since the conditions greatly influence the hyper-parameters of the conditional AF module (Equation 4), HCNAF is ill-suited for density estimation tasks with . Nonetheless, we decided to run this experiment to verify that HCNAF would compare well with other recent models. Table S2 shows that HCNAF achieves state-of-the-art performance on the conditional density estimation task.

MNIST is an example where the dimension of autoregressive variables ( = 784) is large and much bigger than = 1. MNIST images (size 28 by 28) belong to one of the 10 numeral digit classes. While the unconditional density estimation task on MNIST has been widely studied and reported for generative models, the conditional density estimation task has rarely been studied. One exception is the study of conditional density estimation tasks presented in [21]. In order to compare the performance of HCNAF on MNIST (), we followed the experiment setup from [21]. It includes the dequantization of pixel values and the translation of pixel values to logit space. The objective function is to maximize the joint probability over conditioned on classes of as follows.

Models Conditional NLL Bits Per Pixel
Gaussian 1344.7 1.97
MADE 1361.9 2.00
MADE MoG 1030.3 1.39
Real NVP (5) 1326.3 1.94
Real NVP (10) 1371.3 2.02
MAF (5) 1302.9 1.89
MAF (10) 1316.8 1.92
MAF MoG (5) 1092.3 1.51
HCNAF (ours) 975.9 1.29
Table S2: Test negative log-likelihood (in nats, logit space) and bits per pixel for the conditional density estimation task on MNIST. Lower values are better. Results for models other than HCNAF were taken from [21]. HCNAF is the best model among the conditional flow models listed.

For the evaluation, we computed the test log-likelihood on the joint probability as suggested in [21]. That is, the joint probability is the product of the class-conditional probability and the label prior, which is a uniform prior over the 10 distinct labels. Accordingly, the log-likelihood in logit space was converted to bits per pixel as elaborated in [21].
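The preprocessing and evaluation steps above can be sketched in a few lines. This is a hedged illustration derived from the standard change-of-variables argument, not code from the paper or from [21]: the function names, the symbol `lam` for the squeeze applied before the logit, and its default value are assumptions, and dequantized pixel values are assumed to lie in [0, 256).

```python
import math
import random


def to_logit_space(pixels, lam=1e-6, rng=random):
    """Dequantize integer pixels in {0..255} and map them to logit space."""
    out = []
    for p in pixels:
        x = (p + rng.random()) / 256.0   # uniform dequantization to [0, 1)
        s = lam + (1.0 - 2.0 * lam) * x  # squeeze away from 0 and 1
        out.append(math.log(s) - math.log(1.0 - s))
    return out


def joint_log_likelihood(cond_ll, num_classes=10):
    """log p(x, y) = log p(x | y) + log p(y) under a uniform label prior."""
    return cond_ll - math.log(num_classes)


def _log2_sigmoid(v):
    """log2 of the logistic sigmoid of v."""
    return -math.log1p(math.exp(-v)) / math.log(2.0)


def bits_per_pixel(ll_logit, z, lam=1e-6):
    """Convert one image's logit-space log-likelihood (nats) to bits per pixel.

    z holds the image's logit-space values. The correction terms are the
    log-Jacobian of the map from [0, 256) pixel values to logit space.
    """
    d = len(z)
    jac = sum(_log2_sigmoid(v) + _log2_sigmoid(-v) for v in z) / d
    return (-ll_logit / (d * math.log(2.0))
            - math.log2(1.0 - 2.0 * lam) + 8.0 + jac)
```

Because the Jacobian terms depend only on the data, the gap between two models evaluated on the same test set reduces to the NLL gap divided by `784 * ln 2`. For the MAF (5) and HCNAF rows of Table S2 this gives (1302.9 - 975.9) / (784 ln 2) ≈ 0.60, which matches the reported 1.89 and 1.29 bits per pixel.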

For the HCNAF presented in Table S2, we used and for the conditional AF module. For the hyper-network, we used , for W, for B, and 1-dimensional label as the condition .

E Detailed Evaluation Results and Visualization of POMs for PRECOG-Carla Dataset

Figure S1: Detailed evaluation results on the PRECOG-Carla test set per time-step for the HCNAF and HCNAF(No lidar) models that are described in Table 3. AVG in the plot indicates the averaged extra nats of a model over all time-steps (i.e., ). Note that the x-axis time steps are 0.2 seconds apart, so corresponds to t = 4sec into the future and that there is no upper bound of as . As expected, the POM forecasts are more accurate (closer to the target distribution ) at earlier time-steps, as the uncertainties grow over time. For all time-steps, the HCNAF model with lidar approximates the target distribution better than the HCNAF model without lidar. Both with and without lidar, HCNAF outperforms a state-of-the-art prediction model, PRECOG-ESP [22].
Figure S2: POM forecasts on the PRECOG-Carla dataset using the HCNAF model described in Table 3 (with lidar). Left: 2-second history of cars. Center and right: probabilistic occupancy predictions for car 1 at t = 2 and 4 secs depicted as red heatmaps, with the actor ground truth (blue square) overlaid. Note that we only forecast POMs for car 1, as the lidar data is only available for car 1. In example 1, where car 1 enters a 3-way intersection, our forecasts capture the two natural options (left-turn & straight) and depict the probabilities of all possible positions as heatmaps. In example 2, HCNAF uses the curved road geometry and successfully forecasts the occupancy probabilities of car 1. Example 3 depicts a 3-way intersection with a queue formed by two other cars in front of car 1. HCNAF uses the interactions coming from the front cars and correctly forecasts that car 1 is likely to stop due to the other vehicles in front of it. In addition, our model captures the possibility that the queue has resolved by t = 4 sec and accordingly predicts occupancy at the tail. Example 4 illustrates car 1 just starting to turn left as it enters the 3-way intersection. The POM forecast for t = 4 is an ellipse with a longer lateral axis, which reflects the higher uncertainty in the lateral position of car 1 after the turn.
Figure S3: POM forecasts on the PRECOG-Carla dataset (cont.). In examples 5 and 7, car 1 enters 3-way intersections; HCNAF uses the road geometry coming from the lidar data and correctly forecasts that there are two modes (left-turn & right-turn). The example 6 POM shows that HCNAF successfully forecasts the multi-modal distribution (straight & right-turn). Example 8 depicts a car traveling at high speed; the POM forecasts are spread widely along the longitudinal axis. Finally, example 9 shows a car entering a 4-way intersection at high speed. HCNAF takes into account the fact that car 1 has been traveling at high speed and predicts that it is unlikely to turn left or right, forecasting that car 1 will pass through the intersection.