A non-homogeneous Poisson process (NHPP) is a Poisson point process that has variable intensity in the domain on which it is defined.
NHPPs are commonly used in a wide range of applications, for example when modeling failures of repairable systems (Lindqvist2006), earthquake occurrence (Hong1995), or the evolution of customer purchase behavior (Letham2016).
A NHPP defined on can be fully characterized through its intensity function . The intensity function is usually of considerable scientific interest, and both parametric and nonparametric methods have been proposed to model it. A parametric approach assumes that the intensity function has a known parametric form, and that the model parameters can be estimated using, for example, likelihood-based methods (e.g., Zhao1996). The specified functional form is, however, often too restrictive an assumption in practice. Non-parametric techniques, on the other hand, do not fix the functional form of the intensity function. Methods in this class for modeling the intensity function include ones that are spline-based (Dias2008), wavelet-based (Kolaczyk1999; Miranda2011), and kernel-based (Diggle1985). While non-parametric methods offer greater modeling flexibility, they often do not scale well with the number of observed points or the dimension .
Bayesian methods can be adopted for intensity function estimation if one has prior knowledge (e.g., on the function’s smoothness) that could be used. This prior knowledge is often incorporated by treating the intensity function as a latent stochastic process; the resulting model is called a doubly-stochastic Poisson process, or Cox process (Moller1998)
. One popular variant of the Cox process is the trans-Gaussian Cox process, where a transformation of the intensity function is a Gaussian process (GP). Inference for such models typically requires Markov chain Monte Carlo methods(Adams2009), which scale poorly with the number of observed points and dimension . Approximate Bayesian methods such as variational inference (Zammit2011; Lloyd2015), or Laplace approximations (Illian2012), often impose severe, and sometimes inadequate, restrictions on the functional form of the posterior distributions.
The models for the intensity function discussed above either place assumptions on the intensity function that are overly restrictive, or require computational methods that are inefficient, in the sense that they do not scale well with data size and/or the dimension
. Here, we present a new model for the intensity function that overcomes both limitations. The model finds its roots in transportation of probability measure(Marzouk2016)
, an approach that has gained popularity recently for its ability to model arbitrary probability density functions. The basic idea of this approach is to construct a “transport map” between the complex, unknown, intensity function of interest, and a simpler, known, reference intensity function.
We use a map that is sufficiently complex for it to approximate arbitrary intensity functions on subsets of , and one that is easy to fit to observational data. Specifically, we construct a transport map through compositions of several simple increasing triangular maps (Marzouk2016), in a procedure sometimes referred to as map stacking (Papamakarios2017). Our model has the “universal property” (Hornik1989), in the sense that a large class of intensity functions can be approximated arbitrarily well using this approach. We estimate the parameters in the map using an optimization framework wherein computations are carried out efficiently on graphics processing units using recent technological advances in deep learning. We also develop a technique to efficiently generate a realization from the fitted point process, and a nonparametric bootstrap approach (Efron1981) to quantify uncertainties on the estimated intensity function via the stack of increasing triangular maps.
The article is organized as follows. Section 2 establishes the notation and the required theoretical background on transportation of probability measures, while Section 3 presents our proposed method for intensity function modeling and estimation of NHPPs. Results from simulation and real-application experiments are given in Section 4. Section 5 concludes. Additional technical material is provided in Appendix A.
2 Transportation of Probability Measure
Our methodology for intensity function modeling in Section 3 is based on measure transport, and techniques that enable it for density estimation. In Section 2.1 we briefly describe the optimal transport problem, and how this leads to the class of increasing triangular maps. In Section 2.2 we discuss parameterizations of increasing triangular maps and the one we choose in our approach to modeling the intensity function, while in Section 2.3 we briefly discuss the composition of such maps in a deep learning framework.
2.1 Optimal Transport and Increasing Triangular Maps
Consider two probability measures and defined on and , respectively. A transport map is said to push forward to (written compactly as ) if and only if
The inverse is treated in the general set valued sense, that is, if . If is injective, then the relationship in (1) can also be expressed as
A transport map satisfying (1) represents a deterministic coupling of the probability measures and . An alternative interpretation of the transport map is that if
is a random vector distributed according to the measure, then is distributed according to .
Suppose , and that both and are absolutely continuous with respect to the Lebesgue measure on , with densities and , respectively. Furthermore, assume that the map is bijective with a differentiable inverse (i.e., assume that is a diffeomorphism), then (2) is equivalent to
The conditions under which the map exists are established in Brenier1991 and Mccann1995. Of particular note is that is guaranteed to exist when both and are absolutely continuous. There may exist an infinitely many transport maps that satisfy (1). The optimal transport problem (Villani2009) is concerned with choosing a particular map satisfying (1) such that the total cost of transportation is minimized.
The optimal transport problem is commonly expressed using either Monge’s or Kantorovich’s formulation. Kantorovich’s formulation, which we now briefly describe, generalizes Monge’s, and avoids some of the theoretical issues that arise from that formulation (they are equivalent under mild conditions; see Santambrogio2010). Let be a transport plan where represents the amount of “mass” to be moved from to , and note that any transport map can be represented using a corresponding transport plan . Let be the cost function such that represents the cost of moving to . The Kantorovich’s formulation consists of finding a transport plan such that the total cost of transportation, , is minimized.
Consider absolute continuous measures and defined on , respectively, and suppose that the cost function has the following weighted quadratic form,
where , , and the weights are positive scalars that depend on a parameter . Carlier2009 prove that the corresponding optimal transportation problem leads to a unique solution for the transport map, . Further, if for all one has that as (e.g., if ), then in as , where is the Knothe–Rosenblatt rearrangement (Rosenblatt1952; Knothe1957),
where, for one has that is monotonically increasing in . The map in (4) is said to have an increasing triangular structure. In particular, the Jacobian matrix of an increasing triangular map, if it exists, is triangular with positive entries on its diagonal. Due to its connection with optimal transport, and because of its structure that leads to efficient computations, we will be exclusively considering this class of maps in the following sections.
2.2 Parameterization of Increasing Triangular Maps
Various approaches to parameterize an increasing triangular map have been proposed (see, for example, Germain2015; Dinh2015; Dinh2017). One class of parameterizations is based on the so-called conditional networks (Papamakarios2017; Huang2018). Consider for now a map comprising just one increasing triangular map, which we denote as (we will later consider many of these in composition), and let . When using conditional networks, the increasing triangular map has the following form:
for , where is the th conditional network that takes as inputs and is parameterized by , and is generally a very simple univariate function of , but with parameters that depend in a relatively complex manner on through the conditional network. Therefore, , where is the number of parameters that parameterize . It is often the case that feedforward neural networks are used as the conditional networks (Fine1999). For ease of exposition, from now on we will slightly abuse the notation and denote simply as , thereby omitting the explicit dependence on the inputs and the parameters .
One class of maps using conditional networks is that of masked autoregressive flows (Papamakarios2017). In this class, , and the output of the conditional network parameterizes as
In (6), the univariate function is a linear function of with location parameter and scale parameter . Monotonicity of , and hence of , is ensured since . Another class of such maps is the class of inverse autoregressive flows, proposed by Kingma2016. In this class, and
is the sigmoid function. In this class of maps, eachoutputs the weighted average of and , with the weights given by and , respectively. Monotonicity of , and hence of , is ensured since .
Both (6) and (7) are generally too simple for modeling density functions or, in our case, intensity functions. In this work we therefore focus on the class of neural autoregressive flows, proposed by Huang2018, which are more flexible. In this class, for and , and the -th component of the map has the form
is the logit function,, , and is subject to the constraint . As with the other two maps discussed above, monotonicity of , and hence of
, is ensured through this construction. The Jacobian of the neural autoregressive flow can be computed using the chain rule since the gradient of bothand are analytically available; this is important for computational purposes since such formulations can easily be handled using automatic differentiation libraries.
Each univariate function in the neural autoregressive flow comprises two sets of smooth, nonlinear transforms: sigmoid functions that map from to the unit interval, and one logit function that maps from the unit interval to . The complexity/flexibility of is largely determined by
. Note that the neural autoregressive flow has a very similar form to the conventional feedforward neural network with sigmoid activation functions.
A natural question to ask is how well an arbitrary density function can be approximated by a density constructed using the neural autoregressive flow. It has been shown that the neural autoregressive flow satisfies the ‘universal property’ for the set of positive continuous probability density functions, in the sense that any target density that satisfies mild smoothness assumptions can be approximated arbitrarily well (Huang2018). We restate the theorem in the context of process densities of NHPPs (defined in Section 3.1), and provide an alternative proof for this property, in Section 3.3.
2.3 Composition of Increasing Triangular Maps
It is well known that a neural network with one hidden layer can be used to approximate any continuous function on a bounded domain arbitrarily well (Hornik1989; Cybenko1989; Barron1994). However, the size of a single layer network (in terms of the number of parameters) that may be required to achieve a desired level of function approximation accuracy may be prohibitively large. This is important as, despite the universal property of the neural autoregressive flow, both the conditional network and the univariate function in (8) may need to be made very complex in order to approximate a target density up to a desired level of accuracy. Specifically, , as well as the number of parameters appearing in the conditional networks, , might be prohibitively large.
Neural networks with many hidden layers, known as deep nets, tend to have faster convergence rates to the target function compared to shallow networks (Eldan2016; Weinan2018). In our case, layering several relatively parsimonious triangular maps through composition is an attractive way of achieving the required representational ability while avoiding an explosion in the number of parameters. Specifically, we let , where , are increasing triangular maps of the form given in (2.2), parameterized using neural autoregressive flows.
The composition does not break the required bijectivity of , since a bijective function of a bijective function is itself bijective. Computations also remain tractable, since the determinant of the gradient of the composition is simply the product of the determinants of the individual gradients. Specifically, consider two increasing triangular maps and , each constructed using the neural network approach described above. The composition is bijective, and its gradient at some has determinant,
Further, since the maps have a triangular structure, the Jacobian at some point is a triangular matrix, for which the determinant is easy to compute. The determinant of the composition evaluated at is hence also easy to compute.
3 Intensity Modeling and Estimation via Measure Transport
Consider a NHPP defined on a bounded domain , and let be the stochastic process characterizing , where is the number of events in . A NHPP defined on is completely characterized by its intensity function , such that where is the corresponding intensity measure. If is constant, the Poisson process is homogeneous. In this section we present our approach for modeling and estimating from observational data.
3.1 The Optimization Problem
The density of a Poisson process does not exist with respect to the Lebesgue measure. It is therefore customary to instead consider the density of the NHPP of interest with respect to the density of the unit rate Poisson process, that is, the process with . We denote the resulting density as . Let denote the Lebesgue measure of a bounded set , and let where and be a realization of . The density function evaluated at is given by,
Our objective is to estimate the unknown intensity function that generates the data . A commonly employed strategy is to estimate using maximum likelihood. It is well known that maximizing the likelihood is equivalent to minimizing the Kullback–Leibler (KL) divergence between the true density and the estimate. For two probability densities and , the KL divergence is defined as . We therefore estimate the unknown intensity function , which we denote as , as follows,
where is some set of intensity functions, and is the density of a NHPP with intensity function taken with respect to the density of the unit rate Poisson process.
To solve the optimization problem defined in (10), we first derive the following expression for the KL divergence between two arbitrary densities.
KLConsider two Poisson processes , on with intensity functions and , respectively. Denote the corresponding densities with respect to the unit rate Poisson process as and is:
We give a proof for Proposition 3.1 in Appendix A.1.
In order to apply the measure transport approach to intensity function estimation, we first define and so that and are valid density functions with respect to Lebesgue measure. In particular, and are termed process densities by Taddy2010, which can be modeled separately from the integrated intensities and , respectively. The KL divergence can be written in terms of process densities as follows,
where “const.” captures other terms not dependent on or . This formulation allows us to model the integrated intensity , and the density separately. The same approach was also adopted by Taddy2010 where they developed a nonparametric Dirichlet process mixtures framework for intensity function estimation. The integral and are not analytically available since the true intensity function is unknown. However, treating this integral as an expectation, we see that, for reasonably large , it can be approximated by
where recall that is the (observed) point-process realization under the true intensity function . Similarly, by Poissonicity of the NHPP, is sufficient for , and therefore we approximate the integrated intensity as
Using the process-density representation of the intensity function, and the Monte Carlo approximations (12) and (13), we re-express the optimization problem (10) in terms of the estimate of the integrated intensity, , and the estimated process density ,
where now is some set of process densities, which we will establish in Section 3.2. It is easy to see that setting minimizes the objective function in (14). Fixing leads us to the optimization problem
which is equivalent to maximizing the likelihood of observing .
3.2 Modeling the Process Density via Probability Measure
We model the process density using the transportation of probability measure approach described in Section 2. Specifically, we seek a diffeomorphism , where need not be the same as , such that
where is some simple reference density on , and . Popular choices of
include the standard normal distribution if
is unbounded, and the uniform distribution ifis bounded.
While the domain and the range of the map can be arbitrary subsets of , it is generally easier to construct transport maps from to . The domain is bounded, and therefore we can assume, without loss of generality, that , and we first apply an element-wise logit transform to each coordinate of the vector to obtain the vector , where . The Jacobian of this transformation is given by . We subsequently construct the transport map as a composition of increasing triangular maps (see Section 2.3). Each triangular map in the composition is parameterized using a conditional network approach as detailed in Section 2.2. Specifically, we adopt the neural autoregressive flow where the th component of each triangular map is modeled as in (8).
Denote the parameters appearing in the th conditional network associated with the th layer as and let . The optimization problem (15) reduces to the optimization problem:
The optimization problem can be solved efficiently using automatic differentiation libraries, stochastic gradient descent, and graphics processing units for efficient computation. We usedPyTorch for our implementation (Paszke2017) and adapted the code provided by Huang2018.
3.3 Universal Approximation
The increasing triangular maps constructed using neural autoregressive flows (8) satisfy a universal property in the context of probability density approximation. This universal approximation property naturally applies to the process density of a Poisson process. One need only prove this property for the case of one triangular map since, if two maps have the universal property, their composition also has the universal property.
Theorem1Let be a non-homogeneous Poisson process with positive continuous process density on with corresponding measure . Let be a probability measure on with a positive continuous density. There exists a sequence of triangular mappings wherein the th component of each map has the form , and wherein the corresponding conditional networks are universal approximators (e.g., feedforward neural networks), such that the pushforward measure weakly.
A proof for Theorem 3.1 appears in Huang2018. In Appendix A.2 we provide a different and more straightforward proof of this theorem.
While weak convergence of measures is important, it does not necessarily imply pointwise convergence of the corresponding process densities, which is central to intensity function estimation. In the next theorem, we establish conditions under which pointwise convergence of the process densities is guaranteed. For ease of exposition we only consider the one dimensional case when establishing the conditions, but the generalization to higher dimensions is straightforward.
Theorem2Let be a non-homogeneous Poisson process with positive continuous process density on with corresponding measure . Let be a probability measure on with a positive continuous density , and assume that there exists a sequence of maps such that weakly. Furthermore, assume that and are equi-continuous, is bounded for each , and that the reference density is Lipschitz continuous and bounded. Then
with respect to the sup norm on any compact subset of .
We give a proof for Theorem 3.2 in Appendix A.3. Theorem 3.2 is important as it states that, under the specified conditions, any Lipschitz continuous and bounded intensity function can be approximated arbitrarily well pointwise when using increasing triangular maps. We note that while the conditions themselves appear to be quite mild, verifying them when is an increasing triangular map is not straightforward.
3.4 Simulating from the fitted point process
An attraction of using measure transport is that one can readily simulate data based on the estimated intensity function without resorting to methods like thinning, which can be inefficient when the Poisson process is highly non-homogeneous. Here, one simulates from the simple, known reference density , and then pushes back the points through the inverse map. Since the maps we use are triangular, their inverse can be found in a relatively straightforward manner.
Consider a point in the reference domain. We give an algorithm for computing , where is a (single) increasing triangular map, in Algorithm 1. Note that inversion under the increasing triangular map involves solving univariate root-finding problems. These problems can be efficiently solved since each component of the map is continuous and increasing.
Now, when we have triangular maps in composition, say, we can compute the inverse by iteratively applying Algorithm 1 using . Algorithm 2 gives an algorithm for simulating from the fitted point process for the case when the reference probability measure is the standard multivariate normal distribution.
3.5 Standard Error Estimation
The intensity function is fitted by solving the problem given in (16
). The standard error of the fitted intensity function can be estimated using a non-parametric bootstrapping approach(Efron1981). We construct bootstrap samples by drawing the number of points
, from a Poisson distribution with rate parameter. For each we then randomly sample points from the observed points with replacement, and fit the process density to each of these bootstrap samples, to obtain estimated process densities . The standard error of the intensity function evaluated at any point
is then obtained by finding the empirical standard deviation of. Algorithm 3 gives a summary.
Standard error estimation can also be performed using a parametric bootstrap approach (Efron1979), where bootstrap samples are obtained from the fitted intensity function. However, this would require running Algorithms 1 and 2 times, which would be considerably more computationally demanding. We therefore do not consider this bootstrap strategy here.
In this section we illustrate the application of our proposed method through simulation experiments (Section 4.1) and in the context of earthquake intensity modeling (Section 4.2). The purpose of the simulation experiments is to demonstrate the validity of our approach and to explore the sensitivity of the estimates to the conditional networks’ structure. All our illustrations are in a one or two dimensional setting, as these cover the majority of applications, but our method is easily scalable to higher dimensions due to the map’s triangular structure should this be needed.
4.1 Simulation Experiments
In this section we illustrate our method on simulated data in both a one dimensional and a two dimensional setting. Our method requires one to specify the number of compositions of triangular maps, the width of the neural network in the triangular maps, the number of layers in the conditional networks (feedforward neural networks), and the width of each layer in each conditional network. In neural networks and deep learning literature, choosing the optimal network structure is an open problem. For shallow feedforward neural networks, that is, neural networks with one or two hidden layers, information criteria based methods (Fogel1991)
and heuristic algorithms(Leung2003; Tsai2006) have been proposed to determine the optimal width of the network. However, to the best of our knowledge, analogous methods are not available for deep neural networks and neural networks with complicated structure. In higher dimensional problems, there is theoretical support for adopting a very deep neural network structure due to its representational power (Eldan2016; Raghu2017). However, since point process realizations typically lie in lower dimensional space, such results are less relevant.
We found that in low-dimensional settings our estimates did not change considerably with the number of layers in the conditional network and therefore, here, we fix the number of layers to one. That is, we let each be the output of a neural network of one layer. In both simulation experiments we set the widths of the neural networks in both the triangular maps (i.e., ) and the conditional networks, to 64. We set this number by successively increasing the network widths in powers of two until the intensity-function estimate was not improved. We also used network widths of 64 in the application case study of Section 4.2.
For the one dimensional study, we first simulated events (via thinning) from the following one-dimensional intensity function,
We then fitted several models to the simulated events, each with a different number of compositions of triangular maps. The procedure of simulating and model fitting was repeated 40 times in order to assess the variability in the estimated intensity functions. Each model fitting required approximately two minutes on a graphics processing unit.
The average and empirical standard deviation of the distance between the true intensity function and the estimated intensity function are shown in Table 1
. Reassuringly, we see that the estimates, and the variability thereof, are consistent across different numbers of compositions of triangular maps, although we observe a slight improvement when having four layers instead of one layer. For reference, we also provide the results from kernel density estimation, where the bandwidth of a Gaussian kernel was chosen using Silverman’s rule of thumb(Silverman1986). In this experiment we observe some improvement in using deep compositional maps over conventional kernel density estimation, although this improvement is not substantial. For illustration, the estimated intensity functions using four compositions of triangular maps are shown in Figure 1.
For the two-dimensional simulation study, we simulated data from the following intensity function:
We used the same procedure as in the one dimensional case study whereby we fitted each model to the events simulated from the intensity function. We again simulated and fit 40 times to assess the variability of our estimates. Each model-fitting required approximately ten minutes on a graphics processing unit.
The average and empirical standard deviation of the distance between the true intensity function and the estimated intensity function are shown in Table 2. As in the one dimensional case, we do not observe substantial differences in the estimates when the number of compositions is varied. In this case, the measure transport approach performed slightly worse than kernel density estimation, although again, the difference is not substantial. The true intensity surface, together with the average and standard deviation of the estimated intensity surfaces across the 40 simulations, for the case of four compositions of triangular maps, are shown in Figure 2
. We see from the plots that the proposed method manages to recover the true intensity surface on average, and that the variability in the estimation is large when the true intensity is large. This is expected since the variance of a Poisson random variable is proportional to its mean.
In summary, these experiments illustrate that our method based on measure transport is computationally efficient, and that it does not overfit as the number of compositions of triangular maps increases. Choosing the number of compositions using a formal model selection approach is desirable; however, as quantifying the complexity of the proposed model is difficult, various information criteria such as the Bayesian Information Criterion (BIC) are not applicable. In these low-dimensional settings we do not observe any improvement over kernel density estimation for intensity function estimation, and the strengths of a measure transport approach appear to lie in the ease with which one can conditionally simulate, do uncertainty quantification, and model point processes in a high dimensional space.
|No. of compositions of triangular maps||1||2||3||4||5||KDE|
|Standard deviation of distance||10.2||8.4||7.6||11.3||13.6||15.5|
|No. of compositions of triangular maps||1||2||3||4||5||KDE|
|Standard deviation of distance||37.3||35.9||36.3||35.3||34.2||14.9|
4.2 Modeling Earthquake Data
In this section we apply our method for intensity function estimation to an earthquake data set comprising 1000 seismic events of body-wave magnitude (MB) over 4.0. The data set is available from the R datasets package. The events we analyze are those that occurred near Fiji from 1964 onwards. The left panel of Figure 3 shows a scatter plot of locations of the observed seismic events.
We fitted our model using a composition of five triangular maps. The estimated intensity surface and the standard error surface obtained using Algorithm 3 are shown in the middle and right panels of Figure 3, respectively. As was observed in the simulation experiments, we see that the estimated standard error is large in areas where the estimated intensity is high. The probability that the intensity function exceeds various threshold can also be estimated using non-parametric bootstrap resampling; some examples of these exceedance probability plots are shown in Figure 4.
A ubiquitous model used in such applications is the log-Gaussian Cox process (LGCP). For comparative purposes, here we fit an LGCP using the package inlabru (Bachl_2019), with the latent Gaussian process equipped with a constant (unknown) mean and a Matérn covariance function with smoothness parameter
. The Gaussian process was approximated via a stochastic partial differential equation(Lindgren_2011) on a mesh comprising 2482 vertices. Approximate inference and prediction required about two minutes on a fine grid comprising 15262 pixels.
For both our intensity-function estimate, and the posterior median intensity function from the LGCP, we compute a QQ-plot. In this QQ-plot we use the horizontal axis to represent the fitted quantiles from the density estimate and the vertical axis to represent the empirical quantiles obtained from the observational data; the identity line is used to denote perfect agreement. We see from Figure5 that the intensity function estimates using both models are reasonable, with that obtained using measure transport slightly better. The Kolmogorov-Smirnov statistic for our approach is 0.039 while that from the LGCP is 0.048.
This paper develops a general and scalable approach to the problem of modeling and estimating the intensity function of a non-homogeneous Poisson process. We leverage the measure transport framework through compositions of triangular maps to model the unknown intensity function, and utilize deep learning technologies for efficient inference. The developed model is shown to have the universal property whereby any positive continuous intensity function can be approximated arbitrarily well.
Our experiments clearly show that in the low-dimensional settings typically associated with spatial problems, the benefit of measure transport for intensity function estimation over simpler methods such as KDE, and the use of LGCPs, is not substantial. However, the measure transport framework brings with it other advantages. Notably, the use of a simple reference density allows one to easily simulate point processes, and back transform the coordinates to the original space, with little effort. This leads to an efficient simulation algorithm, as well as an efficient bootstrap algorithm for uncertainty quantification. Second, our approach has the potential to recover spatial properties (such as anisotropy and nonstationarity), that would require additional modeling effort with models such as the LGCP, or more sophisticated kernels with KDE. Finally, our approach is highly scalable, and can be extended to higher dimensional spaces with no modification to the underlying software.
There are several possible avenues for future work. First, in this work we have only considered low-dimensional spatial problems. However, the measure transport approach naturally extends to higher-dimensional spaces. For spatio-temporal point processes, for example, one could simply add an additional, temporal, dimension to the two-dimensional spatial model. Second, a simple way to incorporate covariate information, which is not as straightforward as in an LGCP, say, will be important for the approach to find wide applicability in a practical setting.
Code to reproduce the results in the simulation and real-data illustrations is available as supplementary material.
AZ–M was supported by the Australian Research Council (ARC) Discovery Early Career Research Award, DE180100203. The authors would like to thank Noel Cressie for helpful discussions on bootstrapping.
Appendix A Proof of Results
a.1 Proof of Proposition 3.1
By definition of the KL divergence,
where is the expectation taken with respect to the density . By Campbell’s theorem, we have that
Therefore, from (9),
Combining these two equalities completes the proof. ∎
a.2 Proof of Theorem 3.1
Here we provide an alternative proof to that in Huang2018 that relies on establishing four lemmas. The first lemma is a corollary of Proposition 2.5 of Bogachev2005.
Let and be probability measures on with continuous positive densities with respect to the Lebesgue measure. Then there exists an increasing continuous triangular map such that
Lemma A.1 implies that it is sufficient to consider the space of increasing continuous triangular maps when one is seeking to push forward the measure to another, usually simpler, reference measure . The following two lemmas show that the triangular maps constructed using the neural autoregressive flows are indeed dense in the space of continuous increasing triangular maps.
The set of functions
is dense in the space of continuous monotonically increasing functions on any compact interval .
Define where . Choose such that , and
Define . Let with , chosen such that
For , let where are chosen such that
Let for . We now show by induction that, under this construction,
for . Consider . We have, for ,
Assume that . Consider for .
First, for ,
Second, for ,