Combines a smooth acyclicity constraint for learning a DAG with normalizing flows, in order to replace autoregressive transformations while keeping a tractable Jacobian.
Normalizing flows model complex probability distributions by combining a base distribution with a series of bijective neural networks. State-of-the-art architectures rely on coupling and autoregressive transformations to lift up invertible functions from scalars to vectors. In this work, we revisit these transformations as probabilistic graphical models, showing that a flow reduces to a Bayesian network with a pre-defined topology and a learnable density at each node. From this new perspective, we propose the graphical normalizing flow, a new invertible transformation with either a prescribed or a learnable graphical structure. This model provides a promising way to inject domain knowledge into normalizing flows while preserving both the interpretability of Bayesian networks and the representation capacity of normalizing flows. We demonstrate experimentally that normalizing flows built on top of graphical conditioners are competitive density estimators. Finally, we illustrate how inductive bias can be embedded into normalizing flows by parameterizing graphical conditioners with convolutional networks.
Normalizing flows (NFs, Rezende and Mohamed, 2015) have proven to be an effective way to model complex data distributions with neural networks. These models map data points to latent variables through an invertible function while keeping track of the change of density caused by the transformation. In contrast to variational auto-encoders (VAEs) and generative adversarial networks (GANs), NFs provide access to the exact likelihood of the data under the model, hence offering a sound and direct way to optimize the network parameters. Normalizing flows have proven to be of practical interest as demonstrated by Oord et al. (2018) for speech synthesis, by Rezende and Mohamed (2015) for variational inference or by Papamakarios et al. (2019b) for simulation-based inference. Yet, their usage as a base component of the machine learning toolbox is still limited in comparison to GANs or VAEs. Recent efforts have been made by Papamakarios et al. (2019a) to define the fundamental principles of flow design and by Durkan et al. (2019) to provide coding tools for modular implementations. We argue that normalizing flows would gain in popularity by offering stronger inductive bias as well as more interpretability.
Sometimes forgotten in favor of more data oriented methods, probabilistic graphical models (PGMs) have been popular for modeling complex data distributions while being relatively simple to build and read (Koller and Friedman, 2009; Johnson et al., 2016). Among PGMs, Bayesian networks (BNs, Pearl and Russell, 2011) offer a balance between modeling capacity and simplicity. Most notably, these models have been at the basis of expert systems before the big data era (e.g. Díez et al. (1997); Kahn et al. (1997); Seixas et al. (2014)) and were commonly used to merge qualitative expert knowledge and quantitative information together. On the one hand, experts stated independence assumptions that should be encoded by the structure of the network. On the other hand, data were used to estimate the parameters of the conditional probabilities/densities encoding the quantitative aspects of the data distribution. These models have progressively received less attention from the machine learning community because of the difficulty to scale to high-dimensional datasets.
Driven by the objective of bringing intuition into normalizing flows and the proven relevance of BNs for combining qualitative and quantitative reasoning, we summarize our contributions as follows:
From the insight that coupling and autoregressive transformations can be reduced to Bayesian networks with a fixed topology, we introduce the more general graphical conditioner for normalizing flows, featuring either a prescribed or a learnable BN topology.
We show that graphical normalizing flows perform well in a large variety of low and high-dimensional tasks. They are not only competitive as a black-box normalizing flow, but also provide an operational way to introduce domain knowledge into neural network based density estimation.
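A normalizing flow is defined as a sequence of invertible transformation steps $g_k: \mathbb{R}^d \to \mathbb{R}^d$ ($k = 1, \dots, K$) composed together to create an expressive invertible mapping $g := g_1 \circ \dots \circ g_K: \mathbb{R}^d \to \mathbb{R}^d$. This mapping can be used to perform density estimation, using $g(\cdot; \theta)$ to map a sample $x \in \mathbb{R}^d$ onto a latent vector $z \in \mathbb{R}^d$ equipped with a density $p_z(z)$. The transformation $g$ implicitly defines a density $p(x; \theta)$ as given by the change of variables formula,
$$p(x; \theta) = p_z(g(x; \theta)) \left| \det J_{g(x; \theta)} \right|, \qquad (1)$$
where $J_{g(x; \theta)}$ is the Jacobian of $g(x; \theta)$ with respect to $x$. The resulting model is trained by maximizing the likelihood of the training dataset $X = \{x^1, \dots, x^N\}$. Unless needed, we will not distinguish between $g$ and $g_k$ for the rest of our discussion.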
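As a small illustration of the change of variables formula, the sketch below evaluates the log-density of a two-dimensional flow made of element-wise affine steps with a standard normal base distribution, and compares it with the closed-form Gaussian density it should reproduce. The function names and the numeric values are illustrative assumptions, not taken from the paper.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of a factored standard normal base distribution p_z."""
    return sum(-0.5 * zi * zi - 0.5 * math.log(2.0 * math.pi) for zi in z)

def flow_logpdf(x, m, s):
    """log p(x) = log p_z(g(x)) + log|det J_g(x)| for g_i(x_i) = x_i * exp(s_i) + m_i."""
    z = [xi * math.exp(si) + mi for xi, mi, si in zip(x, m, s)]
    log_abs_det = sum(s)  # the Jacobian is diagonal with entries exp(s_i)
    return standard_normal_logpdf(z) + log_abs_det

def gaussian_logpdf(x, mean, std):
    """Closed-form reference: independent Gaussians with the given means and stds."""
    return sum(
        -0.5 * ((xi - mu) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2.0 * math.pi)
        for xi, mu, sd in zip(x, mean, std)
    )

# Illustrative values (assumptions): x_i is Gaussian with mean -m_i*exp(-s_i), std exp(-s_i).
x, m, s = [0.3, -1.2], [0.5, -0.1], [0.2, -0.4]
mean = [-mi * math.exp(-si) for mi, si in zip(m, s)]
std = [math.exp(-si) for si in s]
flow_value = flow_logpdf(x, m, s)
reference_value = gaussian_logpdf(x, mean, std)
```

Both values agree up to floating-point error, which is the change of variables formula at work: the log-determinant term compensates for the rescaling applied by the flow.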
In general the steps $g_k$ can take any form as long as they define a bijective map. Here, we focus on a sub-class of normalizing flows for which the steps can be mathematically described as follows:
$$g(x) = \begin{bmatrix} g^1(x_1; c^1(x)) & \dots & g^d(x_d; c^d(x)) \end{bmatrix}^T, \qquad (2)$$
where the $c^i$ are the conditioners, whose role is to constrain the structure of the Jacobian of $g$. The functions $g^i$, partially parameterized by their conditioner, must be invertible with respect to their input variable $x_i$. They are often referred to as transformers; however, in this work we use the term normalizers to avoid any confusion with attention-based transformer architectures.
The conditioners examined in this work can be combined with any normalizers. In particular, we consider affine and monotonic normalizers. An affine normalizer $g: \mathbb{R} \times \mathbb{R}^2 \to \mathbb{R}$ can be expressed as $g(x; m, s) = x \exp(s) + m$, where $m \in \mathbb{R}$ and $s \in \mathbb{R}$ are computed by the conditioner. There exist multiple methods to parameterize monotonic normalizers, but in this work we rely on Unconstrained Monotonic Neural Networks (UMNNs, Wehenkel and Louppe, 2019), which can be expressed as $g(x; c) = \int_0^x f(t, c)\,dt + \beta(c)$, where $c \in \mathbb{R}^{|c|}$ is an embedding made by the conditioner, $f: \mathbb{R}^{|c|+1} \to \mathbb{R}^+$ is a neural network with a strictly positive scalar output, and $\beta(c)$ is a scalar offset.
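To make the monotonic normalizer concrete, here is a minimal sketch of an integral-based monotonic map. The integrand below is a hand-written positive function standing in for the neural network $f$, the embedding is reduced to a single scalar `c`, and the midpoint rule replaces whatever quadrature a real implementation would use; all of these are assumptions for illustration only.

```python
import math

def integrand(t, c):
    """Hypothetical strictly positive function standing in for the network f(t, c)."""
    return math.exp(math.sin(c * t))

def monotonic_normalizer(x, c, beta=0.0, n_steps=200):
    """Midpoint-rule approximation of beta + integral_0^x f(t, c) dt."""
    h = x / n_steps  # signed step, so a negative x integrates backwards from 0
    total = sum(integrand((k + 0.5) * h, c) for k in range(n_steps))
    return beta + total * h

# Because the integrand is positive, the outputs increase with x.
xs = [-2.0, -1.0, 0.0, 0.5, 3.0]
ys = [monotonic_normalizer(x, c=1.5, beta=0.25) for x in xs]
```

The derivative of the map with respect to `x` is the integrand itself, which is why a strictly positive integrand is enough to guarantee invertibility.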
A Bayesian network is a directed acyclic graph (DAG) that represents independence assumptions between the components of a random vector. Formally, let $x = [x_1, \dots, x_d]^T \in \mathbb{R}^d$ be a random vector distributed under $p_x$. A BN associated to $x$ is a directed acyclic graph made of $d$ vertices representing the components $x_i$ of $x$. In this kind of network, the absence of edges models conditional independence between groups of components through the concept of d-separation (Geiger et al., 1990). A BN is a valid representation of a random vector $x$ iff its density can be factorized as:
$$p_x(x) = \prod_{i=1}^{d} p(x_i \mid P_i), \qquad (3)$$
where $P_i = \{j : A_{i,j} = 1\}$ denotes the set of parents of the vertex $i$ and $A \in \{0, 1\}^{d \times d}$ is the adjacency matrix of the BN. As an example, Fig. 1(a) is a valid BN for any distribution over $x$ because it does not state any independence and leads to a factorization that corresponds to the chain rule. However, in practice we seek a sparse and valid BN which models most of the independence between the components of $x$ and leads to an efficient factorization of the modeled probability distribution.
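The link between an adjacency matrix and the factorization it implies can be sketched in a few lines. The convention assumed here is that a 1 at row i, column j means that component j is a parent of component i; the helper names are hypothetical.

```python
def parents(A):
    """Parent index sets P_i = {j : A[i][j] == 1}, using 0-based indices."""
    return [[j for j, a in enumerate(row) if a == 1] for row in A]

def factorization(A):
    """Human-readable factorization of p(x) implied by the adjacency matrix."""
    terms = []
    for i, P in enumerate(parents(A)):
        cond = ", ".join(f"x{j + 1}" for j in P)
        terms.append(f"p(x{i + 1} | {cond})" if P else f"p(x{i + 1})")
    return " ".join(terms)

chain = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]  # x1 -> x2 -> x3
full = [[0, 0, 0], [1, 0, 0], [1, 1, 0]]   # states no independence (chain rule)
chain_factors = factorization(chain)
full_factors = factorization(full)
```

The chain graph drops the dependence of x3 on x1, whereas the full lower-triangular graph recovers the chain rule and therefore fits any distribution.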
To be of practical use, NFs must be composed of transformations for which the determinant of the Jacobian can be computed efficiently; otherwise its evaluation would run in $O(d^3)$. A common solution is to use autoregressive conditioners, i.e., such that the conditioners
$$c^i(x) = h^i\left([x_1 \ \dots \ x_{i-1}]^T\right)$$
are functions $h^i$ of the first $i-1$ components of $x$. This particular form constrains the Jacobian of $g$ to be lower triangular, which makes the computation of its determinant $O(d)$.
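The gain from a triangular Jacobian can be checked numerically. The sketch below compares a generic determinant computed by Gaussian elimination, which is cubic in the dimension, with the product of the diagonal, which is linear. The matrix entries are arbitrary illustrative values.

```python
def det_gaussian_elimination(M):
    """Determinant by Gaussian elimination with partial pivoting (cubic cost)."""
    A = [row[:] for row in M]
    n, det = len(A), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            det = -det  # a row swap flips the sign
        det *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
    return det

def det_triangular(M):
    """Determinant of a triangular matrix: product of the diagonal (linear cost)."""
    result = 1.0
    for i in range(len(M)):
        result *= M[i][i]
    return result

# Illustrative lower triangular Jacobian of an autoregressive step.
J = [[2.0, 0.0, 0.0], [0.5, 3.0, 0.0], [-1.0, 4.0, 0.5]]
generic = det_gaussian_elimination(J)
fast = det_triangular(J)
```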
For the multivariate density $p(x; \theta)$ induced by $g(x; \theta)$ and $p_z(z)$, we can use the chain rule to express the joint probability of $x$ as a product of $d$ univariate conditional densities,
$$p(x; \theta) = p(x_1; \theta) \prod_{i=2}^{d} p(x_i \mid x_{1:i-1}; \theta). \qquad (4)$$
When $p_z(z)$ is a factored distribution $p_z(z) = \prod_{i=1}^{d} p(z_i)$, we identify that each component $z_i$ coupled with the corresponding function $g^i$ encodes for the conditional $p(x_i \mid x_{1:i-1}; \theta)$. Therefore, and as illustrated in Fig. 1(a), autoregressive transformations can be seen as a way to model the conditional factors of a BN that does not state any independence. This becomes clear if we define $P_i = \{1, \dots, i-1\}$ and compare (4) with (3).
The complexity of the conditional factors strongly depends on the ordering of the vector components. While not hurting the universal representation capacity of normalizing flows, the agnostic ordering used in autoregressive transformations leads to poor inductive bias and to factors that are most of the time difficult to learn. In practice, one often alleviates the arbitrariness of the ordering by stacking multiple autoregressive transformations combined with random permutations on top of each other.
Coupling layers (Dinh et al., 2017) are another popular type of conditioners used in normalizing flows. The conditioners $c^i$ made from coupling layers are defined as
$$c^i(x) = \begin{cases} \underline{h}^i & \text{if } i \le k \\ h^i(x_{1:k}) & \text{if } i > k \end{cases}$$
where the underlined $\underline{h}^i$ denote constant values and $k \in \{1, \dots, d\}$ is a hyper-parameter usually set to $\lfloor d/2 \rfloor$. As for autoregressive conditioners, the Jacobian of $g$ made of coupling layers is lower triangular. Again, and as shown in Fig. 1(b) and 1(c), these transformations can be seen as a specific class of BN where $P_i = \emptyset$ for $i \le k$ and $P_i = \{1, \dots, k\}$ for $i > k$. D-separation can be used to read off the independencies stated by this class of BNs, such as the conditional independence between each pair in $x_{k+1:d}$ knowing $x_{1:k}$. For this reason, and in contrast to autoregressive transformations, coupling layers are not by themselves universal density approximators even when associated with very expressive normalizers $g^i$. In practice, these structural independencies can be relaxed by stacking multiple layers, although they can also lead to useful inductive bias, such as in the multi-scale architecture with checkerboard masking (Kingma and Dhariwal, 2018).
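The two classes of BNs discussed so far can be written down as adjacency matrices. The sketch below builds them and counts their edges, assuming the convention that row i lists the parents of component i; the helper names are hypothetical.

```python
def autoregressive_adjacency(d):
    """Every component depends on all previous ones: P_i = {1, ..., i-1}."""
    return [[1 if j < i else 0 for j in range(d)] for i in range(d)]

def coupling_adjacency(d, k):
    """The first k components have no parents; the others depend on the first k."""
    return [[1 if (i >= k and j < k) else 0 for j in range(d)] for i in range(d)]

def n_edges(A):
    return sum(sum(row) for row in A)

def is_strictly_lower_triangular(A):
    d = len(A)
    return all(A[i][j] == 0 for i in range(d) for j in range(i, d))

d = 6
A_autoregressive = autoregressive_adjacency(d)
A_coupling = coupling_adjacency(d, k=d // 2)
```

For d = 6 the autoregressive graph has d(d-1)/2 = 15 edges while the coupling graph with k = 3 has 9, and both matrices are strictly lower triangular.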
Following up on the previous discussion, we introduce the graphical conditioner architecture. We motivate our approach by observing that the topological ordering (or ancestral ordering) of any BN leads to a lower triangular adjacency matrix whose determinant computation is $O(d)$. Therefore, conditioning factors selected by following a BN adjacency matrix necessarily lead to a transformation whose Jacobian determinant remains efficient to compute. We provide a formal proof of this result in Appendix B.
Formally, given a BN with adjacency matrix $A \in \{0, 1\}^{d \times d}$, we define the graphical conditioner as being
$$c^i(x) = h^i(x \odot A_{i,:}), \qquad (5)$$
where $x \odot A_{i,:}$ is the element-wise product between the vector $x$ and the $i$-th row of $A$, i.e., $A_{i,:}$ is used as a mask on $x$. This new conditioner architecture is illustrated in Fig. 2 and leads to NFs that can be inverted by sequentially inverting each component in the topological ordering.
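The masking and the sequential inversion can be sketched with affine normalizers. In the code below, `conditioner` is a hand-written stand-in for the functions $h^i$ (a real model would use a neural network), and the graph and its topological order are given rather than learned; these are all illustrative assumptions.

```python
import math

def conditioner(masked_x, i):
    """Hypothetical stand-in for h^i: maps the masked input to (m, s)."""
    t = sum((j + 1) * v for j, v in enumerate(masked_x))
    return 0.5 * t + 0.1 * i, 0.3 * math.tanh(t)

def forward(x, A):
    """z_i = x_i * exp(s_i) + m_i with (m_i, s_i) computed from x masked by row i of A."""
    z = []
    for i, row in enumerate(A):
        m, s = conditioner([a * xj for a, xj in zip(row, x)], i)
        z.append(x[i] * math.exp(s) + m)
    return z

def inverse(z, A, order):
    """Invert one component at a time, visiting parents before their children."""
    x = [0.0] * len(z)
    for i in order:
        # Only parents pass the mask, and they have already been recovered.
        m, s = conditioner([a * xj for a, xj in zip(A[i], x)], i)
        x[i] = (z[i] - m) * math.exp(-s)
    return x

# DAG that is not triangular in the given ordering: x3 -> x1, x1 -> x2, x3 -> x2.
A = [[0, 0, 1], [1, 0, 1], [0, 0, 0]]
topological_order = [2, 0, 1]
x = [0.7, -0.4, 1.3]
z = forward(x, A)
x_recovered = inverse(z, A, topological_order)
```

The forward pass can evaluate all components in parallel, whereas the inverse has to follow the topological order because each conditioner needs the already inverted parents.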
The graphical conditioner architecture can be used to learn the conditional factors in a continuous BN while elegantly setting structural independencies prescribed from domain knowledge. We also now notice how coupling and autoregressive conditioners are special cases in which the adjacency matrix reflects the classes of BNs discussed in Section 3.
In many cases, defining the whole structure of a BN is not possible due to a lack of knowledge about the problem at hand. Fortunately, not only the density at each node is learnable, but also the DAG structure itself: defining an arbitrary topology and ordering, as it is implicitly the case for autoregressive and coupling conditioners, is unnecessary.
Building upon the work of Zheng et al. (2018), we convert the combinatorial optimization of score-based learning of a DAG into a continuous optimization by relaxing the domain of $A$ to real numbers instead of boolean values. That is,
$$\max_{A \in \{0,1\}^{d \times d}} F(A) \ \ \text{s.t.} \ \ G(A) \in \mathsf{DAGs} \quad \Longleftrightarrow \quad \max_{A \in \mathbb{R}^{d \times d}} F(A) \ \ \text{s.t.} \ \ w(A) = 0, \qquad (6)$$
where $G(A)$ is the graph with adjacency matrix $A$ and $F(A)$ is the likelihood function evaluated through the NF $g$. The constraint $w(A)$ is expressed as (Yu et al., 2019)
$$w(A) := \operatorname{tr}\left(\sum_{i=1}^{d} A^i\right).$$
In the case of positively valued $A$, an element $(i, j)$ of $A^k$ is equal to zero if and only if there exists no path going from node $j$ to node $i$ that is made of exactly $k$ edges. Intuitively, $w(A)$ expresses to which extent the graph is cyclic. Indeed, the element $(i, i)$ of $A^k$ grows with the number of low-cost paths (i.e., paths made of edges that correspond to large values in $A$) of length $k$ from $i$ to itself.
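The behaviour of such an acyclicity measure can be checked on small graphs. The sketch below assumes the sum-of-powers trace form, $w(A) = \operatorname{tr}(\sum_{k=1}^{d} A^k)$, for a non-negative matrix; the function names are hypothetical and the matrices are illustrative.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def acyclicity(A):
    """w(A) = tr(sum_{k=1}^d A^k), assuming all entries of A are non-negative."""
    d = len(A)
    power, total = A, 0.0
    for _ in range(d):
        total += sum(power[i][i] for i in range(d))  # trace of the current power
        power = matmul(power, A)
    return total

dag = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
cyclic = [[0.0, 0.5], [2.0, 0.0]]  # two nodes pointing at each other
w_dag = acyclicity(dag)
w_cyclic = acyclicity(cyclic)
```

The DAG yields exactly zero, while the two-node cycle contributes the product of its edge weights once per node through the trace of the squared matrix.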
In comparison to our work, Zheng et al. (2018) use the quadratic loss on the corresponding linear structural equation model (SEM) as the score function $F(A)$. By attaching normalizing flows to topology learning, our method does not make strong assumptions on the form of the conditional factors because it has a continuously adjustable level of complexity as set by the capacity of the normalizers $g^i$ and the functions $h^i$.
In order to learn the BN topology from the data, the adjacency matrix must be relaxed to contain real values instead of binary ones and, temporarily during training, may not define a DAG. Directly plugging the real matrix $A$ in (5) would be impractical because the quantity of information going from node $j$ to node $i$ would not continuously relate to the value of $A_{i,j}$. Indeed, the information would either be null if $A_{i,j} = 0$ or pass completely if not. Instead, a natural solution consists in adding noise that is proportional to the absolute value of $A_{i,j}$. To this end, we create a stochastic, pseudo-binary-valued matrix from $A$ by passing normalized edge weights through a temperature-controlled sampling step.
The normalization maps the values of $A$ between 0 and 1, being close to 1 for large values and close to zero for values close to 0. A hyper-parameter controls the sampling temperature and is kept fixed in all our experiments. If the magnitude of an entry is large, this transformation maps it to a large value with high probability and to a small value with low probability; if it is small, it is the other way around. In contrast to directly using the matrix $A$, this stochastic transformation creates a direct and continuous relationship between the weight of the edges and the quantity of information that can transit from node $j$ to node $i$.
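One common way to obtain such a stochastic, nearly binary gate is the binary Gumbel-softmax (concrete) relaxation. The sketch below is an assumption about how the idea could be instantiated and is not the exact formula used in the experiments: the normalization in `edge_probability`, the clamping constant, and the temperature value 0.5 are all illustrative choices.

```python
import math
import random

def edge_probability(a):
    """Hypothetical normalization of an edge weight to (0, 1): ~0 near a = 0, ~1 for large |a|."""
    p = 2.0 * (1.0 / (1.0 + math.exp(-2.0 * a * a)) - 0.5)
    return min(max(p, 1e-6), 1.0 - 1e-6)  # clamp so that both logarithms stay finite

def gumbel(rng):
    """Sample from a standard Gumbel distribution."""
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))

def stochastic_gate(a, temperature, rng):
    """Relaxed Bernoulli sample in (0, 1) whose 'on' probability grows with |a|."""
    p = edge_probability(a)
    logit = (math.log(p) + gumbel(rng)) - (math.log(1.0 - p) + gumbel(rng))
    return 1.0 / (1.0 + math.exp(-logit / temperature))

rng = random.Random(0)
strong = [stochastic_gate(3.0, 0.5, rng) for _ in range(500)]  # large edge weight
weak = [stochastic_gate(0.0, 0.5, rng) for _ in range(500)]    # edge weight at zero
```

With these choices, gates attached to large weights stay close to 1 and gates attached to weights near zero stay close to 0, while intermediate weights yield noisy gates that let the gradient discriminate between useful and useless edges.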
We rely on the augmented Lagrangian approach to solve the constrained optimization problem (6) as initially proposed by Zheng et al. (2018). This optimization procedure requires solving iteratively the following sub-problems:
$$\max_{A} \ F(A) - \lambda_t w(A) - \frac{\mu_t}{2} w(A)^2,$$
where $\lambda_t$ and $\mu_t$ respectively denote the Lagrangian and penalty coefficients of the sub-problem $t$. Plugging in the likelihood given by the NF $g$ for $F(A)$, we arrive at the following sequence of problems:
$$\max_{A, \theta} \ \sum_{j=1}^{N} \log p(x^j; A, \theta) - \lambda_t w(A) - \frac{\mu_t}{2} w(A)^2,$$
where $N$ is the number of training samples $x^j$. We solve these optimization problems by mini-batch stochastic gradient ascent. It is worth noting that during training the graph represented by the matrix $A$ can be cyclic, making the computation of the log-likelihood as above only approximately correct. However, the method is shown to converge to a solution that satisfies the constraint, making the final optimization steps correct in terms of the likelihood. The exact optimization procedure used for our experiments is provided in Appendix A.
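The alternation between inner maximization and coefficient updates can be illustrated on a toy problem with two nodes and two candidate edges. Everything in the sketch below is an assumption made for illustration: the quadratic `score` replaces the log-likelihood, the update rules (adding the penalty coefficient times the constraint value to the multiplier, then doubling the penalty) are one common choice rather than the schedule used in the experiments, and the step size is picked by hand for stability.

```python
def score(a, b):
    """Toy stand-in for F(A): prefers edge weights a = 1 and b = 0.3."""
    return -(a - 1.0) ** 2 - (b - 0.3) ** 2

def constraint(a, b):
    """w(A) for A = [[0, a], [b, 0]] with a, b >= 0: tr(A) + tr(A^2) = 2ab."""
    return 2.0 * a * b

def solve(n_outer=8, n_inner=2000):
    a, b, lam, mu = 0.5, 0.5, 0.0, 1.0
    for _ in range(n_outer):
        lr = 0.1 / (1.0 + 4.0 * mu)  # smaller steps as the penalty gets stiffer
        for _ in range(n_inner):
            w = constraint(a, b)
            coeff = lam + mu * w  # derivative of lam*w + (mu/2)*w^2 w.r.t. w
            grad_a = -2.0 * (a - 1.0) - coeff * 2.0 * b
            grad_b = -2.0 * (b - 0.3) - coeff * 2.0 * a
            a = max(a + lr * grad_a, 0.0)  # gradient ascent, keeping weights >= 0
            b = max(b + lr * grad_b, 0.0)
        lam += mu * constraint(a, b)  # multiplier update
        mu *= 2.0                     # penalty update
    return a, b

a, b = solve()
final_score = score(a, b)
final_constraint = constraint(a, b)
```

On this toy problem the procedure keeps the edge that the score values most and drives the competing edge to zero, so that the two-node cycle disappears.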
|Group||Conditioner||POWER||GAS||HEPMASS||MINIBOONE||BSDS300|
|(a)||Coupling||5.60 ± 0.00||3.05 ± 0.01||25.74 ± 0.01||38.34 ± 0.02||-57.33 ± 0.00|
| ||Autoreg.||3.55 ± 0.00||0.34 ± 0.01||21.66 ± 0.01||16.70 ± 0.05 *||-63.74 ± 0.00 *|
| ||Graphical||2.80 ± 0.01 *||-1.99 ± 0.02 *||21.18 ± 0.07 *||19.67 ± 0.06||-62.85 ± 0.07|
|(b)||Coupling||-0.25 ± 0.00||-5.12 ± 0.03||20.55 ± 0.04||32.04 ± 0.12||-107.17 ± 0.46|
| ||Autoreg.||-0.58 ± 0.00||-9.79 ± 0.04||14.52 ± 0.16||11.66 ± 0.02 *||-151.29 ± 0.31|
| ||Graphical||-0.62 ± 0.04 *||-10.15 ± 0.15 *||14.17 ± 0.13 *||16.23 ± 0.52||-155.22 ± 0.11 *|
|(c)||Ar-Affine||-0.14 ± 0.01||-9.07 ± 0.01||17.70 ± 0.01||11.75 ± 0.22||-155.69 ± 0.14|
| ||Ar-Monotonic||-0.63 ± 0.01 *||-10.89 ± 0.70 *||13.99 ± 0.21 *||9.67 ± 0.13 *||-157.98 ± 0.01 *|
Table 1: Average negative log-likelihood on test data over 3 runs; error bars are equal to the standard deviation. Results are reported in nats; lower is better. The best performing architecture per category for each dataset is marked with an asterisk. (a) 1-step affine normalizers, (b) 1-step monotonic normalizers, (c) 5-step autoregressive conditioners.
In these experiments, we compare autoregressive, coupling and graphical conditioners on benchmark tabular datasets for density estimation. We evaluate each conditioner in combination with monotonic and affine normalizers. We only compare NFs with a single transformation step because our focus is on the conditioner capacity. To provide a fair comparison we have fixed in advance the neural architectures used to parameterize the normalizers and conditioners as well as the training parameters by taking inspiration from those used by Wehenkel and Louppe (2019) and Papamakarios et al. (2017). The variable ordering of each dataset is chosen as initially proposed by Papamakarios et al. (2017). All hyper-parameters are provided in Appendix C and a public implementation will be released on Github.
First, Table 1 presents the test negative log-likelihood obtained by each architecture. These results indicate that graphical conditioners offer the best performance in general. Unsurprisingly, coupling layers show the worst performance, due to the arbitrarily assumed independencies. Autoregressive and graphical conditioners show very similar performance for monotonic normalizers, the latter being slightly better on 4 out of the 5 datasets. To contextualize the discussion, the table also provides results obtained with 5-step NFs composed of autoregressive conditioners combined with affine (Papamakarios et al., 2017) and monotonic normalizers. Comparing the results together, we see that while additional steps lead to noticeable improvements for affine normalizers, benefits are questionable for monotonic transformations.
Second, Table 2 presents the number of edges in the BN associated with each flow. For power and gas, the number of edges found by the graphical conditioners is close or equal to the maximum number of edges. Interestingly, graphical conditioners outperform autoregressive conditioners on these two tasks, demonstrating the value of finding an appropriate ordering, particularly when using affine normalizers. On the other datasets, graphical conditioners correspond to BNs that are much sparser than those of autoregressive conditioners while providing equivalent if not better performance.
We now demonstrate how graphical conditioners can be used to fold domain knowledge into NFs by performing density estimation on MNIST images. The design of the graphical conditioner is adapted to images by parameterizing the functions $h^i$ with convolutional neural networks (CNNs) whose parameters are shared for all $i$, as illustrated in Fig. 2. Inputs to the network are masked images specified by both the adjacency matrix $A$ and the entire input image $x$. Using a CNN together with the graphical conditioner allows for an inductive bias suitably designed for processing images. We consider single-step normalizing flows whose conditioners are either coupling, autoregressive or graphical-CNN as described above, each combined with either affine or monotonic normalizers. The graphical conditioners that we use include an additional inductive bias that enforces a sparsity constraint on $A$ and prevents a pixel's parents from being too distant from their descendants in the images. Formally, given a pixel located at $(i, j)$, only the pixels within a fixed distance of $(i, j)$ are allowed to be its parents.
Results reported in Table 3 show that graphical conditioners lead to the best performing affine NFs even if they are made of a single step. This performance gain can probably be attributed to the combination of both learning a masking scheme and processing the result with a convolutional network. These results also show that when the capacity of the normalizers is limited, finding a meaningful factorization is very effective to improve performance. The number of edges in the equivalent BN is about two orders of magnitude smaller than for coupling and autoregressive conditioners. This sparsity is beneficial for the inversion since the evaluation of the inverse of the flow requires a number of steps equal to the depth (Bezek, 2016) of the equivalent BN. Indeed, we find that while obtaining density models that are as expressive, the computational cost of generating samples is reduced in comparison to the autoregressive flows made of 5 steps, which also comprise many more parameters.
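The relation between sparsity and the cost of inversion can be made concrete by computing the depth of a DAG. The sketch below assumes that depth counts the number of sequential levels, that is, one more than the number of edges on the longest path, and that row i of the adjacency matrix lists the parents of node i; the helper names are hypothetical.

```python
def depth(A):
    """Number of sequential levels: 1 + number of edges on the longest path of the DAG."""
    d = len(A)
    memo = {}

    def level(i):
        # A node sits one level after its deepest parent (level 1 without parents).
        if i not in memo:
            memo[i] = 1 + max((level(j) for j in range(d) if A[i][j]), default=0)
        return memo[i]

    return max(level(i) for i in range(d))

d = 5
autoregressive = [[1 if j < i else 0 for j in range(d)] for i in range(d)]
coupling = [[1 if (i >= 2 and j < 2) else 0 for j in range(d)] for i in range(d)]
independent = [[0] * d for _ in range(d)]
depths = (depth(autoregressive), depth(coupling), depth(independent))
```

All nodes on the same level can be inverted in parallel, so a fully autoregressive graph needs d sequential steps, a coupling graph needs 2, and a sparse learned graph falls somewhere in between.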
Fig. 4 shows random samples generated with the architectures we trained on MNIST. These few samples are representative of the beneficial inductive bias produced by the combination of the graphical conditioner with a CNN. While the global structure of the images produced by the different architectures is of similar quality, the images produced by graphical conditioners are more consistent locally. We believe that the possibility granted by graphical conditioners to introduce domain knowledge could be beneficial for a large panel of density estimation applications. Fig. 3 shows the in and out degrees of the equivalent BN learned on MNIST by a NF made of graphical conditioners. It can be observed that most of the connections in the equivalent BN come from the center of the images. In a certain sense, this shows that the graphical conditioner has successfully discovered the most informative pixels and made sense of their relationships.
These experiments show that, in addition to being a favorable tool for introducing inductive bias into NFs, graphical conditioners open the possibility to build BNs for large datasets, unlocking the BN machinery for modern datasets and computing infrastructures.
|(a)||G-Affine (1)||1.81 ± 0.01|
| ||G-Monotonic (1)||1.17 ± 0.03|
|(b)||A-Affine (1)||2.12 ± 0.02|
| ||A-Monotonic (1)||1.37 ± 0.04|
| ||C-Affine (1)||2.39 ± 0.03|
| ||C-Monotonic (1)||1.67 ± 0.08|
|(c)||A-Affine (5)||1.89 ± 0.01|
| ||A-Monotonic (5)||1.13 ± 0.02|
Formal BN topology learning has been studied extensively for more than 30 years and many strong theoretical results on the computational complexity have been obtained. Most of these results however focus on discrete random variables, and how they generalize in the continuous case is yet to be explained. The topic of BN topology learning for discrete variables has been proven to be NP-hard by Chickering et al. (2004). However, while some greedy algorithms exist, they do not lead in general to a minimal I-map, although they allow for an efficient factorization of the distributions of random discrete vectors in most cases. These algorithms are usually separated between the constraint-based family, such as the PC algorithm (Spirtes et al., 2001) or the incremental association Markov blanket (Koller and Friedman, 2009), and the score-based family as used in the present work. Finding the best BN topology for continuous variables has not been proven to be NP-hard; however, the results for discrete variables suggest that without strong assumptions on the function class the problem is hard.
The recent progress made in the continuous setting relies on the heuristic used in score-based methods. In particular, Zheng et al. (2018) showed that the acyclicity constraint required in BNs can be expressed as a continuous function of the adjacency matrix, allowing a Lagrangian formulation to be used. Yu et al. (2019) proposed DAG-GNN, a follow-up work of Zheng et al. (2018) which relies on variational inference and auto-encoders to generalize the method to non-linear structural equation models. Further investigation of continuous DAG learning in the context of causal models was achieved by Lachapelle et al. (2019). They use the adjacency matrix of the causal network as a mask over neural networks to design a score which is the log-likelihood of a parameterized normal distribution. The requirement to pre-define a parametric distribution before learning restricts the factors to simple conditional distributions. In contrast, our method combines the constraints given by the BN topology with NFs, which are free-form universal density estimators. Remarkably, their method leads to an efficient one-pass computation of the joint density; however, this neural masking technique could also be implemented for normalizing flow architectures, as already demonstrated by Papamakarios et al. (2017) and De Cao et al. (2019) for autoregressive conditioners.
As already mentioned, consecutive transformation steps are often combined with randomly fixed permutations in order to mitigate the ordering problem. Linear flow steps (Oliva et al., 2018) generalize these fixed permutations. They are parameterized by a matrix $W = PLU$, where $P$ is the fixed permutation matrix, and $L$ and $U$ are respectively a lower and an upper triangular matrix. Although linear flows improve on the simple permutation scheme, they still rely on an arbitrary permutation. To the best of our knowledge, graphical conditioners are the first attempt to get rid of any fixed permutation in NFs completely.
Graphical conditioners can be seen as a way to learn the right mask for each conditional factor of a factored joint distribution. In this way, the conditioners process their input as they would process the full vector, which allows for an intuitive design of the neural network that is in charge of compressing the useful information contained in the conditioning factors into embeddings. We have shown experimentally that this leads to an effective and efficient parameterization for applying normalizing flows to images. In addition, we have shown that normalizing flows built from graphical conditioners combined with monotonic transformations lead to very expressive density estimators. In effect, this means that some a priori known independencies can be enforced through graphical normalizing flows without hurting their modeling capacity. We believe such models could be of high practical interest because they cope well with large datasets and complex distributions while preserving some readability through their equivalent BN.
The research in NFs could directly benefit causality and SEM thanks to graphical conditioners. Indeed, the graphical conditioners introduced in this work clearly state a BN structure while maintaining the great modeling capacity of normalizing flows. The results obtained by graphical conditioners combined with monotonic transformations on density estimation tasks empirically demonstrate that NFs with a direct BN interpretation are very competitive density estimators. This frees the possibility to combine causal networks and normalizing flows with or without a predefined (sub-)structure. Further research should be performed to work with interventional data by finding how do-calculus can be implemented in the training strategy of graphical conditioners, enhancing them with a causal interpretation.
We have revisited coupling and autoregressive conditioners for normalizing flows as Bayesian networks. From this new perspective, we proposed the more general graphical conditioner architecture for normalizing flows. We have shown that this new architecture compares favorably with autoregressive and coupling conditioners on low and high-dimensional density estimation tasks. In addition, we have illustrated the opportunity offered by this new architecture to introduce domain knowledge into normalizing flows by combining convolutional neural networks with it. Finally, we believe that graphical conditioners could be of very strong interest to develop further research at the intersection of normalizing flows and causal networks.
In the context of density estimation and generative models, graphical normalizing flows have the potential to improve the way scientists may enforce domain assumptions through explicit and prescribed independencies. We believe they will also be helpful in uncovering and reasoning about independencies, or the lack thereof, found in data. More broadly, generative models carry a risk of being abused for the generation of fake data. This danger also applies to normalizing flows, including graphical normalizing flows.
The authors would like to thank Johann Brehmer and Louis Wehenkel for proofreading the manuscript. Antoine Wehenkel is a research fellow of the F.R.S.-FNRS (Belgium) and acknowledges its financial support. Gilles Louppe is recipient of the ULiège - NRB Chair on Big data and is thankful for the support of NRB.
The method is computed as described by Eq. (7). The method performs a backward pass and an optimization step with the chosen optimizer (Adam in our experiments). The post-processing consists in thresholding the values in $A$ such that the values below a certain threshold are set to $0$ and the other values to $1$; after post-processing, the stochastic door is deactivated. The threshold is the smallest real value that makes the equivalent graph acyclic. In addition, when the value of $d$ is large, such as in the MNIST experiments, the value of the constraint is approximated by where is increased to along the optimization. The combination of and helps avoid the explosion of . The update of the Lagrangian and penalty coefficients is performed at intervals separated by a fixed number of epochs and as described by Yu et al. (2019) in equations (14) to (16).
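The thresholding step can be sketched as a search over candidate thresholds, keeping the smallest one for which the binarized graph is acyclic. The code below is an illustration under stated assumptions: entries strictly above the threshold are kept, acyclicity is tested with Kahn's algorithm, and the function names and the example matrix are hypothetical.

```python
def is_acyclic(B):
    """Kahn's algorithm on a binary matrix where B[i][j] == 1 encodes the edge j -> i."""
    d = len(B)
    indegree = [sum(row) for row in B]
    ready = [i for i in range(d) if indegree[i] == 0]
    visited = 0
    while ready:
        j = ready.pop()
        visited += 1
        for i in range(d):
            if B[i][j]:
                indegree[i] -= 1
                if indegree[i] == 0:
                    ready.append(i)
    return visited == d  # every node was removed, so no cycle remains

def binarize(A, threshold):
    """Set entries whose magnitude is at most the threshold to 0 and the others to 1."""
    return [[1 if abs(a) > threshold else 0 for a in row] for row in A]

def smallest_acyclic_threshold(A):
    """Try thresholds in increasing order and return the first acyclic binarization."""
    candidates = sorted({0.0} | {abs(a) for row in A for a in row})
    for threshold in candidates:
        B = binarize(A, threshold)
        if is_acyclic(B):
            return threshold, B

# Learned weights containing a 2-cycle (weights 0.1 and 0.9) and a 3-cycle.
A = [[0.0, 0.9, 0.0], [0.1, 0.0, 0.8], [0.7, 0.0, 0.0]]
threshold, A_binary = smallest_acyclic_threshold(A)
```

In this example the weakest edge of each cycle is removed, which leaves the chain formed by the two strongest edges.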
Proposition B.1. The absolute value of the determinant of the Jacobian of a normalizing flow step based on graphical conditioners is equal to the product of its diagonal terms.
Proof. A Bayesian network is a directed acyclic graph. Sedgewick and Wayne showed that every directed acyclic graph has a topological ordering, that is to say an ordering of the vertices such that the starting endpoint of every edge occurs earlier in the ordering than the ending endpoint of the edge. Let us suppose that an oracle gives us the permutation matrix $P$ that orders the components of $x$ in the topological order defined by $A$. Let us introduce the following new transformation $g_P(x_P) = P g(P^{-1} x_P)$ on the permuted vector $x_P = P x$. Thus the Jacobian of the transformation $g_P$ (with respect to $x_P$) is lower triangular with diagonal terms given by the derivative of the normalizers with respect to their input component. The determinant of such a Jacobian is equal to the product of the diagonal terms. Finally, we have
$$\left| \det J_{g_P(x_P)} \right| = \left| \det P \right| \left| \det J_{g(x)} \right| \left| \det P^{-1} \right| = \left| \det J_{g(x)} \right|,$$
because of (1) the chain rule; (2) the determinant of a product is equal to the product of the determinants; (3) the determinant of a permutation matrix is equal to $1$ or $-1$. The absolute value of the determinant of the Jacobian of $g$ is equal to the absolute value of the determinant of the Jacobian of $g_P$, the latter given by the product of its diagonal terms, which are the same as the diagonal terms of the Jacobian of $g$. Thus the absolute value of the determinant of the Jacobian of a normalizing flow step based on graphical conditioners is equal to the product of its diagonal terms. ∎
Table 4 provides the hyper-parameters used to train the normalizing flows for the tabular density estimation tasks. In our experiments we parameterize the functions $h^i$ with a unique neural network that takes a one-hot encoded version of $i$ in addition to its expected input $x \odot A_{i,:}$. The embedding net architecture corresponds to the network that computes an embedding of the conditioning variables for the coupling and DAG conditioners; for the autoregressive conditioner it corresponds to the architecture of the masked autoregressive network. The output size of this network is equal to $2$ ($2 \times d$ for the autoregressive conditioner) when combined with an affine normalizer and to a hyper-parameter named embedding size when combined with a UMNN. The number of dual steps corresponds to the number of epochs between two updates of the DAGness constraint (performed as in Yu et al. (2019)).
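Table 4: Hyper-parameters for the tabular density estimation tasks (only the row label "N° steps Dual Update" is recoverable).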
In addition, in all our experiments (tabular and MNIST) the integrand networks used to model the monotonic transformations have their parameters shared and receive an additional input that one hot encodes the index of the transformed variable. The models are trained until no improvement of the average log-likelihood on the validation set is observed for 10 consecutive epochs.
For all experiments the batch size was , the learning rate , the weight decay . For the graphical conditioners the number of epochs between two coefficient updates was chosen to , the greater this number the better were the performance however the longer is the optimization. The CNN is made of 2 layers of 16 convolutions with kernels followed by an MLP with two hidden layers of size and . The neural network used for the Coupling and the autoregressive conditioner are neural networks with hidden layers. For all experiments with a monotonic normalizer the size of the embedding was chosen to and the integral net was made of 3 hidden layers of size . The models are trained until no improvements of the average log-likelihood on the validation set is observed for 10 consecutive epochs.