## 1 Introduction

Probability density distribution (pdf) and conditional pdf estimations are topics of general interests, and also lie at the heart of many statistic learning and inference tasks. Traditional density estimation methods include histogram, kernel density estimation, orthogonal series density estimators, and finite mixture models (FMM) [1, 2, 3, 4]. These methods work well for data with low dimensions, but generally have difficulties when applied to data of high dimensions, i.e., the curse of dimensionality. Neural networks have achieve successes in many fields, and a few pioneering work examples of applying neural network to density estimations are independent component analysis (ICA) [5] and Gaussianization [6]. Recently, there is a surge of neural network based density estimators. Many general neural network density estimators are based on autoregressive models, typically combining normalizing flow, to name a few, masked autoencoder for distribution estimation (MADE) [7], masked autoregressive flow (MAF) for density estimation [8], neural autoregressive flows (NAF) [9], block neural autoregressive flow (B-NAF) [10], unconstrained monotonic neural network (UMNN) [11]. The Jacobian of an autoregressive neural network is a triangular matrix. This property makes it suitable for density estimation as the determinant of transformation is just the product of the diagonals of Jacobian. Neural network density estimators can be designed for data with specific formats as well. For example, non-linear independent components estimation (NICE) [12] and its successor, real NVP (Real NVP) [13], are mainly targeted for images, where the correlations among neighboring pixels are hard-coded into the models. The neural ordinary differential equation (NODE) [14] and its computationally cheaper version, free-form Jacobian of reversible dynamics (FFJORD) [16], provide yet another optimal transport based method for density estimation. These models and UMNN are different from traditional neural networks in the sense that they rely on numerical methods in both the train and inference phases. Lastly, it is possible to combine several heterogeneous models to have hybrid, potentially more powerful, density estimators, e.g., the transformation autoregressive networks (TAN) [15].

Here, we call a feedforward neural network triangular network when its connection weights form a triangular block matrix. We start from a simple two layer triangular network unit, and show that it is a universal approximator for invertible transformations with triangular Jacobian. Then, deep neural networks constructed by properly cascading such triangular network units and permutations are proposed as universal density estimators. Our method is closely related to NAF [9], and extends the work of B-NAF [10]. However, our work is novel in several ways. As B-NAF [10] extends NAF [9] resulting in more parameter economy models, our model extends B-NAF [10] and is more compact than B-NAF. Unlike B-NAF [10], it does not suffer from memory explosion when applying to data with large dimensions. Actually, we are the first to apply such methods to image density estimation benchmarks with dimensions up to thousands, and obtain state-of-the-art performance in the category of general density estimators. Furthermore, our design is highly modular and regular. It is trivial to stack our triangular network unit with other invertible components to result in potentially more powerful non-autoregressive model. This could be helpful as the autoregressive model is not necessarily the most parameter economy one for every density estimation task.

## 2 Background

Here, we repeat a few well known facts to make our paper complete. This also serves to introduce our notations.

### 2.1 Probability density functions and change of variables

We consider two continuous random variable

and connected by a smooth invertible mappingwith parameter vector

as(1) |

The probability density functions of

and , and respectively, are related by(2) |

where is called the Jacobian matrix, and takes the absolute value of the determinant of a square matrix. Note that (2) is true only for invertible mappings.

Relationship (2), also referred to as normalizing flow in some work, is useful for density estimation and data generation. Assume that is unknown, but samples drawn from it are given. For and with proper forms, we can approximate by maximizing the empirical likelihood given by the right side of (2) with respect to on samples drawn from . On the other hand, the inverse mapping of , i.e., , provides a mean for sampling from indirectly. It is common to model as neural networks due to their expressiveness.

### 2.2 Autoregressive model for density estimation

Assume that , where is a positive integer. It is always feasible to factorize recursively as

(3) |

Correspondingly, it is possible to map each vector to successively for all such that is factorized in the same fashion. Needless to say, all these mappings should be invertible. Together, they define an autoregressive model. Noting that for all , the Jacobian is a lower triangular matrix. This facilitates the calculation of its determinant, i.e.,

(4) |

Otherwise, calculating the determinant of an arbitrary dense matrix has complexity . Hence, autoregressive model is one of the preferred choices for general density estimations with large .

### 2.3 Neural networks with triangular block matrices as weights

We consider an layer feedforward neural network defines by

(5) |

where is a positive integer, and are the network inputs and outputs respectively, and are the weights and bias respectively, and is an element-wise mapping, for example, for and the identity mapping for . All these weight matrices are either lower or upper triangular block matrices, and without loss of generality, lower ones here. Each weight matrix is partitioned into blocks properly such that is a valid matrix multiplication operation for all and , where is the th block of . As these are lower triangular matrices, we have for all and . Then, Jacobian is a lower triangular block matrix for all as well, and its th diagonal block is given by

(6) |

where is a diagonal matrix taking the element-wise derivatives of as its diagonals, and is its th diagonal block defined in a proper way such that the matrices multiplications in (6) are valid. Specifically, when and have the same dimension, Jacobian reduces to a lower triangular matrix, and its determinant is simply given by

(7) |

## 3 Monotonic triangular network

### 3.1 Monotonic triangular network unit

We consider a special two layer feedforward neural network defined as

(8) |

where both random variables and have the same dimension , and are two lower triangular block matrices, and set collects all the trainable parameters. Clearly, Jacobian is a lower triangular matrix as well. To simplify the notations, we assume that all the blocks in have the same size, and the same with . Then, the block sizes of and can only be and , respectively, where is a positive integer. Let the th diagonal blocks of and be and , respectively. By (6), we have

(9) |

where is the derivative of the th nonlinearity in . To simplify the notations, we do not explicitly show ’s dependence on , and . An interesting observation is that when is monotonically increasing, and and have the same or opposite sign for all , will always be either positive or negative. Then, must be monotonic with respect to . This observation is true for all . Hence, (8) defines a monotonic network when is element-wisely monotonically increasing, and each group of and for have the same or opposite sign. The following proposition states that such networks can approximate any invertible mappings with triangular Jacobians. The proof should be straightforward. We outline it here to have a better understanding of the assumptions required by this proposition.

*Proposition 1:* The neural network defined in (8) can approximate any invertible mapping with lower triangular Jacobians arbitrarily well with assumptions: A1) is sufficiently large; A2) each group of and , , have the same or opposite sign for all ; A3) all the nonlinearities in are monotonically increasing, and lower and upper bounded.

*Proof:* The universal approximation theorem ensures that without imposing the sign constraints on the block diagonals of and , the network defined in (8) can approximate any mapping with triangular Jacobian. Hence, we only need to show that can approximate any mapping monotonic with respect to , or equivalently, can approximate any scalar function of whose outputs always have the same sign. Let us rewrite and explicitly as below

(10) | |||||

(11) |

where , , are scalar functions of introduced after reparameterization. Without loss of generality, we assume that for all , and are positive, and is bounded as and . Then, for sufficiently large , we have

(12) |

where is the Dirac delta function. Note that any scalar mapping can be rewritten as

(13) |

Thus, with sufficiently large and , the sum on the right side of (11) can approximate any such scalar mapping based on (12), (13) and the fact that can approximate any value due to the universal approximation theorem. Note that for all . These constraints are consistent with (13) since for all

Let us check each assumption used in the above proof closely. Assumption A1) is quite standard. Assumption A2) arises naturally due to (13). Assumption A3) is used in (12). Intuitively, we see that the right side of (9) is a weighted sum of the shifted and rescaled versions of the derivative of nonlinearity

. Under mild conditions, these shifted and rescaled derivatives form an over complete base for function approximation, as illustrated in Fig. 1 for the case with dimension 1. However, unbounded nonlinearities are widely used in today’s neural networks, e.g., the rectified linear unit (ReLU). Unfortunately, it is not difficult to show that such nonlinearities do not work here. The derivative of ReLU is monotonic. Hence, derivative of its weighted sum is monotonic as well as long as the weights have the same sign. But, not all monotonic functions have monotonic derivatives. Thus, any nonlinearities with monotonic derivatives cannot be used here.

### 3.2 Unconstrained monotonic network for density estimation

For density estimation, it is sufficient to constrain and to be positive for all and . It could be inconvenient to keep all these positiveness constraints on the block diagonals of and . But, it is trivial to reparameterize them to drop off these constraints. In our implementations, we choose reparameterization

(14) |

where

is the sigmoid function, and

andare free to take any real values. Hyperparameter

can be viewed as the order of our monotonic triangular network unit. Note that memory footprint of a unit is proportional to . Regarding , commonly used squash functions, e.g., the sigmoid and tanh, will work since they are monotonic and bounded. Unbounded nonlinearities, e,g., ReLU, should be avoided per our above discussions.It is a common practice to choose as a standard normal density. Then, given samples from an unknown distribution , we can estimate it in the form of (2) by minimizing expectation of the following negative-logarithm-likelihood (NLL)

(15) |

with respect to parameters . On the other hand, a monotonic network unit can be reversed in a way similar to backward substitution for solving linear triangular equations, i.e., solving for when are ready. As is monotonic with respect to , simple root finding methods like bisection could be sufficient for solving for given and .

Lastly, deep neural networks obtained by stacking several monotonic triangular network units are still monotonic, have triangular Jacobians, and can be inverted with a backward substitution like routine as well. Such networks might look superficially similar to the B-NAF model in [10]. But, there are a few key differences. First, we do not apply any nonlinear mapping to the outputs of each monotonic triangular network unit. Intuitively, squashing will deviate outputs of a unit away from normal distribution, which conflicts with our aim. Second, memory footprint of each unit is proportional to its order

. This avoids the memory explosion issue as quadratic memory consumption growth never happens in our design. Third, each unit itself defines a independent monotonic mapping. This give us the flexibility to combine it with other invertible mappings to have potentially more powerful models, as shown in the next section.## 4 Practical triangular network for density estimation and data generation

Here, we prescribe a few modifications to the monotonic triangular networks in Section 3 to enhance its usefulness in practice.

### 4.1 Non-autoregressive model by inserting permutations

Although a large enough monotonic triangular network suffices to universal density estimation as shown in the autoregressive model (3), it is unlikely the case that autoregressive model is parameter economy for all density estimation tasks. In general, we can obtain deep reversible non-autoregressive models by cascading monotonic triangular network units and permutations as below

where each MonoTriNetU and P denote an independent monotonic triangular network unit and permutation unit, respectively. Due to these inserted permutations, the Jacobian is no longer ensured to be triangular. However, it is still convenient to calculate its determinant as

(16) |

where we have used the fact that the determinant of a permutation matrix is . Different from a monotonic triangular network, a one-pass backward substitution like inversion no longer works here. Instead, we need inverse each permutation and monotonic triangular network unit step by step, down from the top layer to the bottom layer.

Let us check the network

(17) |

to see why permutation could help, where Flip is the permutation that reverses the order of a vector. Mathematically, we have , where is the unit anti-diagonal matrix, and and are two independent monotonic triangular network units. Its Jacobian is

(18) |

where is a lower triangular matrix, and is an upper triangular one. We know that any constant square matrix can be factorized as the product of an upper triangular matrix and a lower one. Similarly, we expect that the upper-lower decomposition form in (18) can approximate the Jacobians from a larger family of invertible mappings. Although we are not clear on the theoretical capacities of the network in (17), our experiences suggest that a simple flipping permutation could help for certain tasks.

### 4.2 Bijective network with a ‘log’ nonlinearity for data generation

It is worthy to point out that with a finite and bounded nonlinearities, (8) only defines an injective mapping on . Specifically, with nonlinearity ‘tanh’, we have . Thus, when the target distribution is a standard normal one, it is not possible to invert all samples drawn from . Although the probability measure of such non-invertible samples could be negligibly small for good enough density models, this brings inconveniences for applications like data generation. Here, we propose to replace a bounded nonlinearity with a ‘log’ nonlinearity, , when invertibility is a must. This ‘log’ nonlinearity is neither lower nor upper bounded. It is easy to show that the networks in Section 4.1 become bijective on after adopting this ‘log’ nonlinearity. Our experiences suggests that the ‘log’ nonlinearity behaves similar to a squash one since its output amplitudes grow slowly enough with respect to its inputs. It even could outperform the tanh nonlinearity in negative-logarithm-likelihood performance index when is too small and is the standard normal density. This is not astonishing since the sum of a few hard bounded terms cannot fit well with the tails of a normal distribution.

### 4.3 Compact model storage

Two strategies are used to save our neural networks compactly. First, instead of saving the two triangular matrices in (8) separately, we save most of their elements compactly in a dense matrix. Let , where is the block diagonals of . Then, we can put in the upper triangular part of to save memory. The block diagonals of is saved as a separate vector. Second, we generate the masks for exacting and from the dense matrix they are nested in on the fly. Note that , and thus the masks, can be inferred from the sizes of and . Thus, there is no need to save them in advance. These two strategies reduce the model memory footprint to about one fourth of value required by naively saving , and their masks in separated matrices. Clearly, it is the design regularity of our monotonic triangular network unit to make such significant storage saving possible.

### 4.4 Linear autoregressive input normalization

Lastly, it is known that input normalization generally helps to facilitate training. We propose to use a linear autoregressive model to preprocess the input as , where and are the (empirical) mean and covariance matrix of , respectively, and is a lower triangular matrix given by Cholesky decomposition . Unlike the more popular principle component analysis (PCA) based whitening normalization preprocessing, this one works seamlessly with our neural network since itself is a simple autoregressive unit. The preprocessing parameters, and , have no need to be exact, nor need to be adapted during the training, since they could be absorbed into the input layer of the first monotonic triangular network unit as and . Indeed, They will not show up in the trained and finalized model coefficients after being absorbed.

## 5 Related work

Our monotonic triangular networks are most related to NAF [9] and B-NAF [10]. All are neural autoregressive models with triangular Jacobians, and are designed to be universal density estimators. But, ours are built on a specific monotonic triangular network unit, and are more modular, compact and parameter economy. Unlike B-NAF [10], we exclude any intermediate layer with memory footprint grows quadratically with respect to its size. The memory consumption of a unit is , possibly the best we could achieve without assuming any correlation structures on the data. Our monotonic triangular network unit also provides a basic building block for constructing deep, not necessarily autoregressive, revertible models. It could be used together with other heterogeneous transformations to have hybrid models, which hopefully could be more powerful than single architecture ones, as demonstrated in TAN [15].

More generally, our work is related to those autoregressive models for density estimation. But, different from many such models, ours are shown to be universal density estimators. Let us take MAF [8], which can be viewed as a generalization of Real NVP [13], as the example to have a comparison study. In our notations, MAF defines the monotonic mapping from to as

(19) |

where and are two scalar neural networks. From (19), one can see that this model is especially attractive for data generation, while our model does not provide a closed-form solution solving for from and . However, its does not depend on . Thus, it might have difficulty in approximating monotonic mapping whose depends on .

We notice that quite a few universal density estimators, e.g., UMNN [11], NODE [14], and FFJORD [16], rely on numerical methods to calculate the transformations and their Jacobians. These models could perform much better than many classic closed-form neural network models, but are typically computationally expensive, and their results might be inexact, depending on the underlying ODE solver or integrator, and the numerical conditioning, e.g., Lipshitz constant, of the problem at hand. Our invertible neural networks have the same theoretical capacities for density estimation as these models, but tends to be neater and more lightly weighted.

Lastly, we would like to point out that our monotonic triangular network unit resembles a mini-autoencoder. From (8), we see that the input (encoding) layer maps onto a feature space with higher dimension such that the nonlinear dependence among inputs could be more easily captured. The output (decoding) layer tries to assemble those features into independent components. When several such units are stacked, each unit’s output layer serves as a bottleneck layer to prevent memory explosion.

## 6 Experimental Results

We have implemented the above described methods in Pytorch. Please visit

https://github.com/lixilinx/TriNet4PdfEst to obtain the code for reproducing the results reported below.### 6.1 Toy demo comparing nonlinearities

The considered two dimensional density is defined by

(20) |

We have trained three invertible triangular networks using different nonlinearities. Each consists of two cascaded monotonic units with the same block size and followed with flip permutations. Fig. 2 shows the estimated densities from models with nonlinearities , and

, respectively. The ‘log-sigmoid’ nonlinearity can be viewed as a soft version of ReLU. As discussed in Section 3.1, it should not be used for density estimation. Both the ‘log’ and tanh nonlinearities successfully produce the ring structure. Still, tanh performs better here.

### 6.2 Density estimation

We consider two benchmarks, MNIST and CIFAR-10 image density estimations, here as they are challenging and their baseline results are readily available. The label information is discarded. The pixel values in space are first dequantized by adding uniform noise , then rescaled to range

, and finally transfer to the logit space with mapping

, where andfor MNIST and CIFAR-10, respectively. We estimate the density in the logit space, but report the bits-per-dimension performance index for comparison as it is more popular than the negative-logarithm-likelihood one. One should not confuse the ‘bit’ here with the ‘bit’ of discrete random variables. The bits-per-dimension index is the normalized negative-

-likelihood of the original images in space .We have trained invertible triangular networks consists of four cascaded monotonic units with different settings. Block sizes are and for MNIST and CIFAR-10, respectively. Batch size is in both tasks. All models are trained with Adam starting from initial step size . We reduce the step size by one order of magnitude when no performance improvement is observed on the validation set. The train, validation and test sets are defined following the ways in [8]. We find that our model could seriously overfit the CIFAR-10 dataset even with early stopping and such a small (about bits-per-dimension gap between train and validation set). We do not try model regularization techniques like weight decaying. Instead, we simply augment the train set by randomly and circularly shifting the images up to pixels both horizontally and vertically.

Detailed negative-logarithm-likelihood and bits-per-dimension indices on train, validation and test sets from different models are summarized in Appendix A. Table 1 summarizes the test results for comparison. Note that only those general density estimators are compared here. Performances of the general version of real NVP is from [8]. We do not insert flip permutation in the models for MNIST density estimation since autoregressive models make more sense for this task (most people write digits from the top left corner down to the right bottom corner). From Tabel 1, we see that our models outperform their competitors by great margins, especially on the CIFAR-10 task. The ‘log’ nonlinearity significantly outperforms the tanh one in the CIFAR-10 task, perhaps due to a small .

### 6.3 Data generation

Randomly generated handwritten digits and image patches drawn from the best MNIST and CIFAR-10 density models are posted in Appendix B. Most digit samples are recognizable, although less generated image patches make sense to a human subject.

MNIST | CIFAR-10 | |

Best of MADE [7] | ||

Best of RealNVP [13] | ||

Best of MAF [8] | ||

TAN [15] | ||

UMNN [11] | ||

TriNet, tanh | ||

TriNet, log |

## 7 Conclusions

A parameter economy triangular neural network unit is designed for approximating arbitrary monotonic mappings with triangular Jacobians. Then, invertible triangular neural networks consisting of properly stacked such monotonic triangular neural network units and permutations are proposed for universal density estimations and data generations. The resultant models are compact, highly modular, readily invertible, and applicable to data with large dimensions. Experimental results on real world density estimation benchmarks have confirmed their superior performance.

## References

[1] B. W. Silverman, *Density Estimation for Statistics and Data
Analysis*. London: Chapman and Hall, 1986.

[2] J. E. Kolassa, *Series Approximation Methods in Statistics*,
3rd Edition. New York: Springer, 2006.

[3] A. J. Izenman, Recent developments in nonparametric density estimation. J. Amer. Statist. Assoc., vol. 86, pp. 205–225, 1991.

[4] G. J. McLachlan, *Finite Mixture Models*, Wiley, 2000.

[5] H. Aapo and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, no. 4–5, vol. 13, pp. 411–430, 2000.

[6] S. S. Chen and R. A. Gopinath. Gaussianization. In Proceedings of NIPS, 2000.

[7] M. Germain, K. Gregor, I. Murray, and H. Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.

[8] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.

[9] C. W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. In International Conference on Machine Learning, pages 2083–2092, 2018.

[10] N. De Cao, I. Titov, and W. Aziz. Block neural autoregressive flow. arXiv preprint, arXiv:1904.04676, 2019.

[11] A. Wehenkel, and G. Louppe, Unconstrained monotonic neural networks. In Conference on Neural Information Processing Systems, Vancouver, Canada, 2019.

[12] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. In International Conference on Learning Representations (ICLR), 2015.

[13] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

[14] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.

[15] J. Oliva, A. Dubey, M. Zaheer, B. Poczos, R. Salakhutdinov, E. Xing, and J. Schneider. Transformation autoregressive networks. In International Conference on Machine Learning, pp. 3895–3904, 2018.

[16] W. Grathwohl, R. T. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free form continuous dynamics for scalable reversible generative models. In International Conference on Machine Learning, 2018.

## Appendix A: further density estimation results

MNIST | CIFAR-10 | |||||
---|---|---|---|---|---|---|

Train | Validation | Test | Train | Validation | Test | |

w/o aug., w/o flip, tanh | ||||||

w/o aug., w/o flip, log | ||||||

w/o aug., w flip, tanh | ||||||

w/o aug., w flip, log | ||||||

w aug., w/o flip, tanh | ||||||

w aug., w/o flip, log | ||||||

w aug., w flip, tanh | ||||||

w aug., w flip, log |

Optional data augmentation: randomly and circularly shift the image up to pixels both horizontally and vertically.

Optional permutation: flipping the outputs of each monotonic triangular network unit.

Nonlinearity: either tanh or log ().