1 Introduction
Latent generative models are one of the prevailing approaches for building expressive and tractable generative models. The generative process for a sample x can be expressed as
x = g(z),
where z ∼ p(z) is a noise vector and g a parametric generator network (typically a deep neural network). This paradigm has several incarnations, including
variational autoencoders
(Kingma & Welling, 2014; Rezende et al., 2014), generative adversarial networks (Goodfellow et al., 2014), and flow based models (Baird et al., 2005; Tabak & Turner, 2013; Dinh et al., 2015, 2017; Kingma & Dhariwal, 2018; Chen et al., 2018; Grathwohl et al., 2019). The training process and model architecture of many existing latent generative models, and of all published flow based models, assume a unimodal smooth distribution over the latent variables. Since the generator is parametrized as a neural network, the mapping from noise to sample is a continuous function. This imposed structure makes it challenging to model data distributions with discrete structure – for instance, multimodal distributions, distributions with holes, distributions with discrete symmetries, or distributions that lie on a union of manifolds (as may approximately be true for natural images, see Tenenbaum et al., 2000). Indeed, such cases require the model to learn a generator whose input Jacobian has highly varying or infinite magnitude in order to separate the initial noise source into different clusters. Such variations imply a challenging optimization problem due to large changes in curvature. This shortcoming can be critical, as several problems of interest are hypothesized to follow a clustering structure, i.e. their distribution is concentrated along several disjoint connected sets (Eghbalzadeh et al., 2018).
A standard way to address this issue has been to use mixture models (Yeung et al., 2017; Richardson & Weiss, 2018; Eghbalzadeh et al., 2018) or structured priors (Johnson et al., 2016). In order to efficiently parametrize the model, mixture models are often formulated as discrete latent variable models (Hinton & Salakhutdinov, 2006; Courville et al., 2011; Mnih & Gregor, 2014; van den Oord et al., 2017), some of which can be expressed as deep mixture models (Tang et al., 2012; Van den Oord & Schrauwen, 2014; van den Oord & Dambre, 2015). Although the number of mixture components in deep mixture models grows exponentially with depth, which is an advantage in terms of expressivity, it is an impediment to inference, evaluation, and training of such models, often requiring the use of approximate methods such as hard EM or variational inference (Neal & Hinton, 1998).
In this paper we combine piecewise invertible functions with discrete auxiliary variables, which select which invertible piece applies, to describe a deep mixture model. This framework enables a probabilistic model’s latent space to have both real- and discrete-valued units, and to capture both continuous and discrete structure in the data distribution. It achieves this added capability while preserving the exact inference, exact sampling, exact evaluation of loglikelihood, and efficient training that make standard flow based models desirable.
2 Model definition
We aim to learn a parametrized distribution on a continuous input domain by maximizing loglikelihood. The major obstacle to training an expressive probabilistic model is typically the efficient evaluation of this loglikelihood.
2.1 Partitioning
If we consider a mixture model with a large number K of components, the likelihood takes the form
p(x) = Σ_{k=1}^{K} p(x | k) p(k).
In general, evaluating this likelihood requires computing the probabilities of all K components. However, following a strategy similar to Rainforth et al. (2018), if we partition the domain X into disjoint subsets X_k such that ∪_k X_k = X and X_k ∩ X_{k′} = ∅ for k ≠ k′, constrain the support of each component p(x | k) to X_k, and define a set identification function f_K such that f_K(x) = k if and only if x ∈ X_k, we can write the likelihood as
p(x) = p(x | f_K(x)) p(f_K(x)).
This transforms the problem of summation into a search problem: finding f_K(x). This can be seen as the inferential converse of a stratified sampling strategy (Rubinstein & Kroese, 2016).
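The collapse from summation to search can be sketched as follows, for a hypothetical one-dimensional mixture with uniform components on disjoint unit intervals (the component count and weights are our own illustration, not the paper's model):

```python
import numpy as np

# Hypothetical sketch: a mixture of K = 4 components, where component k is
# uniform on the unit interval X_k = [k, k+1). Because the supports are
# disjoint, p(x) = sum_k p(x|k) p(k) collapses to p(x | f_K(x)) p(f_K(x)):
# a single term, located by search rather than by summing over components.
K = 4
weights = np.array([0.1, 0.4, 0.3, 0.2])  # p(k)

def f_K(x):
    # set identification: find the unique cell containing x
    return int(np.floor(x))

def p_x(x):
    # evaluate only the component that x belongs to
    return 1.0 * weights[f_K(x)]  # p(x|k) = 1 on each unit interval

def p_x_sum(x):
    # naive O(K) evaluation: sum over all components
    return sum((1.0 if f_K(x) == k else 0.0) * weights[k] for k in range(K))

assert abs(p_x(2.5) - p_x_sum(2.5)) < 1e-12  # both equal p(k=2) = 0.3
```

With disjoint supports, the search step replaces the sum entirely, which is what makes exact evaluation tractable even when the number of components is large.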
2.2 Change of variable formula
The proposed approach is a direct extension of flow based models (Rippel & Adams, 2013; Dinh et al., 2015, 2017; Kingma & Dhariwal, 2018). Flow based models enable loglikelihood evaluation by relying on the change of variable formula
p_X(x) = p_Z(f(x)) · |det(∂f/∂x)|,
with f a parametrized bijective function from the input domain X onto the latent domain Z, and |det(∂f/∂x)| the absolute value of the determinant of its Jacobian.
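As a minimal numerical check of this formula, consider a hypothetical affine bijection rather than a learned flow (the function and constants below are our own toy choices):

```python
import numpy as np

# Sketch of the change-of-variable formula with the bijection f(x) = 2x + 1:
# if z = f(x) and z ~ N(0, 1), then p_X(x) = p_Z(f(x)) * |df/dx|.
def p_Z(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)  # standard normal density

def f(x):
    return 2.0 * x + 1.0

def p_X(x):
    return p_Z(f(x)) * 2.0  # |df/dx| = 2 everywhere for this affine map

# Check against the analytic density of x = (z - 1)/2, i.e. N(-0.5, 0.25):
x = 0.3
analytic = np.exp(-0.5 * ((x + 0.5) / 0.5)**2) / (0.5 * np.sqrt(2 * np.pi))
assert abs(p_X(x) - analytic) < 1e-12
```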
As also proposed in Falorsi et al. (2019), we relax the constraint that f be bijective, and instead have it be surjective onto Z and piecewise invertible. That is, we require that each restriction f_{|X_k} of f to a subset X_k be an invertible function. Given a distribution p_Z on Z, we can define the following generative process:
z ∼ p_Z(z),   k ∼ p_{K|Z}(k | z),   x = f_{|X_k}^{-1}(z).
If we use the set identification function f_K associated with the partition, the distribution corresponding to this stochastic inversion can be defined by a change of variable formula
p_X(x) = p_Z(f(x)) · p_{K|Z}(f_K(x) | f(x)) · |det(∂f/∂x)|.
Because of the use of both real and discrete stochastic variables, we call this class of models Rad. The particular parametrization we use is depicted in Figure 2. We rely on piecewise invertible functions that allow us to define a mixture model of repeated symmetrical patterns, following a method of folding the input space. Note that in this instance the set identification function is implicitly defined by the piecewise function itself, as the discrete latent corresponds to the invertible component that the input falls on.
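A one-dimensional sketch of this generative process and density, using the absolute-value map as the piecewise invertible function (a toy choice of ours, not the paper's parametrization):

```python
import numpy as np

# Sketch of a Rad-style model in 1-D: f(x) = |x| is invertible on each of the
# two pieces X_0 = (-inf, 0) and X_1 = [0, inf); the discrete latent k records
# which branch a point lies on, and a gating distribution p(K|Z) picks the
# branch at sampling time (held constant at [0.3, 0.7] here for simplicity).
rng = np.random.default_rng(0)

def f(x):          # surjective onto [0, inf), piecewise invertible
    return np.abs(x)

def f_K(x):        # set identification: which invertible branch x lies on
    return 0 if x < 0 else 1

def f_inv(z, k):   # branch-indexed inverse
    return -z if k == 0 else z

def sample():
    # generative process: z ~ p_Z, k ~ p(K|Z), x = inverse branch k at z
    z = np.abs(rng.standard_normal())        # half-normal base distribution
    k = rng.choice(2, p=[0.3, 0.7])
    return f_inv(z, k)

def p_X(x):
    # Rad change of variables: p_Z(f(x)) * p(K = f_K(x) | z) * |df/dx|
    z = f(x)
    p_z = 2 * np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)  # half-normal density
    p_k = [0.3, 0.7][f_K(x)]
    return p_z * p_k * 1.0  # |df/dx| = 1 for the absolute-value map

xs = np.array([sample() for _ in range(20000)])
assert abs((xs < 0).mean() - 0.3) < 0.02  # ~30% of mass on the negative branch
```

Sampling and density evaluation remain exact: the branch index is recovered in closed form from the sample, so no marginalization over k is needed.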
So far, we have defined a mixture of components with disjoint supports. However, if we factorize the latent distribution as p_{Z,K}(z, k) = p_{K|Z}(k | z) p_Z(z), we can apply another piecewise invertible map to z so that p_Z is itself another mixture model. Recursively applying this method results in a deep mixture model (see Figure 3).
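A toy illustration of this recursion (our own sketch, not the paper's exact model): composing the branch-inverses of two folds turns two binary latent choices into 2 × 2 = 4 mixture components, so the effective number of components multiplies with depth.

```python
import numpy as np

# Deep-mixture sketch: each layer inverts one fold, choosing a branch at
# random; stacking two folds produces four modes from two binary choices.
rng = np.random.default_rng(0)

def sample():
    z = 0.5 + 0.05 * rng.standard_normal()        # base: tight mode near 0.5
    y = 2.0 + z if rng.integers(2) else 2.0 - z   # inverse branch of y -> |y - 2|
    x = y if rng.integers(2) else -y              # inverse branch of x -> |x|
    return x                                      # modes near +-1.5 and +-2.5

xs = np.array([sample() for _ in range(4000)])
centers = [-2.5, -1.5, 1.5, 2.5]
# every sample lies close to one of the four modes...
assert all(min(abs(v - c) for c in centers) < 0.3 for v in xs)
# ...and all four modes are populated
assert all((np.abs(xs - c) < 0.3).mean() > 0.1 for c in centers)
```

Each additional folding layer would multiply the mode count again, while sampling and density evaluation stay exact layer by layer.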
Another advantage of this factorization lies in the gating network, as also designated in van den Oord & Dambre (2015). It provides a more constrained but less sample-wasteful approach than rejection sampling (Bauer & Mnih, 2019) by taking into account the untransformed sample before selecting the mixture component. This allows the model to exploit the distribution in different regions in more complex ways than repeating it as a pattern, as illustrated in Figure 4.
The function from the input to the discrete variables contains discontinuities. This presents the danger of introducing discontinuities into the loglikelihood, making optimization more difficult. However, by carefully imposing boundary conditions on the gating network, we are able to exactly counteract the effect of these discontinuities and keep the loglikelihood continuous with respect to the parameters. This is discussed in detail in Appendix A.
3 Experiments
3.1 Problems
We conduct a brief comparison with Real NVP on six two-dimensional toy problems to demonstrate the potential gain in expressivity that Rad models can enable. Synthetic datasets are constructed following the manifold hypothesis and/or the clustering hypothesis. We designate these problems as: grid Gaussian mixture, ring Gaussian mixture, two moons, two circles, spiral, and many moons (see Figure 5).
3.2 Architecture
For the Rad model implementation, we use the piecewise linear activations defined in Appendix A in a coupling layer architecture (Dinh et al., 2015, 2017), where, instead of a conditional linear transformation, the conditioning variable determines the parameters of the piecewise linear activation applied to the remaining variables (see Figure 6). The gating logit neural network takes the untransformed variables as input. We compare with a Real NVP model using only affine coupling layers. In both cases the prior is a standard Gaussian distribution.
As both of these models can easily solve these generative modeling tasks approximately when provided enough capacity, we study them in a relatively low capacity regime, where we can showcase the potential expressivity Rad may provide. Each model uses six coupling layers, and each coupling layer uses a one-hidden-layer rectified network with an output activation scaled by a scalar parameter as described in Dinh et al. (2017). For Rad, the logit network also uses a one-hidden-layer rectified neural network, but with a linear output. In order to compare fairly with respect to the number of parameters, we provide Real NVP seven times more hidden units per hidden layer than Rad. For each level, the models are trained using stochastic gradient ascent with Adam (Kingma & Ba, 2015) on the loglikelihood.
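For reference, a minimal sketch of the affine coupling layer used by the Real NVP baseline (our simplification with tiny one-layer "networks"; Rad replaces the affine transform with the gated piecewise linear activation described above):

```python
import numpy as np

# Affine coupling layer sketch: y1 = x1; y2 = x2 * exp(s(x1)) + t(x1).
# The Jacobian is triangular, so its log-determinant is simply sum of s(x1),
# and the layer is invertible in closed form.
rng = np.random.default_rng(0)
W_s, b_s = rng.standard_normal((1, 1)), np.zeros(1)  # toy "network" s
W_t, b_t = rng.standard_normal((1, 1)), np.zeros(1)  # toy "network" t

def s(x1): return np.tanh(x1 @ W_s + b_s)  # bounded log-scale
def t(x1): return x1 @ W_t + b_t           # translation

def coupling_forward(x):
    x1, x2 = x[:, :1], x[:, 1:]
    y2 = x2 * np.exp(s(x1)) + t(x1)
    logdet = s(x1).sum(axis=1)             # log |det Jacobian|
    return np.concatenate([x1, y2], axis=1), logdet

def coupling_inverse(y):
    y1, y2 = y[:, :1], y[:, 1:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, x2], axis=1)

x = rng.standard_normal((5, 2))
y, _ = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)  # exactly invertible by construction
```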
3.3 Results
In each of these problems, Rad is consistently able to obtain higher loglikelihood than Real NVP.
Problem                  Rad    Real NVP
Grid Gaussian mixture
Ring Gaussian mixture
Two moons
Two circles
Spiral
Many moons
3.3.1 Sampling and Gaussianization
We plot samples (Figure 7) from the described Rad and Real NVP models trained on these problems. In the described low capacity regime, Real NVP fails by assigning probability mass to regions where the data distribution has low density. This is consistent with the mode-covering behavior of maximum likelihood. Moreover, the particular inductive bias of Real NVP is to prefer modeling the data as one connected manifold; this results in the unwanted probability mass being distributed along the space between clusters.
Flow based models often follow the principle of Gaussianization (Chen & Gopinath, 2001), i.e. transforming the data distribution into a Gaussian; inverting that process on a Gaussian distribution therefore approximates the data distribution. We plot in Figure 8 the inferred Gaussianized variables for both models trained on the ring Gaussian mixture problem. The Gaussianization from Real NVP leaves some areas of the standard Gaussian distribution unpopulated; these unattended areas correspond to unwanted regions of probability mass in the input space. Rad suffers significantly less from this problem.
An interesting feature is that Rad also seems to outperform Real NVP on the spiral dataset. One hypothesis is that the model successfully exploits some nonlinear symmetries in this problem.
3.3.2 Folding
We take a deeper look at the Gaussianization process involved in both models. In Figure 9 we plot the inference process from input to Gaussianized variables for both models trained on the two moons problem. As a result of a folding process similar to that in Montufar et al. (2014), several points which were far apart in the input space become neighbors in the latent space in the case of Rad.
We further explore this folding process using the visualization described in Figure 10. We verify that the nonlinear folding process induced by Rad plays at least two roles: bridging gaps in the distribution of probability mass, and exploiting symmetries in the data.
We observe that in the case of the ring Gaussian mixture (Figure 10(a)), Rad effectively uses foldings in order to bridge the different modes of the distribution into a single mode, primarily in the last layers of the transformation. We contrast this with Real NVP (Figure 10(b)) which struggles to combine these modes under the standard Gaussian distribution using bijections.
4 Conclusion
We introduced an approach to tractably evaluate and train deep mixture models using piecewise invertible maps as a folding mechanism. This allows exact inference, exact generation, and exact evaluation of loglikelihood, avoiding many issues in previous discrete latent variable models. The method can easily be combined with other flow based architectural components, allowing flow based models to better fit datasets with discrete as well as continuous structure.
References
 Baird et al. (2005) Leemon Baird, David Smalenberger, and Shawn Ingkiriwang. One-step neural network inversion with pdf learning and emulation. In International Joint Conference on Neural Networks, volume 2, pp. 966–971. IEEE, 2005.

 Bauer & Mnih (2019) Matthias Bauer and Andriy Mnih. Resampled priors for variational autoencoders. In Proceedings of the twenty-second international conference on artificial intelligence and statistics, 2019.
 Chen & Gopinath (2001) Scott Shaobing Chen and Ramesh A Gopinath. Gaussianization. In Advances in neural information processing systems, pp. 423–429, 2001.

 Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6572–6583, 2018.
 Courville et al. (2011) Aaron Courville, James Bergstra, and Yoshua Bengio. A spike and slab restricted Boltzmann machine. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 233–241, 2011.
 Dinh et al. (2015) Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. In International Conference on Learning Representations: Workshop Track, 2015.
 Dinh et al. (2017) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

 Eghbalzadeh et al. (2018) Hamid Eghbalzadeh, Werner Zellinger, and Gerhard Widmer. Mixture density generative adversarial networks. Neural Information Processing Systems: Bayesian Deep Learning Workshop, 2018.
 Falorsi et al. (2019) Luca Falorsi, Pim de Haan, Tim R. Davidson, and Patrick Forré. Reparameterizing distributions on lie groups. In Proceedings of the twenty-second international conference on artificial intelligence and statistics, 2019.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Grathwohl et al. (2018) Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.
 Grathwohl et al. (2019) Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.
 Hinton & Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2017.
 Johnson et al. (2016) Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946–2954, 2016.
 Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 Kingma & Welling (2014) Diederik P Kingma and Max Welling. Autoencoding variational bayes. In International Conference on Learning Representations, 2014.
 Kingma & Dhariwal (2018) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236–10245, 2018.

 Maddison et al. (2017) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
 Mnih & Gregor (2014) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
 Montufar et al. (2014) Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932, 2014.
 Neal & Hinton (1998) Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pp. 355–368. Springer, 1998.
 Rainforth et al. (2018) Tom Rainforth, Yuan Zhou, Xiaoyu Lu, Yee Whye Teh, Frank Wood, Hongseok Yang, and JanWillem van de Meent. Inference trees: Adaptive inference with exploration. arXiv preprint arXiv:1806.09550, 2018.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
 Richardson & Weiss (2018) Eitan Richardson and Yair Weiss. On gans and gmms. In Advances in Neural Information Processing Systems, pp. 5852–5863, 2018.
 Rippel & Adams (2013) Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013.
 Rolfe (2017) Jason Tyler Rolfe. Discrete variational autoencoders. In International Conference on Learning Representations, 2017.
 Rubinstein & Kroese (2016) Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method, volume 10. John Wiley & Sons, 2016.
 Tabak & Turner (2013) EG Tabak and Cristina V Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
 Tang et al. (2012) Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. Deep mixtures of factor analysers. In International Conference on Machine Learning, 2012.
 Tenenbaum et al. (2000) Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

 Tucker et al. (2017) George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pp. 2627–2636, 2017.
 van den Oord & Dambre (2015) Aäron van den Oord and Joni Dambre. Locally-connected transformations for deep gmms. In International Conference on Machine Learning (ICML): Deep Learning Workshop, pp. 1–8, 2015.

 Van den Oord & Schrauwen (2014) Aaron Van den Oord and Benjamin Schrauwen. Factoring variations in natural images with deep Gaussian mixture models. In Advances in Neural Information Processing Systems, pp. 3518–3526, 2014.
 van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.
 Yeung et al. (2017) Serena Yeung, Anitha Kannan, Yann Dauphin, and Li FeiFei. Tackling overpruning in variational autoencoders. arXiv preprint arXiv:1706.03643, 2017.
Appendix A Continuity
The standard approach to learning a deep probabilistic model has been stochastic gradient descent on the negative loglikelihood. Although the model enables the computation of a gradient almost everywhere, the loglikelihood is unfortunately discontinuous. Let us decompose the loglikelihood as
log p_X(x) = log p_Z(f(x)) + log p_{K|Z}(f_K(x) | f(x)) + log |∂f/∂x(x)|.
There are two sources of discontinuity in this expression: f_K is a function with discrete values (therefore discontinuous), and ∂f/∂x is discontinuous at the transitions between the subsets X_k, leading to the expression of interest
log p_{K|Z}(f_K(x) | f(x)) + log |∂f/∂x(x)|,
which takes a role similar to the log-Jacobian determinant: a pseudo log-Jacobian determinant.
Let us consider from now on the simple scalar case, with a piecewise linear function composed of invertible pieces (see Figure 12(a)).
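As an illustration of a continuous piecewise linear folding in the scalar case (a toy zig-zag map of our own choosing, not the paper's exact parametrization):

```python
import numpy as np

# Toy zig-zag folding: f maps [0, 3] onto [0, 1] and is invertible on each of
# the three pieces [0,1], [1,2], [2,3]. At the piece boundaries f is
# continuous and |df/dx| = 1 everywhere, so the pseudo log-Jacobian can only
# change through the gating term, which is where the boundary conditions of
# Appendix A come in.
def f(x):
    k = min(int(np.floor(x)), 2)           # set identification f_K(x)
    return x - k if k % 2 == 0 else k + 1 - x

def f_inv(z, k):
    # branch-indexed inverse: odd pieces are the "descending" segments
    return z + k if k % 2 == 0 else k + 1 - z

# each branch inverts exactly on its own piece
for x in [0.25, 1.25, 2.25, 0.999, 1.001]:
    k = min(int(np.floor(x)), 2)
    assert abs(f_inv(f(x), k) - x) < 1e-12

# continuity at the fold points: both sides of x = 1 map to values near 1
assert abs(f(0.999) - f(1.001)) < 0.01
```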
In this case, the inverse can be seen as a vector-valued function. We can attempt to parametrize the model such that the pseudo log-Jacobian determinant becomes continuous with respect to the parameters, by expressing a boundary condition at the transition points between pieces.
With a suitable definition of the gating parameters, this boundary condition can be enforced, together with a similar one at the other transition point, by replacing the gating function with a form that satisfies both conditions.
Another type of boundary condition can be found at the transition between the non-invertible area and the invertible area. Since enforcing it directly would lead to an infinite loss barrier, another way to enforce this boundary condition is by adding linear pieces (Figure 12(b)):
The inverse is defined as
In order to know the values of the gating probabilities at the boundaries, we can use the logit parametrization.
Given these constraints, the model can then be reliably learned through gradient descent methods. Note that the resulting tractability of the model comes from the fact that the discrete variable is only interfaced during inference with the gating distribution, unlike discrete variational autoencoder approaches (Mnih & Gregor, 2014; van den Oord et al., 2017) where it is fed to a deep neural network. Similar to Rolfe (2017), the learning of discrete variables is achieved by relying on the continuous component of the model and, as opposed to other approaches (Jang et al., 2017; Maddison et al., 2017; Grathwohl et al., 2018; Tucker et al., 2017), the extracted gradient signal is exact and in closed form.
Appendix B Inference processes
We plot the inference processes of Rad and Real NVP on the problems not shown previously: grid Gaussian mixture (Figure 14), two circles (Figure 15), two moons (Figure 16), and many moons (Figure 17). We also compare the final results of the Gaussianization processes of both models on the different toy problems in Figure 18.