A RAD approach to deep mixture models

03/18/2019, by Laurent Dinh et al., Google

Flow based models such as Real NVP are an extremely powerful approach to density estimation. However, existing flow based models are restricted to transforming continuous densities over a continuous input space into similarly continuous distributions over continuous latent variables. This makes them poorly suited for modeling and representing discrete structures in data distributions, for example class membership or discrete symmetries. To address this difficulty, we present a normalizing flow architecture which relies on domain partitioning using locally invertible functions, and possesses both real and discrete valued latent variables. This Real and Discrete (RAD) approach retains the desirable normalizing flow properties of exact sampling, exact inference, and analytically computable probabilities, while at the same time allowing simultaneous modeling of both continuous and discrete structure in a data distribution.

1 Introduction

Latent generative models are one of the prevailing approaches for building expressive and tractable generative models. The generative process for a sample x can be expressed as

x = g(z),   z ~ p_Z(z),

where z is a noise vector and g a parametric generator network (typically a deep neural network). This paradigm has several incarnations, including variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014), generative adversarial networks (Goodfellow et al., 2014), and flow based models (Baird et al., 2005; Tabak & Turner, 2013; Dinh et al., 2015, 2017; Kingma & Dhariwal, 2018; Chen et al., 2018; Grathwohl et al., 2019).
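
As a concrete, if minimal, illustration of this generative process, the sketch below samples a noise vector and pushes it through a small, arbitrarily initialized generator; the two-layer network and its random weights are placeholders for illustration, not the models studied in this paper.

```python
# Minimal sketch of the latent-variable generative process x = g(z):
# a noise vector z is pushed through a parametric generator g.
# The two-layer network and its random weights are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), np.zeros(16)   # hidden layer parameters
W2, b2 = rng.normal(size=(2, 16)), np.zeros(2)    # output layer parameters

def generator(z):
    """g(z): a small fully connected network mapping noise to a sample."""
    h = np.maximum(W1 @ z + b1, 0.0)   # ReLU hidden layer
    return W2 @ h + b2

z = rng.standard_normal(2)   # noise vector z ~ N(0, I)
x = generator(z)             # generated sample x = g(z)
```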

The training process and model architecture for many existing latent generative models, and for all published flow based models, assumes a unimodal smooth distribution over the latent variables z. Given the parametrization of g as a neural network, the mapping from z to x is a continuous function. This imposed structure makes it challenging to model data distributions with discrete structure: for instance, multi-modal distributions, distributions with holes, distributions with discrete symmetries, or distributions that lie on a union of manifolds (as may approximately be true for natural images, see Tenenbaum et al., 2000). Indeed, such cases require the model to learn a generator whose input Jacobian has highly varying or infinite magnitude in order to separate the initial noise source into different clusters. Such variations imply a challenging optimization problem due to large changes in curvature. This shortcoming can be critical, as several problems of interest are hypothesized to follow a clustering structure, i.e. the distribution is concentrated on several disjoint connected sets (Eghbal-zadeh et al., 2018).

A standard way to address this issue has been to use mixture models (Yeung et al., 2017; Richardson & Weiss, 2018; Eghbal-zadeh et al., 2018) or structured priors (Johnson et al., 2016). In order to efficiently parametrize the model, mixture models are often formulated as discrete latent variable models (Hinton & Salakhutdinov, 2006; Courville et al., 2011; Mnih & Gregor, 2014; van den Oord et al., 2017), some of which can be expressed as deep mixture models (Tang et al., 2012; Van den Oord & Schrauwen, 2014; van den Oord & Dambre, 2015). Although the exponentially growing number of mixture components with depth is an advantage of deep mixture models in terms of expressivity, it is an impediment to inference, evaluation, and training of such models, often requiring the use of approximate methods like hard EM or variational inference (Neal & Hinton, 1998).

In this paper we combine piecewise invertible functions, with discrete auxiliary variables selecting which invertible function applies, to describe a deep mixture model. This framework enables a probabilistic model’s latent space to have both real and discrete valued units, and to capture both continuous and discrete structure in the data distribution. It achieves this added capability while preserving the exact inference, exact sampling, exact evaluation of log-likelihood, and efficient training that make standard flow based models desirable.

2 Model definition

We aim to learn a parametrized distribution p_X(x) on a continuous input domain X by maximizing log-likelihood. The major obstacle to training an expressive probabilistic model is typically the efficient evaluation of this log-likelihood.

2.1 Partitioning

If we consider a mixture model with a large number of components K, the likelihood takes the form

p(x) = Σ_{k=1}^{K} p(x | k) p(k).

In general, evaluating this likelihood requires computing the probabilities of all K components. However, following a strategy similar to Rainforth et al. (2018), if we partition the domain X into K disjoint subsets X_k (k = 1, …, K) such that X = ∪_k X_k, constrain the support of each component p(· | k) to X_k, and define a set identification function f_K such that x ∈ X_{f_K(x)}, we can write the likelihood as

p(x) = p(x | f_K(x)) p(f_K(x)).

This transforms the problem of summing over components into the search problem of computing f_K(x). This can be seen as the inferential converse of a stratified sampling strategy (Rubinstein & Kroese, 2016).
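
To make the argument concrete, the following sketch checks this identity numerically in one dimension, assuming three mixture components with disjoint interval supports; the intervals, weights, and uniform components are illustrative choices, not taken from the paper.

```python
# Numerical check: with disjoint component supports, the mixture likelihood
# p(x) = sum_k p(x|k) p(k) collapses to the single term selected by f_K(x).
import numpy as np

edges = np.array([0.0, 1.0, 3.0, 6.0])    # partition X_1=[0,1), X_2=[1,3), X_3=[3,6)
weights = np.array([0.2, 0.5, 0.3])       # mixture weights p(k)

def component_pdf(x, k):
    """p(x | k): uniform density on the k-th interval, zero elsewhere."""
    lo, hi = edges[k], edges[k + 1]
    return 1.0 / (hi - lo) if lo <= x < hi else 0.0

def set_identification(x):
    """f_K(x): index of the unique interval containing x."""
    return int(np.searchsorted(edges, x, side="right")) - 1

x = 2.4
p_sum = sum(component_pdf(x, k) * weights[k] for k in range(3))   # full summation
k = set_identification(x)
p_single = component_pdf(x, k) * weights[k]                       # single selected term
assert np.isclose(p_sum, p_single)                                # both equal 0.25
```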

(a) Inference graph for a flow based model.
(b) Sampling graph for a flow based model.
(c) Inference graph for a Rad model.
(d) Sampling graph for a Rad model.
Figure 1: Stochastic computational graphs for inference and sampling in flow based models (1(a), 1(b)) and a Rad model (1(c), 1(d)). Note the dependency of k on z in 1(d). While this dependency is not necessary, we will exploit this structure, as highlighted later in the main text and in Figure 4.

2.2 Change of variable formula

The proposed approach will be a direct extension of flow based models (Rippel & Adams, 2013; Dinh et al., 2015, 2017; Kingma & Dhariwal, 2018). Flow based models enable log-likelihood evaluation by relying on the change of variable formula

p_X(x) = p_Z(f(x)) |det ∂f/∂x(x)|,

with f a parametrized bijective function from X onto Z and |det ∂f/∂x(x)| the absolute value of the determinant of its Jacobian.

As also proposed in Falorsi et al. (2019), we relax the constraint that f be bijective, and instead have it be surjective onto Z and piecewise invertible. That is, we require each restriction f_{|X_k} to be an invertible function, where f_{|X_k} denotes f restricted to the domain X_k. Given a distribution p_{Z,K}(z, k) = p_Z(z) p_{K|Z}(k | z) such that p_{K|Z}(k | z) = 0 whenever z lies outside f(X_k), we can define the following generative process:

z ~ p_Z(z),   k ~ p_{K|Z}(k | z),   x = f_{|X_k}^{-1}(z).

If we use the set identification function f_K associated with f, the distribution p_X corresponding to this stochastic inversion can be defined by a change of variable formula

p_X(x) = p_Z(f(x)) p_{K|Z}(f_K(x) | f(x)) |det ∂f/∂x(x)|.
(a) An example of a trimodal distribution p_X (sinusoidal). The different modes are colored in red, green, and blue.
(b) The resulting unimodal distribution p_Z, corresponding to the distribution of any one of the initial modes in p_X.
(c) An example of a piecewise invertible function f aiming at transforming p_X into a unimodal distribution. The red, green, and blue zones correspond to the different modes in input space.
Figure 2: Example of a trimodal distribution p_X (2(a)) turned into a unimodal distribution p_Z (2(b)) using a piecewise invertible function f (2(c)). Note that the initial distribution p_X corresponds to an unfolding of p_Z.

Because of the use of both Real and Discrete stochastic variables, we call this class of model Rad. The particular parametrization of f we use is depicted in Figure 2. We rely on piecewise invertible functions that allow us to define a mixture model of repeated symmetrical patterns, following a method of folding the input space. Note that in this instance the set identification function f_K is implicitly defined by f, as the discrete latent variable k corresponds to which invertible component of the piecewise function x falls on.
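
The following sketch instantiates this construction in the simplest scalar setting: f(x) = |x| folds the two half-lines onto [0, ∞), the discrete latent k records the sign of x, the base p_Z is a half-normal, and the gating network is a logistic function of z. These specific choices are illustrative assumptions rather than the parametrization used in the experiments; the point is only that the resulting density, computed with the change of variable formula above, is properly normalized.

```python
# Scalar Rad sketch: f(x) = |x| with pieces X_1 = (-inf, 0) and X_2 = [0, inf),
# half-normal base density p_Z on [0, inf), and a logistic gating network p_{K|Z}.
import numpy as np

def f(x):                      # piecewise invertible map (invertible on each X_k)
    return np.abs(x)

def f_K(x):                    # set identification: which piece x falls on
    return (x >= 0).astype(int)

def p_Z(z):                    # half-normal base density on [0, inf)
    return np.sqrt(2.0 / np.pi) * np.exp(-0.5 * z ** 2)

def p_K_given_Z(k, z):         # gating network: probability of piece k given z
    p1 = 1.0 / (1.0 + np.exp(-(1.5 * z - 0.5)))   # arbitrary logistic gate
    return np.where(k == 1, p1, 1.0 - p1)

def p_X(x):                    # change of variable: p_Z(f(x)) p(f_K(x)|f(x)) |f'(x)|
    return p_Z(f(x)) * p_K_given_Z(f_K(x), f(x)) * 1.0   # |f'(x)| = 1 for |.|

# Sampling: z ~ p_Z, k ~ p_{K|Z}(.|z), x = f_{|X_k}^{-1}(z) = z if k = 1 else -z.
# Sanity check: the density integrates to 1 over the real line.
xs = np.linspace(-10.0, 10.0, 200001)
print(np.sum(p_X(xs)) * (xs[1] - xs[0]))   # ~ 1.0
```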

(a) Sampling.
(b) Inference.
Figure 3: Stochastic computational graphs for sampling and inference in a deep Rad mixture model.
(a) An example of a distribution p_X (sinusoidal) with many modes.
(b) A simple absolute value function f transforms p_X into p_Z.
(c) The gating network p_{K|Z} allowing us to recover the distribution p_X from p_Z.
Figure 4: Illustration of the expressive power the gating distribution provides. By capturing the structure of the sine wave in p_{K|Z}, the function f can take on an extremely simple form, here the absolute value function shown in 4(b).

So far, we have defined a mixture of components with disjoint support. However, if we factorize p_{Z,K}(z, k) as p_Z(z) p_{K|Z}(k | z), we can apply another piecewise invertible map to z to define p_Z itself as another mixture model. Recursively applying this method results in a deep mixture model (see Figure 3).
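
A hypothetical two-level version of the earlier scalar sketch is given below: the first layer folds the real line at 0, and the second folds the resulting half-line [0, ∞) at an arbitrary point c. One subtlety, consistent with the support constraint above, is that for outputs only the second piece of the second fold can produce (z > c), the gate probability of the first piece must vanish. The particular gates and base density are again illustrative assumptions; the check at the end confirms the stacked density still normalizes.

```python
# Two stacked Rad layers on scalars (illustrative): layer 1 folds R at 0,
# layer 2 folds [0, inf) at c. Together they index 2 x 2 = 4 mixture components.
import numpy as np

c = 1.0   # arbitrary fold location for the second layer

def base_pdf(z):                               # half-normal base density on [0, inf)
    return np.sqrt(2.0 / np.pi) * np.exp(-0.5 * z ** 2)

def gate1(k, z):                               # layer-1 gate: any distribution over {0, 1}
    p1 = 1.0 / (1.0 + np.exp(-(1.5 * z - 0.5)))
    return np.where(k == 1, p1, 1.0 - p1)

def gate2(k, z):                               # layer-2 gate: piece 0 only produces z <= c,
    p0 = 0.4 * np.clip(1.0 - z / c, 0.0, 1.0)  # so its probability must vanish for z > c
    return np.where(k == 0, p0, 1.0 - p0)

def p_X(x):
    z1, k1 = np.abs(x), (x >= 0).astype(int)           # layer 1: fold at 0
    z2, k2 = np.abs(z1 - c), (z1 >= c).astype(int)     # layer 2: fold at c
    # Both folds have |slope| = 1, so the Jacobian factor is 1 throughout.
    return base_pdf(z2) * gate2(k2, z2) * gate1(k1, z1)

xs = np.linspace(-12.0, 12.0, 400001)
print(np.sum(p_X(xs)) * (xs[1] - xs[0]))   # ~ 1.0: a valid 4-component deep mixture
```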

Another advantage of such a factorization is in the gating network p_{K|Z}, as it is termed in van den Oord & Dambre (2015). It provides a more constrained but less sample-wasteful approach than rejection sampling (Bauer & Mnih, 2019) by taking into account the untransformed sample z before selecting the mixture component k. This allows the model to exploit the distribution p_Z in different regions in more complex ways than repeating it as a pattern, as illustrated in Figure 4.

The function f_K from the input x to the discrete variable k contains discontinuities. This presents the danger of introducing discontinuities into log p_X(x), making optimization more difficult. However, by carefully imposing boundary conditions on the gating network, we are able to exactly counteract the effect of these discontinuities, and cause log p_X(x) to remain continuous with respect to the parameters. This is discussed in detail in Appendix A.

3 Experiments

3.1 Problems

We conduct a brief comparison with Real NVP on six two-dimensional toy problems to demonstrate the potential gain in expressivity that Rad models can enable. Synthetic datasets are constructed following the manifold hypothesis and/or the clustering hypothesis. We designate these problems as: grid Gaussian mixture, ring Gaussian mixture, two moons, two circles, spiral, and many moons (see Figure 5).
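
As an example of how such synthetic data can be generated, the snippet below draws samples from a ring of Gaussians; the number of modes, radius, and noise scale are arbitrary illustrative values rather than the exact settings used for the experiments.

```python
# Illustrative generator for a "ring Gaussian mixture" style toy dataset:
# cluster centers are spread evenly on a circle and isotropic Gaussian noise
# is added around each center.
import numpy as np

def ring_gaussian_mixture(n_samples, n_modes=8, radius=4.0, scale=0.3, seed=0):
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * rng.integers(n_modes, size=n_samples) / n_modes
    centers = np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)
    return centers + scale * rng.standard_normal((n_samples, 2))

data = ring_gaussian_mixture(10000)
print(data.shape)   # (10000, 2)
```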

(a) Grid Gaussian mixture. This problem follows the clustering hypothesis.
(b) Ring Gaussian mixture. This problem also follows the clustering hypothesis but the clusters are not axis-aligned.
(c) Two moons. This problem not only follows the clustering hypothesis but also the manifold hypothesis.
(d) Two circles. This problem also follows both the clustering and manifold hypotheses. A continuous bijection cannot linearly separate these two clusters.
(e) Spiral. This problem follows only the manifold hypothesis.
(f) Many moons. This problem follows both the clustering and manifold hypotheses, with many clusters.
Figure 5: Samples drawn from the data distribution in each of several toy two dimensional problems.
(a) Forward pass.
(b) Inversion graph.
Figure 6: Computational graph of the coupling layers used in the experiments.

3.2 Architecture

For the Rad model implementation, we use the piecewise linear activations defined in Appendix A inside a coupling layer architecture (Dinh et al., 2015, 2017): instead of a conditional affine transformation, the conditioning variable x_1 determines the parameters of the piecewise linear activation applied to x_2 in order to obtain z_2 and k, with z_1 = x_1 (see Figure 6). For the gating network p_{K|Z}, the gating logit neural network takes z as input. We compare with a Real NVP model using only affine coupling layers. In both models, p_Z is a standard Gaussian distribution.

As both of these models can easily solve these generative modeling tasks approximately when provided enough capacity, we study them in a relatively low capacity regime, where we can showcase the potential expressivity Rad may provide. Each model uses six coupling layers, and each coupling layer uses a one-hidden-layer rectified network with a tanh output activation scaled by a scalar parameter, as described in Dinh et al. (2017). For Rad, the logit network also uses a one-hidden-layer rectified neural network, but with a linear output. In order to compare fairly with respect to the number of parameters, we provide Real NVP with seven times more hidden units per hidden layer than Rad. For each problem, the models are trained by stochastic gradient ascent on the log-likelihood using Adam (Kingma & Ba, 2015).
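
The sketch below gives a simplified two-dimensional stand-in for this kind of coupling layer. In place of the multi-piece piecewise linear activation of Appendix A, it uses a single fold whose location is produced by a small conditioner on x_1, and a gating logit network that takes z as input; the base density is a standard Gaussian on z_1 and a half-normal on the folded coordinate z_2. All architectural details (network sizes, the single-fold activation, the half-normal base) are simplifications for illustration rather than the exact experimental setup.

```python
# Simplified Rad-style coupling layer on (x1, x2): x1 passes through unchanged,
# x2 is folded at a location c(x1) predicted from x1, and a gating network
# evaluated on z = (z1, z2) scores which side of the fold x2 came from.
import numpy as np

rng = np.random.default_rng(1)
Wc = rng.normal(scale=0.5, size=(1,))   # parameters of the conditioner c(x1)
Wg = rng.normal(scale=0.5, size=(2,))   # parameters of the gating logit network

def coupling_forward(x1, x2):
    c = np.tanh(Wc[0] * x1)                     # fold location conditioned on x1
    z1, z2 = x1, np.abs(x2 - c)                 # z1 = x1, z2 = folded x2
    k = int(x2 >= c)                            # discrete latent: which side of the fold
    return z1, z2, k

def log_prob(x1, x2):
    z1, z2, k = coupling_forward(x1, x2)
    # Base density: standard normal on z1, half-normal on z2 (>= 0); |Jacobian| = 1.
    log_base = (-0.5 * z1 ** 2 - 0.5 * np.log(2 * np.pi)
                + 0.5 * np.log(2.0 / np.pi) - 0.5 * z2 ** 2)
    # Gating network: logit of p(k = 1 | z) from a linear map of z.
    logit = Wg[0] * z1 + Wg[1] * z2
    log_gate = -np.logaddexp(0.0, -logit) if k == 1 else -np.logaddexp(0.0, logit)
    return log_base + log_gate

print(log_prob(0.3, -1.2))   # log-density of one illustrative point
```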

3.3 Results

In each of these problems, Rad is consistently able to obtain higher log-likelihood than Real NVP.

Problem                  Rad      Real NVP
Grid Gaussian mixture
Ring Gaussian mixture
Two moons
Two circles
Spiral
Many moons

3.3.1 Sampling and Gaussianization

(a) Real NVP on grid Gaussian mixture.
(b) Real NVP on ring Gaussian mixture.
(c) Real NVP on two moons.
(d) Real NVP on two circles.
(e) Real NVP on spiral.
(f) Real NVP on many moons.
(g) Rad on grid Gaussian mixture.
(h) Rad on ring Gaussian mixture.
(i) Rad on two moons.
(j) Rad on two circles.
(k) Rad on spiral.
(l) Rad on many moons.
Figure 7: Comparison of samples from trained Real NVP (top rows, a-f) and Rad (bottom rows, g-l) models. Real NVP fails in this low capacity setting by attributing probability mass to regions where the data distribution has low density. Here, these regions often connect data clusters, illustrating the challenges that come with modeling multimodal data as one continuous manifold.

We plot samples (Figure 7) from the Rad and Real NVP models described above, trained on these problems. In this low capacity regime, Real NVP fails by attributing probability mass to regions where the data distribution has low density. This is consistent with the mode covering behavior of maximum likelihood. However, the particular inductive bias of Real NVP is to prefer modeling the data as one connected manifold. This results in the unwanted probability mass being distributed along the space between clusters.

Flow-based models often follow the principle of Gaussianization (Chen & Gopinath, 2001), i.e. transforming the data distribution into a standard Gaussian; inverting that process, starting from a Gaussian distribution, then approximately recovers the data distribution. We plot in Figure 8 the inferred Gaussianized variables z for both models trained on the ring Gaussian mixture problem. The Gaussianization from Real NVP leaves some areas of the standard Gaussian distribution unpopulated. These unattended areas correspond to unwanted regions of probability mass in the input space. Rad suffers significantly less from this problem.
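
For intuition, Gaussianization is easy to write down in one dimension, where it reduces to composing the data CDF with the inverse standard normal CDF (the probability integral transform); the bimodal example data below is purely illustrative and unrelated to the models above.

```python
# One-dimensional Gaussianization sketch: map samples x through an (empirical)
# CDF estimate, then through the inverse standard normal CDF, so the result
# is approximately standard normal. Inverting the map turns Gaussian noise
# back into (approximate) data samples.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Illustrative bimodal data: a mixture of two Gaussians.
x = np.concatenate([rng.normal(-3.0, 0.5, 5000), rng.normal(3.0, 0.5, 5000)])

# Empirical CDF evaluated at the sample points (ranks mapped into (0, 1)).
ranks = x.argsort().argsort()
u = (ranks + 0.5) / x.size
z = norm.ppf(u)                      # Gaussianized variables

print(z.mean(), z.std())             # ~ 0 and ~ 1: approximately standard normal
```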

(a) Real NVP Gaussianization.
(b) Rad Gaussianization.
Figure 8: Comparison of the Gaussianization process for Rad and Real NVP on the ring Gaussian mixture problem. Both plots show the image of data samples in the latent variables z, with level sets of the standard normal distribution plotted for reference. Real NVP leaves some areas of this Gaussian unpopulated, an effect which is not visually apparent for Rad.

An interesting observation is that Rad also seems to outperform Real NVP on the spiral dataset. One hypothesis is that the model successfully exploits some non-linear symmetries in this problem.

3.3.2 Folding

We take a deeper look at the Gaussianization process in both models. In Figure 9 we plot the inference process from x to z for both models trained on the two moons problem. As a result of a folding process similar to that in Montufar et al. (2014), several points which were far apart in the input space become neighbors in z in the case of Rad.

(a) Real NVP inference.
(b) Rad inference.
Figure 9: Comparison of the inference process for Rad and Real NVP on the two moons problem. Each pane shows input samples embedded in a different network layer, progressing from left to right from earlier to later layers. The points are colored according to their original position in the input space. In Rad, several points which were far apart in the input space become neighbors in z. This is not the case for Real NVP.

We further explore this folding process using the visualization described in Figure 10. We verify that the non-linear folding process induced by Rad plays at least two roles: bridging gaps in the distribution of probability mass, and exploiting symmetries in the data.

(a) Input points of a Rad layer. The red, green, and blue colors correspond to the different labels of the partition subsets (k values), i.e. the domains X_k of f_{|X_k} for different k, where the function f is non-invertible without knowing k (see (b)). The black points are in the invertible area, where k is not needed for the inversion.
(b) An example of a piecewise linear function used in a Rad layer. The red, green, and blue colors correspond to the different labels of the partition subsets in the non-invertible area. The dashed lines correspond to the non-invertible area in output space.
(c) Output points of a Rad layer. The red, green, and blue colors correspond to the different labels of the partition subsets in the non-invertible area of the input space, where points are folded on top of each other. The black points are in the invertible area, where k is not needed for the inversion. The dashed lines correspond to the non-invertible area in output space.
Figure 10: Understanding the folding process; this color scheme is reused in the other visualizations of the folding process.

We observe that in the case of the ring Gaussian mixture (Figure 11(a)), Rad effectively uses foldings in order to bridge the different modes of the distribution into a single mode, primarily in the last layers of the transformation. We contrast this with Real NVP (Figure 11(b)), which struggles to combine these modes under the standard Gaussian distribution using bijections.

In the spiral problem (Figure 12), Rad decomposes the spiral into three different lines which it then bridges (Figure 12(a)), instead of unrolling the manifold fully, which Real NVP struggles to do (Figure 12(b)).

In both cases, the points remain generally well separated by label k even after being pushed through a Rad layer (Figures 11(a) and 12(a)). This enables the model to maximize the conditional log-probability log p_{K|Z}(k | z).

4 Conclusion

We introduced an approach to tractably evaluate and train deep mixture models, using piecewise invertible maps as a folding mechanism. This approach allows exact inference, exact generation, and exact evaluation of log-likelihood, avoiding many of the issues that arise in previous discrete latent variable models. The method can easily be combined with other flow based architectural components, allowing flow based models to better model datasets with discrete as well as continuous structure.

(a) Rad folding strategy on the ring Gaussian mixture problem. The top rows correspond to each Rad layer's input points, and the bottom rows to its output points, as shown in Figure 10. The labels tend to be well separated in output space as well.
(b) Real NVP inference strategy on the ring Gaussian mixture problem. The points are colored according to their original position in the input space.
Figure 11: Rad and Real NVP inference processes on the ring Gaussian mixture problem. Each column corresponds to a Rad or affine coupling layer. Rad effectively uses foldings in order to bridge the multiple modes of the distribution into a single mode, primarily in the last layers of the transformation, whereas Real NVP struggles to bring together these modes under the standard Gaussian distribution using continuous bijections.
(a) Rad folding strategy on the spiral problem. The top rows correspond to each Rad layer's input points, and the bottom rows to its output points, as shown in Figure 10.
(b) Real NVP inference strategy on the spiral problem. The points are colored according to their original position in the input space.
Figure 12: Rad and Real NVP inference processes on the spiral problem. Each column corresponds to a Rad or affine coupling layer. Instead of unrolling the manifold as Real NVP tries to do, Rad uses a more successful strategy of decomposing the spiral into three different lines that it later bridges.

References

  • Baird et al. (2005) Leemon Baird, David Smalenberger, and Shawn Ingkiriwang. One-step neural network inversion with pdf learning and emulation. In International Joint Conference on Neural Networks, volume 2, pp. 966–971. IEEE, 2005.
  • Bauer & Mnih (2019) Matthias Bauer and Andriy Mnih. Resampled priors for variational autoencoders. In Proceedings of the twenty-second international conference on artificial intelligence and statistics, 2019.
  • Chen & Gopinath (2001) Scott Shaobing Chen and Ramesh A Gopinath. Gaussianization. In Advances in neural information processing systems, pp. 423–429, 2001.
  • Chen et al. (2018) Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6572–6583, 2018.
  • Courville et al. (2011) Aaron Courville, James Bergstra, and Yoshua Bengio. A spike and slab restricted Boltzmann machine. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 233–241, 2011.
  • Dinh et al. (2015) Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. In International Conference on Learning Representations: Workshop Track, 2015.
  • Dinh et al. (2017) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. In International Conference on Learning Representations, 2017.
  • Eghbal-zadeh et al. (2018) Hamid Eghbal-zadeh, Werner Zellinger, and Gerhard Widmer. Mixture density generative adversarial networks. Neural Information Processing Systems: Bayesian Deep Learning Workshop, 2018.
  • Falorsi et al. (2019) Luca Falorsi, Pim de Haan, Tim R. Davidson, and Patrick Forré. Reparameterizing distributions on lie groups. In Proceedings of the twenty-second international conference on artificial intelligence and statistics, 2019.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
  • Grathwohl et al. (2018) Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations, 2018.
  • Grathwohl et al. (2019) Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.
  • Hinton & Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2017.
  • Johnson et al. (2016) Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946–2954, 2016.
  • Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
  • Kingma & Dhariwal (2018) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236–10245, 2018.
  • Maddison et al. (2017) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
  • Mnih & Gregor (2014) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
  • Montufar et al. (2014) Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932, 2014.
  • Neal & Hinton (1998) Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pp. 355–368. Springer, 1998.
  • Rainforth et al. (2018) Tom Rainforth, Yuan Zhou, Xiaoyu Lu, Yee Whye Teh, Frank Wood, Hongseok Yang, and Jan-Willem van de Meent. Inference trees: Adaptive inference with exploration. arXiv preprint arXiv:1806.09550, 2018.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • Richardson & Weiss (2018) Eitan Richardson and Yair Weiss. On gans and gmms. In Advances in Neural Information Processing Systems, pp. 5852–5863, 2018.
  • Rippel & Adams (2013) Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013.
  • Rolfe (2017) Jason Tyler Rolfe. Discrete variational autoencoders. In International Conference on Learning Representations, 2017.
  • Rubinstein & Kroese (2016) Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method, volume 10. John Wiley & Sons, 2016.
  • Tabak & Turner (2013) EG Tabak and Cristina V Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
  • Tang et al. (2012) Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. Deep mixtures of factor analysers. In International Conference on Machine Learning, 2012.
  • Tenenbaum et al. (2000) Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • Tucker et al. (2017) George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pp. 2627–2636, 2017.
  • van den Oord & Dambre (2015) Aäron van den Oord and Joni Dambre. Locally-connected transformations for deep gmms. In International Conference on Machine Learning (ICML): Deep Learning Workshop, pp. 1–8, 2015.
  • Van den Oord & Schrauwen (2014) Aaron Van den Oord and Benjamin Schrauwen. Factoring variations in natural images with deep gaussian mixture models. In Advances in Neural Information Processing Systems, pp. 3518–3526, 2014.
  • van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.
  • Yeung et al. (2017) Serena Yeung, Anitha Kannan, Yann Dauphin, and Li Fei-Fei. Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643, 2017.

Appendix A Continuity

The standard approach to learning a deep probabilistic model has been stochastic gradient descent on the negative log-likelihood. Although the model enables the computation of a gradient almost everywhere, the log-likelihood is unfortunately discontinuous. Let's decompose the log-likelihood as

log p_X(x) = log p_Z(f(x)) + log p_{K|Z}(f_K(x) | f(x)) + log |det ∂f/∂x(x)|.

There are two sources of discontinuity in this expression: f_K is a function with discrete values (therefore discontinuous), and ∂f/∂x is discontinuous because of the transitions between the subsets X_k, leading to the expression of interest

log p_{K|Z}(f_K(x) | f(x)) + log |det ∂f/∂x(x)|,

which takes on a role similar to the log-Jacobian determinant; we refer to it as a pseudo log-Jacobian determinant.
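
To see what the continuity requirement amounts to in the simplest case, consider the scalar fold f(x) = |x| from the earlier sketches, where both slopes have magnitude 1 so the Jacobian term is constant. The decomposition above then implies that log p_X is continuous at the fold point x = 0 only if the gate assigns equal probability to the two pieces at z = 0. The snippet below checks this numerically; the specific gates are illustrative, and the general boundary conditions used in the paper (which also account for unequal slopes and for the extra linear pieces of Figure 13(b)) are not reproduced here.

```python
# Continuity of the pseudo log-Jacobian for f(x) = |x|: with equal slopes, the
# gate term log p(K = f_K(x) | f(x)) must match from both sides of x = 0,
# i.e. p(K=0 | z=0) = p(K=1 | z=0) = 1/2. A biased gate breaks continuity.
import numpy as np

def log_p_X(x, gate_bias):
    z = abs(x)
    k = 1 if x >= 0 else 0
    log_base = 0.5 * np.log(2.0 / np.pi) - 0.5 * z ** 2       # half-normal base on z
    logit = 1.5 * z + gate_bias                                # gate logit for k = 1
    log_gate = -np.logaddexp(0.0, -logit) if k == 1 else -np.logaddexp(0.0, logit)
    return log_base + log_gate                                 # log|f'(x)| = 0 for |.|

eps = 1e-6
for bias in (0.0, -0.5):                                       # balanced vs biased gate at z = 0
    jump = abs(log_p_X(-eps, bias) - log_p_X(eps, bias))
    print(f"gate bias {bias}: jump across x = 0 is {jump:.6f}")
# bias 0.0 -> jump ~ 0 (continuous); bias -0.5 -> jump ~ 0.5 (discontinuous).
```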

(a) A simple piecewise linear function with three linear pieces, which cannot respect the boundary conditions.
(b) A simple piecewise linear function with five linear pieces respecting the boundary conditions.
(c) The function associated with either of these piecewise linear functions.
Figure 13: Simple piecewise linear scalar function before (13(a)) and after (13(b)) respecting the boundary conditions. The colored areas correspond to the different indices of the mixture components, with lighter colors for non-invertible areas. The dashed lines correspond to the non-invertible area in the output space. In 13(c), we show the function resulting from these nonlinearities.

Let’s build from now on the simple scalar case and a piecewise linear function

or (with and , see figure12(a)), then and .

In this case, the gating distribution p_{K|Z}(· | z) can be seen as a vector valued function of z. We can attempt to parametrize the model such that the pseudo log-Jacobian determinant becomes continuous with respect to x by expressing a boundary condition at the transition between pieces.

If we define the parameters of the pieces appropriately, this boundary condition can be enforced, together with a similar one at the other transition, by replacing the function with

Another type of boundary condition can be found at the transition between the non-invertible area and the invertible area.

Since enforcing this condition directly would lead to an infinite loss barrier at the transition, another way to enforce this boundary condition is by adding linear pieces (Figure 13(b)):

The inverse is defined as

In order to obtain the values of p_{K|Z} at these boundaries, we can use the corresponding logit function.

Given those constraints, the model can then be reliably learned through gradient descent methods. Note that the tractability of the model results from the fact that the discrete variable k only interfaces during inference with the distribution p_{K|Z}, unlike discrete variational autoencoder approaches (Mnih & Gregor, 2014; van den Oord et al., 2017) where it is fed to a deep neural network. Similar to Rolfe (2017), the learning of the discrete variables is achieved by relying on the continuous component of the model, and, as opposed to other approaches (Jang et al., 2017; Maddison et al., 2017; Grathwohl et al., 2018; Tucker et al., 2017), the extracted gradient signal is exact and in closed form.

Appendix B Inference processes

We plot the inference processes of Rad and Real NVP on the problems not shown previously: grid Gaussian mixture (Figure 14), two circles (Figure 15), two moons (Figure 16), and many moons (Figure 17). We also compare the final results of the Gaussianization processes of both models on the different toy problems in Figure 18.

(a) Rad folding strategy on the grid Gaussian mixture problem. The top rows correspond to each Rad layer's input points, and the bottom rows to its output points, as shown in Figure 10.
(b) Real NVP inference strategy on the grid Gaussian mixture problem. The points are colored according to their original position in the input space.
Figure 14: Rad and Real NVP inference processes on the grid Gaussian mixture problem. Each column corresponds to a Rad or affine coupling layer.
(a) Rad folding strategy on the two circles problem. The top rows correspond to each Rad layer's input points, and the bottom rows to its output points, as shown in Figure 10.
(b) Real NVP inference strategy on the two circles problem. The points are colored according to their original position in the input space.
Figure 15: Rad and Real NVP inference processes on the two circles problem. Each column corresponds to a Rad or affine coupling layer.
(a) Rad folding strategy on the two moons problem. The top rows correspond to each Rad layer's input points, and the bottom rows to its output points, as shown in Figure 10.
(b) Real NVP inference strategy on the two moons problem. The points are colored according to their original position in the input space.
Figure 16: Rad and Real NVP inference processes on the two moons problem. Each column corresponds to a Rad or affine coupling layer.
(a) Rad folding strategy on the many moons problem. The top rows correspond to each Rad layer's input points, and the bottom rows to its output points, as shown in Figure 10.
(b) Real NVP inference strategy on the many moons problem. The points are colored according to their original position in the input space.
Figure 17: Rad and Real NVP inference processes on the many moons problem. Each column corresponds to a Rad or affine coupling layer.
(a) Real NVP on grid Gaussian mixture.
(b) Real NVP on ring Gaussian mixture.
(c) Real NVP on two moons.
(d) Real NVP on two circles.
(e) Real NVP on spiral.
(f) Real NVP on many moons.
(g) Rad on grid Gaussian mixture.
(h) Rad on ring Gaussian mixture.
(i) Rad on two moons.
(j) Rad on two circles.
(k) Rad on spiral.
(l) Rad on many moons.
Figure 18: Comparison of the Gaussianization from the trained Real NVP (top rows, a-f) and Rad (bottom rows, g-l) models. Real NVP fails in this low capacity setting by leaving unpopulated areas where the standard Gaussian attributes probability mass. Here, these areas are often the ones separating clusters, showing the failure of modeling the data as one manifold.