A fundamental question in representation learning relates to identifiability: when is it possible to recover the true latent representations that generate the observed data? Most existing approaches for deep generative modelling, such as Variational Autoencoders (VAEs) (Kingma and Welling, 2013) and flow-based methods (Kobyzev et al., 2019), focus on learning latent-variable distributions and generating realistic data samples, but do not address the question of identifiability, i.e. recovering the true latent representations.
The question of identifiability is closely related to the goal of learning disentangled representations (Bengio et al., 2013). A disentangled representation is one in which individual latent units are sensitive to changes in single generative factors, while being relatively invariant to nuisance factors (Bengio et al., 2013). A good representation of human faces, for example, should comprise different latent factors that separately encode different characteristics such as gender, hair color and facial expression. By aiming to recover the true latent representation, identifiable models also allow for principled disentanglement: rather than pursuing disentanglement learning in a completely unsupervised manner, we go a step further towards identifiability. Existing work on disentangled representation learning, such as beta-VAE (Higgins et al., 2017), beta-TCVAE (Chen et al., 2018), DIP-VAE (Kumar et al., 2017) and FactorVAE (Kim and Mnih, 2018), neither constitutes a general endeavor to achieve identifiability, nor provides theoretical guarantees on recovering the true latent sources.
Recently, Khemakhem et al. (2019) introduced a theory of identifiability for deep generative models, based upon which they proposed an identifiable variant of VAEs, called iVAE, to learn the distribution over latent variables in an identifiable manner. However, the downside of learning such an identifiable model within the VAE framework lies in the intractability of the KL divergence between the approximate posterior and the true posterior. Therefore, in both theory and practice, iVAE inevitably leads to a suboptimal solution, which, rigorously speaking, renders the learned model unidentifiable.
A normalizing flow is a transformation of a simple probability distribution (e.g. a standard normal) into a more complex probability distribution by a composition of invertible and differentiable mappings (Kobyzev et al., 2019). Hence, normalizing flows can be exploited to effectively model complex probability distributions. In contrast to VAEs, which rely on variational approximations, flow-based models allow for exact and efficient latent-variable inference and likelihood evaluation, making them a natural choice for identifiability.
To this end, unifying identifiability with flows, we propose iFlow, a framework for deep latent-variable models that allows for recovery of the true latent representations from which the observed data originate. We demonstrate that our flow-based model makes it possible to directly maximize the conditional marginal likelihood and thus achieves identifiability in a rigorous manner. We provide theoretical guarantees on the recovery of the true latent representations, and show experiments on synthetic data to validate the theoretical and practical advantages of our proposed formulation over previous approaches. We will release our source code shortly.
The objective of generative models is to model the data distribution, which can be arbitrarily complex. Normalizing flows are a family of generative models that learn an invertible mapping between the observed data and latent variables over which a tractable distribution is defined. Formally, let $\mathbf{x}$ be an observed random variable, and $\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})$ a latent variable with a tractable distribution. Let $\mathbf{f}$ be an invertible function such that $\mathbf{x} = \mathbf{f}(\mathbf{z})$. By using the change-of-variables formula, the probability density function (pdf) of $\mathbf{x}$ is given by

$$p_{\mathbf{x}}(\mathbf{x}) = p_{\mathbf{z}}(\mathbf{g}(\mathbf{x})) \left| \det \frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}} \right|,$$

where $\mathbf{g} = \mathbf{f}^{-1}$ is the inverse of $\mathbf{f}$. To approximate an arbitrarily complex nonlinear invertible bijection, we can compose a series of such functions, since the composition of invertible functions is also invertible, and its Jacobian determinant is the product of the individual functions' Jacobian determinants. Specifically, let $\mathbf{f}_1, \mathbf{f}_2, \dots, \mathbf{f}_L$ be a set of invertible functions with their corresponding inverses $\mathbf{g}_1, \mathbf{g}_2, \dots, \mathbf{g}_L$. Then, the pdf of $\mathbf{x}$ can be obtained by successively transforming $\mathbf{z}$ through the sequence of invertible functions $\mathbf{f}_i$'s:

$$p_{\mathbf{x}}(\mathbf{x}) = p_{\mathbf{z}}(\mathbf{z}) \prod_{i=1}^{L} \left| \det \frac{\partial \mathbf{g}_i(\mathbf{z}_i)}{\partial \mathbf{z}_i} \right|,$$

where $\mathbf{z}_0 = \mathbf{z}$, $\mathbf{z}_L = \mathbf{x}$, and $\mathbf{z}_{i-1} = \mathbf{g}_i(\mathbf{z}_i)$.
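To make the change-of-variables computation concrete, here is a minimal runnable sketch (not from the paper) that evaluates the log-density of a composed flow of hand-picked affine layers and checks it against the known closed-form answer:

```python
import numpy as np

# Minimal sketch of the change-of-variables formula for a composed flow,
# using illustrative invertible affine layers f_k(z) = a_k * z + b_k.

def flow_logpdf(x, layers):
    """log p_x(x) = log p_z(g(x)) + sum_k log|det dg_k/dx| for affine layers.

    `layers` is a list of (a, b) pairs defining f_k(z) = a * z + b;
    the inverse of each layer is g_k(x) = (x - b) / a.
    """
    z = x
    log_det = 0.0
    for a, b in reversed(layers):      # invert the composition f_L o ... o f_1
        z = (z - b) / a                # apply g_k
        log_det += -np.log(abs(a))     # log |dg_k/dx| = -log|a|
    # tractable base density: standard normal
    log_base = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
    return log_base + log_det

# Sanity check: a single layer f(z) = 2z + 1 maps N(0, 1) to N(1, 4),
# so the flow density at x = 3 must equal the N(1, 4) log-density at 3.
lp = flow_logpdf(3.0, [(2.0, 1.0)])
ref = -0.5 * ((3.0 - 1.0) ** 2) / 4.0 - 0.5 * np.log(2 * np.pi * 4.0)
assert abs(lp - ref) < 1e-12
```

The same bookkeeping (accumulating inverse maps and log-determinants layer by layer) is what practical flow implementations do with far more expressive layers.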
3 Related Work
Nonlinear ICA is a fundamental problem in unsupervised learning that has attracted a great amount of attention in recent years. Given the observations alone, it aims to recover the inverse mixing function as well as the corresponding independent sources. In contrast with the linear case, research on nonlinear ICA is hampered by the fact that, without auxiliary variables, recovering the independent latents is impossible (Hyvärinen and Pajunen, 1999). A similar impossibility result can be found in (Locatello et al., 2018). Fortunately, by exploiting additional temporal structure in the sources, recent work (Hyvarinen and Morioka, 2016; Hyvarinen et al., 2018) established the first identifiability results for deep latent-variable models. These approaches, however, do not explicitly learn the data distribution, nor are they capable of generating "fake" data.
Khemakhem et al. (2019) bridged this gap by establishing a principled connection between VAEs and an identifiable model for nonlinear ICA. Their identifiable VAE (known as iVAE) approximates the true joint distribution over observed and latent variables under mild conditions. However, due to the intractability of the KL divergence between the variational approximate posterior and the true posterior, iVAE maximizes the evidence lower bound on the data log-likelihood, which in both theory and practice inevitably leads to suboptimal identifying performance.
We instead propose identifying through flows (normalizing flows), which maximizes the likelihood directly, providing theoretical guarantees and practical advantages for identifiability.
Normalizing Flows Normalizing flows are a family of generative approaches that model a data distribution by learning a bijection from observations to latent codes, and vice versa. Compared with VAEs, which learn an approximation to the true posterior, normalizing flows deal directly with the marginal likelihood, offering exact inference while maintaining efficient sampling. Formally, a normalizing flow is a transformation of a tractable probability distribution into a complex distribution by composing a sequence of invertible and differentiable mappings. In practice, the challenge lies in designing a normalizing flow that satisfies the following conditions: (1) it should be bijective and thus invertible; (2) its inverse and its Jacobian determinant should be efficient to compute while maintaining sufficient expressive capacity.
The framework of normalizing flows was first defined in (Tabak et al., 2010) and (Tabak and Turner, 2013), and then explored for density estimation in (Rippel and Adams, 2013). Rezende and Mohamed (2015) applied normalizing flows to variational inference by introducing planar and radial flows. Since then, various flows have been proposed. Kingma and Dhariwal (2018) parameterize linear flows with the LU factorization and invertible 1x1 convolutions for the sake of efficient determinant calculation and invertibility of the convolution operations. Despite their limited expressive capabilities, linear flows serve as essential building blocks of affine coupling flows as in (Dinh et al., 2014, 2016). Kingma et al. (2016) applied autoregressive models as a form of normalizing flows, which exhibit strong expressiveness in modelling statistical dependencies among variables. However, the forward operation of autoregressive models is inherently sequential, which makes it inefficient for training. Splines have also been used as building blocks of normalizing flows: Müller et al. (2018) suggested modelling a linear or quadratic spline as the integral of a univariate monotonic function for flow construction. Durkan et al. (2019a) proposed a natural extension to the framework of neural importance sampling, and also suggested modelling a coupling layer as a monotonic rational-quadratic spline (Durkan et al., 2019b), which can be implemented either with a coupling architecture, RQ-NSF(C), or with an autoregressive architecture, RQ-NSF(AR).
The expressive capabilities of normalizing flows and their theoretical guarantee of invertibility make them a natural choice for recovering the true mixing mapping from sources to observations, and thus identifiability can be rigorously achieved. In our work, we show that by introducing normalizing flows it is possible to learn an identifiable latent-variable model with theoretical guarantees of identifiability.
4 Identifiable Flow
In this section, we first introduce the identifiable latent-variable family and the theory of identifiability that makes it possible to recover the joint distribution between observations and latent variables. Then we derive our model, iFlow, and its optimization objective which leads to principled disentanglement with theoretical guarantees of identifiability.
4.1 Identifiable Latent-variable Family
The primary assumption leading to identifiability is a conditionally factorized prior distribution over the latent variables, $p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{u})$, where $\mathbf{u}$ is an auxiliary variable, which can be the time index in a time series, a categorical label, or an additionally observed variable (Khemakhem et al., 2019).
Formally, let $\mathbf{x} \in \mathbb{R}^d$ and $\mathbf{u} \in \mathbb{R}^m$ be two observed random variables, and $\mathbf{z} \in \mathbb{R}^n$ a latent variable that is the source of $\mathbf{x}$. This implies that there can be an arbitrarily complex nonlinear mapping $\mathbf{f}: \mathcal{Z} \to \mathcal{X}$. Assuming that $\mathbf{f}$ is a bijection, it is desirable to recover its inverse by approximating it using a family of invertible mappings $\mathbf{h}_{\boldsymbol{\theta}}$ parameterized by $\boldsymbol{\theta}$. The statistical dependencies among these random variables are defined by a Bayesian net: $\mathbf{u} \to \mathbf{z} \to \mathbf{x}$, from which the following conditional generative model can be derived:

$$p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z} \mid \mathbf{u}) = p_{\mathbf{f}}(\mathbf{x} \mid \mathbf{z})\, p_{\mathbf{T}, \boldsymbol{\lambda}}(\mathbf{z} \mid \mathbf{u}),$$

where $\boldsymbol{\theta} = (\mathbf{f}, \mathbf{T}, \boldsymbol{\lambda})$ and $p_{\mathbf{T}, \boldsymbol{\lambda}}(\mathbf{z} \mid \mathbf{u})$ is assumed to be a factorized exponential family distribution conditioned upon $\mathbf{u}$. Note that this density assumption is valid in most cases, since the exponential families have universal approximation capabilities (Sriperumbudur et al., 2017). Specifically, the probability density function is given by

$$p_{\mathbf{T}, \boldsymbol{\lambda}}(\mathbf{z} \mid \mathbf{u}) = \prod_{i=1}^{n} \frac{Q_i(z_i)}{Z_i(\mathbf{u})} \exp\left[\sum_{j=1}^{k} T_{i,j}(z_i)\, \lambda_{i,j}(\mathbf{u})\right], \tag{2}$$

where $Q_i$ is the base measure, $Z_i(\mathbf{u})$ is the normalizing constant, the $T_{i,j}$'s are the components of the sufficient statistic $\mathbf{T}$, and the $\lambda_{i,j}(\mathbf{u})$'s the natural parameters, critically depending on $\mathbf{u}$. Note that $k$ indicates the maximum order of statistics under consideration.
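As a concrete illustration of this density (not from the paper), the following sketch evaluates the factorized exponential-family log-density for $k = 2$ with $Q_i = 1$. With a strictly negative second-order natural parameter, each factor is a Gaussian in disguise, which gives a closed-form normalizer to check against:

```python
import numpy as np

# Hedged sketch of the conditionally factorized exponential-family prior with
# k = 2 and Q_i = 1: sufficient statistics T(z_i) = (z_i, z_i^2) and natural
# parameters (lam1, lam2) per dimension. All names here are illustrative.

def log_prior(z, lam1, lam2):
    """log p(z|u) = sum_i [ T(z_i) . lambda_i(u) - log Z_i(u) ], with Q_i = 1.

    For T = (z, z^2), Z_i integrates exp(lam1 z + lam2 z^2) over R:
    Z_i = sqrt(pi / -lam2) * exp(lam1^2 / (-4 lam2)), valid when lam2 < 0.
    """
    log_Z = 0.5 * np.log(np.pi / -lam2) + lam1**2 / (-4.0 * lam2)
    return np.sum(lam1 * z + lam2 * z**2 - log_Z)

# lam1 = mu/sigma^2, lam2 = -1/(2 sigma^2) recovers N(mu, sigma^2):
mu, sigma = 1.0, 2.0
lam1, lam2 = mu / sigma**2, -1.0 / (2 * sigma**2)
z = np.array([0.5])
lp = log_prior(z, np.array([lam1]), np.array([lam2]))
ref = -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
assert abs(lp - ref[0]) < 1e-12
```

The Gaussian correspondence also makes the constraint discussed later explicit: the density is normalizable only when the second-order natural parameter is strictly negative.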
4.2 Identifiability Theory
The objective of identifiability is to learn a model subject to

$$p_{\boldsymbol{\theta}}(\mathbf{x}) = p_{\boldsymbol{\theta}'}(\mathbf{x}) \implies \boldsymbol{\theta} = \boldsymbol{\theta}',$$

where $\boldsymbol{\theta}$ and $\boldsymbol{\theta}'$ are two different choices of model parameters that imply the same marginal density. One possible way to achieve this objective is to introduce the definition of identifiability up to equivalence class:
Definition 4.1 (Identifiability up to equivalence class). Let $\sim$ be an equivalence relation on $\Theta$. A model defined by $p_{\boldsymbol{\theta}}(\mathbf{x})$ is said to be identifiable up to $\sim$ if

$$p_{\boldsymbol{\theta}}(\mathbf{x}) = p_{\boldsymbol{\theta}'}(\mathbf{x}) \implies \boldsymbol{\theta} \sim \boldsymbol{\theta}',$$

where the equivalence relation $\sim_A$ in the identifiable latent-variable family is defined as follows: $(\mathbf{f}, \mathbf{T}, \boldsymbol{\lambda})$ and $(\tilde{\mathbf{f}}, \tilde{\mathbf{T}}, \tilde{\boldsymbol{\lambda}})$ are of the same equivalence class if and only if there exist an invertible matrix $A$ and a vector $\mathbf{c}$ such that, for all $\mathbf{x}$ in the data space,

$$\mathbf{T}(\mathbf{f}^{-1}(\mathbf{x})) = A\, \tilde{\mathbf{T}}(\tilde{\mathbf{f}}^{-1}(\mathbf{x})) + \mathbf{c}.$$
One can easily verify that $\sim_A$ is an equivalence relation by showing its reflexivity, symmetry and transitivity. Then, the identifiability of the latent-variable family is given by Theorem 4.1 (Khemakhem et al., 2019).
Theorem 4.1. Let $\boldsymbol{\theta} = (\mathbf{f}, \mathbf{T}, \boldsymbol{\lambda})$ and suppose the following holds: (i) the set $\{\mathbf{x} \in \mathcal{X} : \varphi_{\varepsilon}(\mathbf{x}) = 0\}$ has measure zero, where $\varphi_{\varepsilon}$ is the characteristic function of the density $p_{\varepsilon}$; (ii) the sufficient statistics $T_{i,j}$ in (2) are differentiable almost everywhere and $\partial T_{i,j}/\partial z \neq 0$ almost surely for $z \in \mathcal{Z}$ and for all $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, k\}$; (iii) there exist $nk + 1$ distinct priors $\mathbf{u}_0, \dots, \mathbf{u}_{nk}$ such that the matrix

$$L = \big(\boldsymbol{\lambda}(\mathbf{u}_1) - \boldsymbol{\lambda}(\mathbf{u}_0), \; \dots, \; \boldsymbol{\lambda}(\mathbf{u}_{nk}) - \boldsymbol{\lambda}(\mathbf{u}_0)\big)$$

of size $nk \times nk$ is invertible. Then, the parameters $\boldsymbol{\theta} = (\mathbf{f}, \mathbf{T}, \boldsymbol{\lambda})$ are $\sim_A$-identifiable.
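Condition (iii) can be checked numerically for a concrete set of segment-dependent natural parameters. The sketch below (illustrative, not the paper's construction) draws $nk + 1$ random priors and tests the invertibility of the resulting difference matrix:

```python
import numpy as np

# Hedged numerical check of condition (iii): for n sources and k statistics,
# draw nk + 1 distinct per-segment natural-parameter vectors and verify that
# the nk x nk matrix of differences has full rank. Names are illustrative.

def condition_iii_holds(lams):
    """lams: array of shape (nk + 1, nk), one flattened lambda(u) per prior."""
    L = (lams[1:] - lams[0]).T             # columns: lambda(u_t) - lambda(u_0)
    return np.linalg.matrix_rank(L) == L.shape[0]

n, k = 2, 2
rng = np.random.default_rng(0)
lams = rng.normal(size=(n * k + 1, n * k))  # random priors: full rank a.s.
assert condition_iii_holds(lams)

# Degenerate case: identical priors across all segments fail the condition,
# which matches the intuition that the prior must vary sufficiently with u.
assert not condition_iii_holds(np.ones((n * k + 1, n * k)))
```

Randomly drawn natural parameters satisfy the condition almost surely, which is why the synthetic experiments later generate them independently per segment.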
4.3 Optimization Objective of iFlow
We propose identifying through flows (iFlow) for recovering latent representations. Our proposed model falls into the identifiable latent-variable family with $p_{\varepsilon} = \delta$, that is, $p_{\mathbf{f}}(\mathbf{x} \mid \mathbf{z}) = \delta(\mathbf{x} - \mathbf{f}(\mathbf{z}))$, where $\delta$ is a point mass, i.e. a Dirac measure. Note that assumption (i) in Theorem 4.1 holds true for iFlow. In stark contrast to iVAE, which resorts to variational approximations and maximizes the evidence lower bound, iFlow directly maximizes the marginal likelihood conditioned on $\mathbf{u}$:

$$\max_{\boldsymbol{\theta}} \; \mathbb{E}_{(\mathbf{x}, \mathbf{u})}\big[\log p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{u})\big],$$

where $p_{\boldsymbol{\theta}}(\mathbf{z} \mid \mathbf{u})$ is modeled by a factorized exponential family distribution. Therefore, the log marginal likelihood is given by

$$\log p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{u}) = \operatorname{tr}\big(\mathbf{T}(\mathbf{z})\, \boldsymbol{\lambda}(\mathbf{u})^{\mathsf{T}}\big) + \sum_{i=1}^{n} \big(\log Q_i(z_i) - \log Z_i(\mathbf{u})\big) + \log\left|\det\frac{\partial \mathbf{h}_{\boldsymbol{\theta}}(\mathbf{x})}{\partial \mathbf{x}}\right|,$$

where $z_i$ is the $i$th component of the source $\mathbf{z} = \mathbf{h}_{\boldsymbol{\theta}}(\mathbf{x})$, and $\mathbf{T}(\mathbf{z})$ and $\boldsymbol{\lambda}(\mathbf{u})$ are both $n$-by-$k$ matrices. Here, $\mathbf{h}_{\boldsymbol{\theta}}$ is a normalizing flow of any kind. For the sake of simplicity, we set $Q_i(z_i) = 1$ for all $i$'s and consider the maximum order of sufficient statistics of the $z_i$'s up to 2, that is, $k = 2$. Hence, $\mathbf{T}(\mathbf{z})$ and $\boldsymbol{\lambda}(\mathbf{u})$ are given by

$$\mathbf{T}(\mathbf{z}) = \begin{pmatrix} z_1 & z_1^2 \\ \vdots & \vdots \\ z_n & z_n^2 \end{pmatrix}, \qquad \boldsymbol{\lambda}(\mathbf{u}) = \begin{pmatrix} \lambda_{1,1}(\mathbf{u}) & \lambda_{1,2}(\mathbf{u}) \\ \vdots & \vdots \\ \lambda_{n,1}(\mathbf{u}) & \lambda_{n,2}(\mathbf{u}) \end{pmatrix}.$$
Therefore, the optimization objective is to minimize

$$\mathcal{L}(\boldsymbol{\theta}) = -\, \mathbb{E}_{(\mathbf{x}, \mathbf{u}) \sim p_{\mathcal{D}}}\left[\log p_{\mathbf{T}, \boldsymbol{\lambda}}(\mathbf{z} \mid \mathbf{u}) + \log\left|\det\frac{\partial \mathbf{h}_{\boldsymbol{\theta}}(\mathbf{x})}{\partial \mathbf{x}}\right|\right], \tag{10}$$

where $p_{\mathcal{D}}$ denotes the empirical distribution, and the first term in (10) is given by

$$\log p_{\mathbf{T}, \boldsymbol{\lambda}}(\mathbf{z} \mid \mathbf{u}) = \operatorname{tr}\big(\mathbf{T}(\mathbf{z})\, \boldsymbol{\lambda}(\mathbf{u})^{\mathsf{T}}\big) - \sum_{i=1}^{n} \log Z_i(\mathbf{u}).$$

$\boldsymbol{\lambda}(\mathbf{u})$ can be modelled by a multi-layer perceptron with learnable parameters $\boldsymbol{\theta}_{\boldsymbol{\lambda}}$, where $\mathbf{u} \in \mathbb{R}^m$. Here, $m$ is the dimension of the space in which the $\mathbf{u}$'s lie. Note that $\lambda_{i,2}$ should be strictly negative in order for the exponential family's probability density function to be finite. A negative softplus activation can be exploited to enforce this constraint. Therefore, the optimization objective has the following closed form:

$$\mathcal{L}(\boldsymbol{\theta}) = -\, \mathbb{E}_{(\mathbf{x}, \mathbf{u}) \sim p_{\mathcal{D}}}\left[\operatorname{tr}\big(\mathbf{T}(\mathbf{h}_{\boldsymbol{\theta}}(\mathbf{x}))\, \boldsymbol{\lambda}(\mathbf{u})^{\mathsf{T}}\big) - \sum_{i=1}^{n} \log Z_i(\mathbf{u}) + \log\left|\det\frac{\partial \mathbf{h}_{\boldsymbol{\theta}}(\mathbf{x})}{\partial \mathbf{x}}\right|\right]. \tag{12}$$
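The closed-form objective can be sketched in a few lines of Python. The snippet below is an illustrative toy, not the paper's implementation: it stands an elementwise affine map in for the RQ-NSF flow, uses a linear map of the one-hot segment index in place of the MLP for the natural parameters, and applies the negative softplus constraint on the second-order parameters:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def iflow_loss(x, u, a, b, W1, W2):
    """-E[log p(x|u)] for a toy elementwise affine flow z = a * x + b.

    x: (batch, n) observations; u: (batch, m) one-hot segment indices;
    W1, W2: (m, n) per-segment raw natural-parameter tables (illustrative).
    """
    z = a * x + b                            # z = h_theta(x)
    log_det = np.sum(np.log(np.abs(a)))      # log |det J_h|, same per sample
    lam1 = u @ W1                            # first-order natural parameters
    lam2 = -softplus(u @ W2)                 # second-order, forced negative
    # log Z_i(u) in closed form for T = (z, z^2), Q_i = 1, lam2 < 0
    log_Z = np.sum(0.5 * np.log(np.pi / -lam2) + lam1**2 / (-4.0 * lam2), axis=1)
    energy = np.sum(lam1 * z + lam2 * z**2, axis=1) - log_Z
    return -np.mean(energy + log_det)

# Sanity check: with an identity flow, lam1 = 0 and lam2 = -1/2 (raw weight
# softplus^{-1}(1/2)), the loss equals the average standard-normal NLL.
x = np.array([[0.3], [-1.2]])
u = np.ones((2, 1))
raw = np.log(np.exp(0.5) - 1.0)              # softplus(raw) = 0.5
loss = iflow_loss(x, u, np.ones(1), np.zeros(1), np.zeros((1, 1)), np.array([[raw]]))
ref = np.mean(0.5 * x**2 + 0.5 * np.log(2 * np.pi))
assert abs(loss - ref) < 1e-9
```

In the actual model the affine map would be replaced by an expressive invertible network and the objective minimized by gradient descent; the structure of the loss, however, is exactly the three terms above.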
4.4 Identifiability of iFlow
The identifiability of our proposed model, iFlow, is characterized by Theorem 4.2.
Theorem 4.2. Minimizing $\mathcal{L}(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$, in the limit of infinite data, learns a model that is $\sim_A$-identifiable.
Proof. Minimizing $\mathcal{L}(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ is equivalent to maximizing the log conditional likelihood, $\log p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{u})$. Given an infinite amount of data, maximizing it will give us the true marginal likelihood conditioned on $\mathbf{u}$, that is, $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{u}) = p_{\boldsymbol{\theta}^*}(\mathbf{x} \mid \mathbf{u})$, where $\boldsymbol{\theta}^*$ is the true parameter. According to Theorem 4.1, we obtain that $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^*$ are of the same equivalence class defined by $\sim_A$. Thus, according to Definition 4.1, the joint distribution parameterized by $\boldsymbol{\theta}$ is identifiable up to $\sim_A$. ∎
Consequently, Theorem 4.2 guarantees strong identifiability of our proposed generative model, iFlow. Note that unlike Theorem 3 in (Khemakhem et al., 2019), Theorem 4.2 makes no assumption that the family of approximate posterior distributions contains the true posterior, and we show in the experiments that this assumption is unlikely to hold empirically.
To evaluate our method, we run simulations on a synthetic dataset. This section elaborates on the details of the generated dataset, the implementation, the evaluation metric, and a fair comparison with existing methods.
We generate a synthetic dataset in which the sources are non-stationary Gaussian time series, as described in (Khemakhem et al., 2019): the sources are divided into $M$ segments of $L$ samples each. The auxiliary variable $\mathbf{u}$ is set to be the segment index. For each segment, the conditional prior distribution is chosen from the exponential family (2) with $k = 2$, $Q_i(z_i) = 1$, $T_{i,1}(z_i) = z_i$ and $T_{i,2}(z_i) = z_i^2$, and the true $\lambda_{i,j}$'s randomly and independently generated across segments. The sources to recover are mixed by an invertible multi-layer perceptron (MLP) whose weight matrices are ensured to be full rank.
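A hedged sketch of this generating procedure (with illustrative shapes and constants, not the paper's exact settings) might look as follows:

```python
import numpy as np

# Illustrative sketch of the synthetic data generation: n-dim non-stationary
# Gaussian sources split into M segments of L samples, each segment with its
# own randomly drawn mean and variance, mixed by an invertible leaky-ReLU MLP
# whose square weight matrices are full rank with probability 1.

def generate_data(n=2, M=5, L=100, depth=3, seed=1):
    rng = np.random.default_rng(seed)
    sources, labels = [], []
    for seg in range(M):
        mu = rng.uniform(-3.0, 3.0, size=n)      # per-segment mean
        sd = rng.uniform(0.5, 2.0, size=n)       # per-segment std
        sources.append(rng.normal(mu, sd, size=(L, n)))
        labels.append(np.full(L, seg))
    z = np.concatenate(sources)                  # (M*L, n) true sources
    u = np.eye(M)[np.concatenate(labels)]        # one-hot segment index
    x = z
    for _ in range(depth):                       # invertible mixing MLP
        W = rng.normal(size=(n, n))              # random square weights
        x = x @ W
        x = np.where(x > 0, x, 0.1 * x)          # leaky ReLU (invertible)
    return x, u, z

x, u, z = generate_data()
assert x.shape == (500, 2) and u.shape == (500, 5)
```

Leaky ReLU is used here because, unlike ReLU, it is elementwise invertible, so the whole mixing MLP remains a bijection.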
5.2 Implementation Details
The mapping $\boldsymbol{\lambda}(\mathbf{u})$ that outputs the natural parameters of the conditionally factorized exponential family is modeled by a multi-layer perceptron with a softplus activation in the last layer. Additionally, a negative activation is applied to the second-order natural parameters in order to ensure that the density is finite. The bijection $\mathbf{h}_{\boldsymbol{\theta}}$ is modeled by RQ-NSF(AR) (Durkan et al., 2019b) with a flow length of 10 and 8 bins, which gives rise to sufficient flexibility and expressiveness. For each training iteration, we use a mini-batch of size 64 and an Adam optimizer, with the learning rate chosen from a small candidate set, to optimize the learning objective (12).
5.3 Evaluation Metric
As a standard measure used in ICA, the mean correlation coefficient (MCC) between the original sources and the corresponding predicted latents is chosen as the evaluation metric. A high MCC indicates a strong correlation between the recovered latents and the true sources. In our experiments, we found that this metric can be sensitive to the random seeds used to generate the synthetic data. We argue that unless one fully specifies the generating procedure, including the random seeds in particular, any comparison remains debatable; this is crucially important, since most existing works fail to do so. Therefore, we run each simulation of the different methods with seeds 1 through 100 and report averaged MCCs with standard deviations, which makes the comparison fair and meaningful.
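The MCC computation can be sketched as follows; this follows the standard ICA recipe (absolute Pearson correlations matched by the Hungarian algorithm), and the details may differ from the authors' exact implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Sketch of the mean correlation coefficient (MCC): compute the absolute
# Pearson correlation between every (true source, recovered latent) pair,
# find the best one-to-one matching with the Hungarian algorithm, and
# average the matched correlations.

def mcc(sources, latents):
    n = sources.shape[1]
    corr = np.abs(np.corrcoef(sources.T, latents.T)[:n, n:])  # (n, n) |rho|
    rows, cols = linear_sum_assignment(-corr)   # maximize total correlation
    return corr[rows, cols].mean()

# A permuted, sign-flipped, rescaled copy of the sources has MCC = 1, since
# Pearson correlation is invariant to such trivial indeterminacies:
rng = np.random.default_rng(0)
s = rng.normal(size=(1000, 3))
recovered = s[:, [2, 0, 1]] * np.array([-2.0, 0.5, 3.0])
assert abs(mcc(s, recovered) - 1.0) < 1e-9
```

The final check illustrates why MCC is a sensible metric for identifiability up to the equivalence class: scaling, sign and permutation indeterminacies do not reduce the score.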
5.4 Comparison and Results
We compare our model, iFlow, with iVAE. The two models are trained on the same synthetic dataset described above, with identical settings for the data dimension, the number of segments, and the number of samples per segment. For visualization, we also apply a second setting of these parameters. To evaluate iVAE's identifying performance, we use the officially released original implementation (https://github.com/ilkhem/iVAE/) with exactly the same settings as described in (Khemakhem et al., 2019).
First, we demonstrate a visualization of the identifiability of the two models in a 2-D case ($n = 2$), as illustrated in Figure 1, in which we plot the original sources (latent), the observations, and the identified sources recovered by iFlow and iVAE, respectively. Segments are marked with different colors. Clearly, iFlow outperforms iVAE in identifying the original sources while maintaining the original geometry of the source manifold. It is evident that the learned prior of iFlow bears a much higher resemblance to the generating prior than that of iVAE, up to some trivial indeterminacies of scaling, global sign and permutation of the original sources, which are inevitable even in some cases of linear ICA. This is consistent with the definition of identifiability up to equivalence class, which allows for an affine transformation between sufficient statistics, as described in Definition 4.1. As shown in the corresponding panels of Figure 1, iVAE achieves inferior identifying performance in the sense that its estimated priors tend to retain the manifold of the observations. Notably, we also find that despite the relatively high MCC performance of iVAE in Figure 1, iFlow is much more likely to recover the true geometric manifold in which the latent sources lie. In Figure 1, iVAE's estimated prior collapses in a highly nonlinear mixing case, while iFlow still identifies the sources well. Note that these are not rare occurrences. More visualization examples can be found in Appendix 8.
Second, regarding the quantitative results shown in Figure 2, our model, iFlow, consistently outperforms iVAE in MCC by a considerable margin across all random seeds under consideration, while exhibiting less uncertainty (standard deviations indicated in brackets). Moreover, Figure 2 also demonstrates that the energy value attained by iFlow is much higher than that of iVAE, which serves as evidence that optimizing the evidence lower bound, as in iVAE, leads to suboptimal identifiability. The gap between the evidence lower bound and the conditional marginal likelihood is far from negligible in practice. For a finer-grained analysis, we also report the correlation coefficients for each source-latent pair in each dimension. As shown in Figure 3, iFlow exhibits much stronger correlation than iVAE in every single dimension of the latent space.
Finally, we investigate the impact of different choices of activation for generating the natural parameters of the exponential family distribution (see Appendix A.1 for details). All of these choices are valid, since the natural parameters theoretically form a convex space. However, iFlow(Softplus) achieves the highest identifying performance, suggesting that the range of softplus allows for greater flexibility, which makes it the best choice for our network design.
Among the most significant goals of unsupervised learning is learning disentangled representations of observed data, or identifying the original latent codes that generate the observations (i.e. identifiability). Bridging the theoretical and practical gap of rigorous identifiability, we propose identifying through flows, which directly maximizes the marginal likelihood conditioned on auxiliary variables, establishing a natural framework for recovering the original independent sources. In theory, our contribution provides a rigorous proof of identifiability, and hence of the recovery of the joint distribution between observed and latent variables, which leads to principled disentanglement. Empirically, our approach also shows practical advantages over previous methods.
- Y. Bengio, A. Courville, and P. Vincent (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
- T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018). Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
- L. Dinh, D. Krueger, and Y. Bengio (2014). NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
- L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
- C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios (2019a). Cubic-spline flows. arXiv preprint arXiv:1906.02145.
- C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios (2019b). Neural spline flows. arXiv preprint arXiv:1906.04032.
- I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017). beta-VAE: learning basic visual concepts with a constrained variational framework. In ICLR.
- A. Hyvarinen and H. Morioka (2016). Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, pp. 3765–3773.
- A. Hyvärinen and P. Pajunen (1999). Nonlinear independent component analysis: existence and uniqueness results. Neural Networks 12 (3), pp. 429–439.
- A. Hyvarinen, H. Sasaki, and R. E. Turner (2018). Nonlinear ICA using auxiliary variables and generalized contrastive learning. arXiv preprint arXiv:1805.08651.
- I. Khemakhem, D. P. Kingma, and A. Hyvärinen (2019). Variational autoencoders and nonlinear ICA: a unifying framework. arXiv preprint arXiv:1907.04809.
- H. Kim and A. Mnih (2018). Disentangling by factorising. arXiv preprint arXiv:1802.05983.
- D. P. Kingma and M. Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- D. P. Kingma and P. Dhariwal (2018). Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
- D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
- I. Kobyzev, S. Prince, and M. A. Brubaker (2019). Normalizing flows: introduction and ideas. arXiv preprint arXiv:1908.09257.
- A. Kumar, P. Sattigeri, and A. Balakrishnan (2017). Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848.
- F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem (2018). Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359.
- T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák (2018). Neural importance sampling. arXiv preprint arXiv:1808.03856.
- D. J. Rezende and S. Mohamed (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
- O. Rippel and R. P. Adams (2013). High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125.
- B. Sriperumbudur, K. Fukumizu, A. Gretton, A. Hyvärinen, and R. Kumar (2017). Density estimation in infinite dimensional exponential families. The Journal of Machine Learning Research 18 (1), pp. 1830–1888.
- E. G. Tabak and C. V. Turner (2013). A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66 (2), pp. 145–164.
- E. G. Tabak and E. Vanden-Eijnden (2010). Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences 8 (1), pp. 217–233.