The ability to learn useful representations of data with little or no supervision is a key challenge towards applying artificial intelligence to the vast amounts of unlabelled data collected in the world. While it is clear that the usefulness of a representation learned on data heavily depends on the end task it is to be used for, one could imagine that there exist properties of representations which are useful for many real-world tasks simultaneously. In a seminal paper on representation learning, Bengio et al. (2013a) proposed such a set of meta-priors. The meta-priors are derived from general assumptions about the world, such as the hierarchical organization or disentanglement of explanatory factors, the possibility of semi-supervised learning, the concentration of data on low-dimensional manifolds, clusterability, and temporal and spatial coherence.
Recently, a variety of (unsupervised) representation learning algorithms have been proposed based on the idea of autoencoding where the goal is to learn a mapping from high-dimensional observations to a lower-dimensional representation space such that the original observations can be reconstructed (approximately) from the lower-dimensional representation. While these approaches have varying motivations and design choices, we argue that essentially all of the methods reviewed in this paper implicitly or explicitly have at their core at least one of the meta-priors from Bengio et al. (2013a).
Given the unsupervised nature of the upstream representation learning task, the characteristics of the meta-priors enforced in the representation learning step determine how useful the resulting representation is for the real-world end task. Hence, it is critical to understand which meta-priors are targeted by which models and which generic techniques are useful to enforce a given meta-prior. In this paper, we provide a unified view which encompasses the majority of proposed models and relate them to the meta-priors proposed by Bengio et al. (2013a). We summarize the recent work focusing on the meta-priors in Table 1.
Meta-priors of Bengio et al. (2013a).
Meta-priors capture very general premises about the world and are therefore arguably useful for a broad set of downstream tasks. We briefly summarize the most important meta-priors which are targeted by the reviewed approaches.
Disentanglement: Assuming that the data is generated from independent factors of variation, for example object orientation and lighting conditions in images of objects, disentanglement as a meta-prior encourages these factors to be captured by different independent variables in the representation. It should result in a concise abstract representation of the data useful for a variety of downstream tasks and promises improved sample efficiency.
Hierarchical organization of explanatory factors: The intuition behind this meta-prior is that the world can be described as a hierarchy of increasingly abstract concepts. For example, natural images can be abstractly described in terms of the objects they show at various levels of granularity. Given the object, a more concrete description can be given in terms of its attributes.
Semi-supervised learning: The idea is to share a representation between a supervised and an unsupervised learning task, which often leads to synergies: While the number of labeled data points is usually too small to learn a good predictor (and thereby a representation), training jointly with an unsupervised objective not only yields a representation that generalizes, but also guides the representation learning process.
Clustering structure: Many real-world data sets have multi-category structure (such as images showing different object categories), with possibly category-dependent factors of variation. Such structure can be captured with a latent mixture model where each mixture component corresponds to one category, and its distribution models the factors of variation within that category. This naturally leads to a representation with clustering structure.
Very generic concepts such as smoothness as well as temporal and spatial coherence are not specific to unsupervised learning and are used in most practical setups (for example weight decay to encourage smoothness of predictors, and convolutional layers to capture spatial coherence in image data). We discuss the implicit supervision used by most approaches in Section 7.
| Meta-prior | Methods |
|---|---|
| Disentanglement | β-VAE (6) Higgins et al. (2017), FactorVAE (8) Kim and Mnih (2018), β-TCVAE (3.1.2) Chen et al. (2018), InfoVAE (9) Zhao et al. (2017a), DIP-VAE (11) Kumar et al. (2018), HSIC-VAE (12) Lopez et al. (2018), HFVAE (13) Esmaeili et al. (2018), VIB Alemi et al. (2016), Information dropout (15) Achille and Soatto (2018), DC-IGN Kulkarni et al. (2015), FaderNetworks (18) Lample et al. (2017), VFAE (17) Louizos et al. (2016) |
| Hierarchical representation¹ | PixelVAE Gulrajani et al. (2017), LVAE Sønderby et al. (2016), VLaAE Zhao et al. (2017b), Semi-supervised VAE Kingma et al. (2014), PixelGAN-AE Makhzani and Frey (2017), VLAE Chen et al. (2017), VQ-VAE van den Oord et al. (2017) |
| Semi-supervised learning | Semi-supervised VAE Kingma et al. (2014), Narayanaswamy et al. (2017), PixelGAN-AE (14) Makhzani and Frey (2017), AAE (16) Makhzani et al. (2015) |
| Clustering | PixelGAN-AE (14) Makhzani and Frey (2017), AAE (16) Makhzani et al. (2015), JointVAE Dupont (2018), SVAE Johnson et al. (2016) |

¹ While PixelGAN-AE Makhzani and Frey (2017), VLAE Chen et al. (2017), and VQ-VAE van den Oord et al. (2017) do not explicitly model a hierarchy of latents, they learn abstract representations capturing global structure of images Makhzani and Frey (2017); Chen et al. (2017) and speech signals van den Oord et al. (2017), hence internally representing the data in a hierarchical fashion.
Mechanisms for enforcing meta-priors.
We identify the following three mechanisms to enforce meta-priors:
Regularization of the encoding distribution (Section 3).
Choice of the encoding and decoding distribution or model family (Section 4).
Choice of a flexible prior distribution of the representation (Section 5).
For example, regularization of the encoding distribution is often used to encourage disentangled representations. Alternatively, factorizing the encoding and decoding distributions in a hierarchical fashion allows us to impose a hierarchical structure on the representation. Finally, a more flexible prior, say a mixture distribution, can be used to encourage clusterability.
In Section 2, we first review the variational autoencoder (VAE) framework, underlying most of the methods considered in this paper, and several techniques used to estimate divergences between probability distributions. We then present a detailed discussion of regularization-based methods in Section 3, review methods relying on structured encoding and decoding distributions in Section 4, and present methods using a structured prior distribution in Section 5. We conclude the review section by an overview of related methods such as cross-domain representation learning Liu et al. (2017); Lee et al. (2018); Gonzalez-Garcia et al. (2018) in Section 6. Finally, we provide a critique of unsupervised representation learning through the rate-distortion framework of Alemi et al. (2018) and discuss the implications in Section 7.
We assume familiarity with the key concepts in Bayesian data modeling. For a gentle introduction to VAEs we refer the reader to Doersch (2016). VAEs Kingma and Welling (2014); Rezende et al. (2014) aim to learn a parametric latent variable model $p_\theta(x|z)p(z)$ by maximizing the marginal log-likelihood of the training data. By introducing an approximate posterior $q_\phi(z|x)$, which approximates the intractable true posterior $p_\theta(z|x)$, one can upper-bound the negative log-likelihood as

$L_{\mathrm{VAE}} = \mathbb{E}_{\hat p(x)}\big[\mathbb{E}_{q_\phi(z|x)}[-\log p_\theta(x|z)]\big] + \mathbb{E}_{\hat p(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))\big],$  (1)

where $\mathbb{E}_{\hat p(x)}$ denotes the expectation w.r.t. the empirical data distribution $\hat p(x)$. The approach is illustrated in Figure 1. The first term in (1) measures the reconstruction error and the second term quantifies how well $q_\phi(z|x)$ matches the prior $p(z)$. The structure of the latent space heavily depends on this prior. As the KL divergence is non-negative, $-L_{\mathrm{VAE}}$ lower-bounds the marginal log-likelihood and is accordingly called the evidence lower bound (ELBO).
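For the common pairing of a diagonal-Gaussian posterior with a standard normal prior, the KL-term has a simple closed form; the following is an illustrative sketch in plain Python (names are ours, not from the paper):

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) summed over dimensions,
    using the closed form 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

# The KL vanishes exactly when the posterior equals the prior
# and grows as the posterior moves away from N(0, I).
print(kl_diag_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0
```

In a VAE implementation this quantity is averaged over the mini-batch and added to the reconstruction term.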
There are several design choices available: (1) the prior distribution on the latent space, $p(z)$, (2) the family of approximate posterior distributions, $q_\phi(z|x)$, and (3) the decoder distribution, $p_\theta(x|z)$. Ideally, the approximate posterior should be flexible enough to match the intractable true posterior $p_\theta(z|x)$. As we will see later, there are many available options for these design choices, leading to various trade-offs in terms of the learned representation.
In practice, the first term in (1) is estimated from samples $z \sim q_\phi(z|x)$, and gradients are backpropagated through the sampling operation using the reparametrization trick (Kingma and Welling, 2014, Section 2.3), enabling minimization of (1) via minibatch stochastic gradient descent (SGD). Depending on the choice of $p(z)$, the second term can either be computed in closed form or estimated from samples. The usual choice is $q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)^2))$, where $\mu_\phi$ and $\sigma_\phi$ are deterministic functions parametrized as neural networks, together with $p(z) = \mathcal{N}(z; 0, I)$, for which the KL-term in (1) can be computed in closed form (more complicated choices of $p(z)$ rarely allow closed-form computation). To handle such cases, we will briefly discuss two ways in which one can measure distances between distributions. We will focus on the intuition behind these techniques and provide pointers to detailed expositions.
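The reparametrization trick itself is a one-line transformation; a minimal sketch in plain Python (function name and interface are ours, not from any particular library):

```python
import math
import random

def reparam_sample(mu, log_var, rng=random):
    """Sample z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps with
    eps ~ N(0, I): the randomness is isolated in eps, so mu and log_var
    enter through a deterministic map that gradients can flow through."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```

In an autodiff framework the same transformation lets gradients reach the encoder parameters despite the sampling step.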
Adversarial density-ratio estimation.
Given a convex function $f$ with $f(1) = 0$, the $f$-divergence between distributions $p$ and $q$ is defined as

$D_f(p \,\|\, q) = \mathbb{E}_{q(x)}\big[f(p(x)/q(x))\big].$  (2)

For example, the choice $f(t) = t \log t$ corresponds to $D_{\mathrm{KL}}(p \,\|\, q)$. Given samples from $p$ and $q$, we can estimate the $f$-divergence using the density-ratio trick Nguyen et al. (2010); Sugiyama et al. (2012), popularized recently through the generative adversarial network (GAN) framework Goodfellow et al. (2014). The trick is to express $p$ and $q$ as conditional distributions, conditioned on a label $y$, and reduce the task to binary classification. In particular, let $p(x) = p(x|y=1)$, $q(x) = p(x|y=0)$, and consider a discriminator trained to predict the probability that its input is a sample from distribution $p$ rather than $q$, i.e., to predict $p(y=1|x)$. The density ratio can be expressed as

$\frac{p(x)}{q(x)} = \frac{p(x|y=1)}{p(x|y=0)} = \frac{p(y=1|x)}{p(y=0|x)},$  (3)

where the second equality follows from Bayes' rule under the assumption that the marginal class probabilities are equal. As such, given i.i.d. samples from $p$ and a trained discriminator $D(x) \approx p(y=1|x)$, one can estimate the KL-divergence by simply computing the sample average of $\log \frac{D(x)}{1 - D(x)}$.
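To illustrate the identity behind this estimator, the following sketch plugs the Bayes-optimal discriminator for two known Gaussians into the sample average of the log-ratio; in practice the discriminator would be a trained classifier (all names here are ours):

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kl_via_density_ratio(mu_p, mu_q, sigma, n=200_000, seed=0):
    """Monte-Carlo estimate of KL(p || q) as the average of log(D/(1-D))
    over samples from p, where D(x) = p(x) / (p(x) + q(x)) is the
    Bayes-optimal discriminator (in practice, a trained classifier)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_p, sigma)
        d = normal_pdf(x, mu_p, sigma) / (normal_pdf(x, mu_p, sigma) + normal_pdf(x, mu_q, sigma))
        total += math.log(d / (1.0 - d))
    return total / n

# For p = N(0,1) and q = N(1,1), the closed-form KL is 0.5.
print(kl_via_density_ratio(0.0, 1.0, 1.0))
```

The estimate converges to the closed-form value as the sample size grows; with an imperfect (trained) discriminator, the estimate is biased by the classifier's error.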
As a practical alternative, some approaches replace the KL term in (1) with an arbitrary divergence (e.g., maximum mean discrepancy). Note, however, that the resulting objective does not necessarily lower-bound the marginal log-likelihood of the data.
Maximum mean discrepancy (MMD) Gretton et al. (2012).
Intuitively, the distance between two distributions is computed as the distance between the mean embeddings of their features, as illustrated in Figure 2. More formally, let $k$ be a continuous, bounded, positive semi-definite kernel and $\mathcal{H}_k$ the corresponding reproducing kernel Hilbert space, induced by the feature mapping $\varphi(x) = k(x, \cdot)$. Then, the MMD of distributions $p$ and $q$ is

$\mathrm{MMD}(p, q) = \big\| \mathbb{E}_{p(x)}[\varphi(x)] - \mathbb{E}_{q(x)}[\varphi(x)] \big\|_{\mathcal{H}_k}.$

For example, setting $k(x, x') = \langle x, x' \rangle$ and hence $\varphi(x) = x$, the MMD reduces to the difference between the means, i.e., $\|\mathbb{E}_{p(x)}[x] - \mathbb{E}_{q(x)}[x]\|$. By choosing an appropriate mapping $\varphi$, one can estimate the divergence in terms of higher-order moments of the distributions.
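A finite-sample estimator of the squared MMD follows directly from expanding the definition into kernel evaluations; a sketch for scalar samples with an RBF kernel (names and the illustrative setting are ours):

```python
import math
import random

def rbf(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_sq(xs, ys, kernel=rbf):
    """Biased (V-statistic) estimator of
    MMD^2(p, q) = E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx - 2.0 * kxy + kyy

rng = random.Random(0)
p1 = [rng.gauss(0.0, 1.0) for _ in range(100)]
p2 = [rng.gauss(0.0, 1.0) for _ in range(100)]
q = [rng.gauss(2.0, 1.0) for _ in range(100)]
# Near zero for two samples from the same distribution,
# clearly positive when the distributions differ.
print(mmd_sq(p1, p2), mmd_sq(p1, q))
```

The estimator is differentiable in the samples, which is what makes the MMD usable as a regularizer on the aggregate posterior.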
MMD vs $f$-divergences in practice.
The MMD is known to work particularly well with multivariate standard normal distributions. It requires a sample size roughly on the order of the data dimensionality to yield reliable estimates. When used as a regularizer (see Section 3), it generally allows for stable optimization. A disadvantage is that it requires the choice of a kernel and its bandwidth parameter. In contrast, $f$-divergence estimators based on the density-ratio trick can in principle handle more complex distributions than the MMD. However, in practice they require adversarial training, which currently suffers from optimization issues. For more details consult (Tolstikhin et al., 2018, Section 3).
Some of the methods we review rely on deterministic encoders and decoders. We denote by $f_\phi$ and $g_\theta$ the deterministic encoder and decoder, respectively. A popular objective for training an autoencoder is to minimize the $\ell_2$-loss, namely

$L_{\mathrm{AE}} = \mathbb{E}_{\hat p(x)}\big[\| x - g_\theta(f_\phi(x)) \|_2^2\big].$  (4)

If $f_\phi$ and $g_\theta$ are linear maps and the representation $z = f_\phi(x)$ is lower-dimensional than $x$, (4) corresponds to principal component analysis (PCA), which leads to a representation with decorrelated entries. Furthermore, we obtain (4) by removing the KL-term from $L_{\mathrm{VAE}}$ in (1) and using a deterministic encoding distribution together with a Gaussian decoding distribution with mean $g_\theta(z)$ and fixed covariance. Therefore, the major difference between $L_{\mathrm{AE}}$ and $L_{\mathrm{VAE}}$ is that $L_{\mathrm{AE}}$ does not enforce a prior distribution on the latent space (e.g., through a KL-term), and minimizing $L_{\mathrm{AE}}$ hence does not yield a generative model.
3 Regularization-based methods
A classic approach to enforcing some meta-prior on the latent representation is to augment $L_{\mathrm{VAE}}$ with regularizers that act on the approximate posterior $q_\phi(z|x)$ and/or the aggregate (approximate) posterior $q_\phi(z) = \mathbb{E}_{\hat p(x)}[q_\phi(z|x)]$. The vast majority of recent work can be subsumed into an objective of the form

$L_{\mathrm{VAE}} + \lambda_1 \mathbb{E}_{\hat p(x)}\big[R_1(q_\phi(z|x))\big] + \lambda_2 R_2(q_\phi(z)),$  (5)

where $R_1$ and $R_2$ are regularizers and $\lambda_1, \lambda_2$ the corresponding weights. Firstly, we note that a key difference between the regularizers $R_1$ and $R_2$ is that the latter depends on the entire data set through $q_\phi(z)$. In principle, this prevents the use of mini-batch SGD to solve (5). In practice, however, one can often obtain good mini-batch-based estimates of $R_2$. Secondly, the regularizers bias $L_{\mathrm{VAE}}$ towards a looser (larger) upper bound on the negative marginal log-likelihood. From this perspective it is not surprising that many approaches yield lower reconstruction quality (which typically corresponds to a larger negative log-likelihood). For deterministic autoencoders, there is no such concept as an aggregated posterior, so we consider objectives of the same form with $L_{\mathrm{VAE}}$ replaced by $L_{\mathrm{AE}}$ and the regularizers applied to the codes $z = f_\phi(x)$.
In this section, we first review regularizers which can be computed in a fully unsupervised fashion (some of them optionally allow to include partial label information). Then, we turn our attention to regularizers which require supervision.
| Method | Base model | Supervision |
|---|---|---|
| β-VAE Higgins et al. (2017) | VAE | |
| VIB Alemi et al. (2016) | VAE | O |
| PixelGAN-AE Makhzani and Frey (2017) | VAE | O |
| InfoVAE Zhao et al. (2017a) | VAE | |
| Information dropout Achille and Soatto (2018) | VAE | O |
| HFVAE Esmaeili et al. (2018) | VAE | |
| FactorVAE Kim and Mnih (2018); Chen et al. (2018) | VAE | |
| DIP-VAE Kumar et al. (2018) | VAE | |
| HSIC-VAE Lopez et al. (2018) | VAE | O |
| VFAE Louizos et al. (2016) | VAE | ✓ |
| DC-IGN Kulkarni et al. (2015) | VAE | ✓ |
| FaderNetworks Lample et al. (2017); Hadad et al. (2018)² | AE | ✓ |
| AAE/WAE Makhzani et al. (2015); Tolstikhin et al. (2018) | AE | O |

² Lample et al. (2017); Hadad et al. (2018) do not enforce a prior on the latent distribution and therefore cannot generate unconditionally.
3.1 Unsupervised methods targeting disentanglement and independence
Disentanglement is a critical meta-prior considered by Bengio et al. (2013a). Namely, assuming the data is generated from a few statistically independent factors, uncovering those factors should be extremely useful for a plethora of downstream tasks. Examples of (approximately) independent factors underlying the data are the class, stroke thickness, and rotation of handwritten digits in the MNIST data set. Other popular data sets are the CelebA face data set Liu et al. (2015) (factors involve, e.g., hair color and facial attributes such as glasses), and synthetic data sets of geometric 2D shapes or rendered 3D shapes (e.g., 2D Shapes Higgins et al. (2017), 3D Shapes Kim and Mnih (2018), 3D Faces Paysan et al. (2009), 3D Chairs Aubry et al. (2014)) for which the data generative process and hence the ground truth factors are known (see Figure 4 for an example).
The main idea behind several recent works on disentanglement is to augment the loss with regularizers which encourage disentanglement of the latent variables. Formally, assume that the data $x$ depends on conditionally independent factors $v$, i.e., $p(v|x) = \prod_j p(v_j|x)$, and possibly on conditionally dependent factors $w$. The goal is to augment $L_{\mathrm{VAE}}$ such that the inference model $q_\phi(z|x)$ learns to predict $v$ (and $w$) and hence (partially) invert the data-generative process.
Disentanglement quality of inference models is typically evaluated based on ground truth factors of variation (if available). Specifically, disentanglement metrics measure how predictive the individual latent factors are for the ground-truth factors, see, e.g., Higgins et al. (2017); Kim and Mnih (2018); Kumar et al. (2018); Eastwood and Williams (2018); Chen et al. (2018); Ridgeway and Mozer (2018). While many authors claim that their method leads to disentangled representations, it is unclear what the proper notion of disentanglement is and how effective these methods are in the unsupervised setting (see Locatello et al. (2018) for a large-scale evaluation). We therefore focus on the concept motivating each method rather than claims on how well each method disentangles the factors underlying the data.
3.1.1 Reweighting the ELBO: β-VAE
Higgins et al. (2017) propose to weight the second term in (1) (henceforth referred to as the KL-term) by a coefficient $\beta > 1$,³ which can be seen as adding a regularizer equal to the KL-term with coefficient $\beta - 1$ to $L_{\mathrm{VAE}}$:

$L_{\beta\text{-VAE}} = L_{\mathrm{VAE}} + (\beta - 1)\, \mathbb{E}_{\hat p(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))\big].$  (6)

This type of regularization encourages $q_\phi(z|x)$ to better match the factorized prior $p(z)$, which in turn constrains the implicit capacity of the latent representation and encourages it to be factorized. Burgess et al. (2018) provide a thorough theoretical analysis of β-VAE based on the information bottleneck principle Tishby et al. (2000). Further, they propose to gradually decrease the regularization strength until good-quality reconstructions are obtained, as a robust procedure to adjust the tradeoff between reconstruction quality and disentanglement (for a hard-constrained variant of β-VAE).

³ Higgins et al. (2017) also explore $\beta < 1$ but discover that this choice does not lead to disentanglement.
3.1.2 Mutual information of $x$ and $z$: FactorVAE, β-TCVAE, InfoVAE
The KL-term in (1) can be decomposed as

$\mathbb{E}_{\hat p(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))\big] = I_q(x; z) + D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z)),$  (7)

where $I_q(x; z)$ is the mutual information of $x$ and $z$ w.r.t. the joint distribution $q_\phi(z|x)\hat p(x)$. The decomposition (7) was first derived by Hoffman and Johnson (2016); an alternative derivation can be found in Kim and Mnih (2018, Appendix C).
Kim and Mnih (2018) observe that the regularizer in (6) encourages $q_\phi(z)$ to be factorized (assuming $p(z)$ is a factorized distribution) by penalizing the second term in (7), but discourages the latent code from being informative by simultaneously penalizing the first term in (7). To reinforce only the former effect, they propose to regularize $L_{\mathrm{VAE}}$ with the total correlation of $q_\phi(z)$, a popular measure of dependence for multiple random variables Watanabe (1960). The resulting objective has the form

$L_{\mathrm{FactorVAE}} = L_{\mathrm{VAE}} + \gamma\, D_{\mathrm{KL}}\Big(q_\phi(z) \,\Big\|\, \textstyle\prod_j q_\phi(z_j)\Big),$  (8)
where the last term is the total correlation. To estimate it from samples, Kim and Mnih (2018) rely on the density ratio trick Nguyen et al. (2010); Sugiyama et al. (2012) which involves training a discriminator (see Section 2).
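To give a concrete sense of what the total correlation measures, it is available in closed form for a Gaussian; a sketch for the 2-D unit-variance case (this illustrative setting is ours, not from the paper):

```python
import math

def total_correlation_2d_gaussian(corr):
    """Total correlation TC(z) = KL( q(z) || prod_j q(z_j) ) for a 2-D
    Gaussian with unit variances and correlation coefficient corr.
    In closed form TC = 0.5 * log(1 / (1 - corr^2)): zero iff the two
    coordinates are independent, unbounded as |corr| -> 1."""
    return 0.5 * math.log(1.0 / (1.0 - corr * corr))

print(total_correlation_2d_gaussian(0.0))  # 0.0: independent coordinates
```

Penalizing the total correlation thus directly pushes the aggregate posterior towards a factorized distribution without penalizing the informativeness of the individual dimensions.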
Zhao et al. (2017a) start from an alternative way of writing $L_{\mathrm{VAE}}$:

$L_{\mathrm{VAE}} = D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z)}\big[D_{\mathrm{KL}}(q_\phi(x|z) \,\|\, p_\theta(x|z))\big],$  (9)

where $q_\phi(x|z) = q_\phi(z|x)\hat p(x) / q_\phi(z)$. Similarly to Kim and Mnih (2018), to encourage disentanglement, they propose to reweight the first term in (9) and to encourage a large mutual information between $x$ and $z$ by adding a regularizer proportional to $I_q(x; z)$ to (9). Further, by rearranging terms in the resulting objective, they arrive at

$L_{\mathrm{InfoVAE}} = \mathbb{E}_{\hat p(x)}\mathbb{E}_{q_\phi(z|x)}[-\log p_\theta(x|z)] + (1 - \alpha)\, \mathbb{E}_{\hat p(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))\big] + (\alpha + \lambda - 1)\, D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z)).$  (10)
For tractability reasons, Zhao et al. (2017a) propose to replace the last term in (10) by other divergences such as Jensen-Shannon divergence (implemented as a GAN Goodfellow et al. (2014)), Stein variational gradient Liu and Wang (2016), or MMD Gretton et al. (2012) (see Section 2).
Kumar et al. (2018) suggest matching the moments of the aggregated posterior $q_\phi(z)$ to those of a multivariate standard normal prior during optimization of $L_{\mathrm{VAE}}$ to encourage disentanglement of the latent variables. Specifically, they propose to match the covariance of $q_\phi(z)$ and $p(z)$ by penalizing their $\ell_2$-distance (which amounts to decorrelating the entries of $z \sim q_\phi(z)$), leading to the Disentangled Inferred Prior objective:

$L_{\mathrm{DIP\text{-}VAE}} = L_{\mathrm{VAE}} + \lambda_{\mathrm{od}} \sum_{i \neq j} \big[\mathrm{Cov}_{q_\phi(z)}[z]\big]_{ij}^2 + \lambda_d \sum_i \Big(\big[\mathrm{Cov}_{q_\phi(z)}[z]\big]_{ii} - 1\Big)^2.$  (11)
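A DIP-VAE-style covariance penalty computed from a batch of encoder outputs can be sketched as follows (an illustrative reimplementation with our own names, not the authors' code):

```python
def dip_cov_penalty(z_batch, lambda_od=1.0, lambda_d=1.0):
    """Penalize the batch covariance of the encoder outputs for deviating
    from the identity: squared off-diagonal entries (decorrelation) plus
    squared deviation of diagonal entries from 1 (unit variances)."""
    n, d = len(z_batch), len(z_batch[0])
    mean = [sum(row[j] for row in z_batch) / n for j in range(d)]
    cov = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in z_batch) / n
            for j in range(d)] for i in range(d)]
    off_diag = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    diag = sum((cov[i][i] - 1.0) ** 2 for i in range(d))
    return lambda_od * off_diag + lambda_d * diag
```

In training, this term is simply added to the ELBO for each mini-batch; since it involves only first and second moments, no adversarial training or kernel choice is needed.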
3.1.3 Independence between groups of latents: HSIC-VAE, HFVAE
Groups/clusters, potentially involving hierarchies, is a structure prevalent in many data sets. It is therefore natural to take this structure into account when learning disentangled representations, as seen next.
HSIC-VAE Lopez et al. (2018) employs the Hilbert-Schmidt Independence Criterion (HSIC), a kernel-based measure of statistical dependence, to encourage independence between groups of latent variables $z_{G_1}$ and $z_{G_2}$, using the regularizer $\mathrm{HSIC}(q_\phi(z_{G_1}), q_\phi(z_{G_2}))$ (12) (an estimator of HSIC is defined in (A) in Appendix A). This is in contrast to the methods Kim and Mnih (2018); Chen et al. (2018); Zhao et al. (2017a); Kumar et al. (2018) penalizing statistical dependence of all individual latent variables. In addition to controlling (in)dependence relations of the latent variables, the HSIC can be used to remove sensitive information $s$, provided as labels with the training data, from the latent representation by using the regularizer $\mathrm{HSIC}(q_\phi(z), \hat p(s))$ (where $\hat p(s)$ is estimated from samples), as extensively explored by Louizos et al. (2016) (see Section 3.4).
Starting from the decomposition (7), Esmaeili et al. (2018) hierarchically decompose the $D_{\mathrm{KL}}(q_\phi(z) \,\|\, p(z))$ term in (7) into a term regularizing the dependencies between groups of latent variables and terms regularizing the dependencies between the random variables within each group. Reweighting the different regularization terms allows one to encourage different degrees of intra- and inter-group disentanglement, leading to the HFVAE objective (13). Here, one coefficient controls the mutual information between the data and the latent variables, while two further coefficients determine the regularization of dependencies between groups and within groups, respectively, by penalizing the corresponding total correlation. Note that the grouping can be nested to introduce deeper hierarchies.
3.2 Preventing the latent code from being ignored: PixelGAN-AE and VIB
Makhzani and Frey (2017) argue that, if the decoder $p_\theta(x|z)$ is not too powerful (in the sense that it cannot model the data distribution unconditionally, i.e., without using the latent code $z$), the $I_q(x;z)$ term in (7) and the reconstruction term in (1) have competing effects: A small mutual information makes reconstruction of $x$ from $z$ challenging for the decoder, leading to a large reconstruction error. Conversely, a small reconstruction error requires the code to be informative and hence $I_q(x;z)$ to be large. In contrast, if the decoder is powerful, e.g., a conditional PixelCNN van den Oord et al. (2016), such that it can obtain a small reconstruction error without relying on the latent code, the mutual information and reconstruction terms can be minimized largely independently, which prevents the latent code from being informative and hence from providing a useful representation (this issue is known as the information preference property Chen et al. (2017) and is discussed in more detail in Section 4). In this case, to encourage the code to be informative, Makhzani and Frey (2017) propose to drop the $I_q(x;z)$ term in (7), which can again be seen as adding a regularizer $-I_q(x;z)$ to $L_{\mathrm{VAE}}$ (14). The term remaining in (7) after removing $I_q(x;z)$ is approximated using a GAN. Makhzani and Frey (2017) show that a model relying on a powerful PixelCNN decoder can be trained while keeping the latent code informative. Depending on the choice of the prior (categorical or Gaussian), the latent code picks up information at different levels of abstraction, for example the digit class and the writing style in the case of MNIST.
VIB, information dropout.
Alemi et al. (2016) and Achille and Soatto (2018) both derive a variational approximation of the information bottleneck objective Tishby et al. (2000), which targets learning a compact representation $z$ of some random variable $x$ that is maximally informative about some random variable $y$. In the special case $y = x$, the approximation derived in Alemi et al. (2016) yields an objective equivalent to $L_{\mathrm{VAE}}$ in (1) (c.f. (Alemi et al., 2016, Appendix B) for a discussion), whereas the approximation of Achille and Soatto (2018) leads to the Information Dropout objective (15). Achille and Soatto (2018) derive (more) tractable expressions for (15) and establish a connection to dropout for particular choices of the encoding distribution and the prior. Alemi et al. (2018) propose an information-theoretic framework studying the representation learning properties of VAE-like models through a rate-distortion tradeoff. This framework recovers β-VAE but allows for a more precise navigation of the feasible rate-distortion region than the latter. Alemi and Fischer (2018) further generalize the framework of Alemi et al. (2016), as discussed in Section 7.
3.3 Deterministic encoders and decoders: AAE and WAE
Adversarial Autoencoders (AAEs) Makhzani et al. (2015) turn a standard autoencoder into a generative model by imposing a prior distribution $p(z)$ on the latent variables, penalizing a statistical divergence between $p(z)$ and the aggregate posterior $q_\phi(z)$ using a GAN. Specifically, using the negative log-likelihood as reconstruction loss, the AAE objective combines the reconstruction term from (1) with an adversarially estimated divergence between $q_\phi(z)$ and $p(z)$ (16). In all experiments in Makhzani et al. (2015), encoder and decoder are taken to be deterministic, i.e., $q_\phi(z|x)$ and $p_\theta(x|z)$ are replaced by deterministic maps $f_\phi$ and $g_\theta$, respectively, and the negative log-likelihood in (16) is replaced with the standard autoencoder loss (4). The advantage of implementing the regularizer using a GAN is that any prior $p(z)$ we can sample from can be matched. This is helpful for learning representations: For example for MNIST, enforcing a prior that involves both categorical and Gaussian latent variables is shown to disentangle discrete and continuous style information in unsupervised fashion, in the sense that the categorical latent variables model the digit index and the continuous random variables the writing style. Disentanglement can be improved by leveraging (partial) label information, regularizing the cross-entropy between the categorical latent variables and the label one-hot encodings. Partial label information also allows one to learn a generative model for digits with a Gaussian mixture model prior, with every mixture component corresponding to one digit index.
3.4 Supervised methods: VFAEs, FaderNetworks, and DC-IGN
Variational Fair Autoencoders (VFAEs) Louizos et al. (2016) assume a likelihood of the form $p_\theta(x|z_1, s)$, where $s$ models (categorical) latent factors one wants to remove (for example sensitive information), and $z_1$ models the remaining latent factors. By using an approximate posterior of the form $q_\phi(z_1|x, s)$ and by imposing a factorized prior $p(z_1)p(s)$, one can encourage independence of $z_1$ and $s$. However, $z_1$ might still contain information about $s$, in particular in the (semi-)supervised setting where $z_1$ encodes label information $y$ that might be correlated with $s$, as well as additional factors of variation $z_2$ (this setup was first considered in Kingma et al. (2014); see Section 4). To mitigate this issue, Louizos et al. (2016) propose to add an MMD-based regularizer to $L_{\mathrm{VAE}}$, encouraging independence between $z_1$ and $s$ by penalizing the MMD between the aggregate posteriors of $z_1$ conditioned on the different values of $s$ (17). To reduce the computational complexity of the MMD, the authors propose to use random Fourier features Rahimi and Recht (2008). Lopez et al. (2018) also consider the problem of censoring side information, but use the HSIC regularizer instead of the MMD. In contrast to the MMD, the HSIC is amenable to side information with non-categorical distribution. Furthermore, it is shown in Lopez et al. (2018, Appendix E) that the VFAE and HSIC regularizers are equivalent for censoring in case $s$ is a binary random variable.
Given data annotated with attributes $y$ (e.g., facial attributes such as hair color or whether glasses are present, encoded as a binary vector), the encoder $E$ of a FaderNetwork Lample et al. (2017) is adversarially trained to learn a feature representation $E(x)$ invariant to the attribute values, while the decoder $D$ reconstructs the original image from $E(x)$ and $y$. The resulting model is able to manipulate the attributes of a test image (without known attribute information) by setting the entries of $y$ at the input of $D$ as desired. In particular, it allows for continuous control of the attributes (by choosing non-integer attribute values in $[0, 1]$).

To make $E(x)$ invariant to $y$, a discriminator predicting the probabilities $p(y|E(x))$ of the attribute vector from the code is trained concurrently with $E$ and $D$ to maximize the log-likelihood $\log p(y|E(x))$. This discriminator is used adversarially in the training of $E$, encouraging $E$ to produce a latent code from which it is difficult to predict $y$, via the regularizer $\log p(1 - y \,|\, E(x))$ (18), i.e., the regularizer encourages $E$ to produce codes for which the discriminator assigns a high likelihood to incorrect attribute values.
Hadad et al. (2018) propose a method similar to FaderNetworks that first separately trains an encoder $E_1$ jointly with a classifier to predict $y$. The code produced by $E_1$ is then concatenated with that produced by a second encoder $E_2$ and fed to the decoder $D$. $E_2$ and $D$ are then jointly trained for reconstruction (while keeping $E_1$ fixed), and the output of $E_2$ is regularized as in (18) to ensure that the two codes are disentangled. While the model from Hadad et al. (2018) does not allow fader-like control of attributes, it provides a representation that facilitates swapping and interpolation of attributes, and can be used for retrieval. Note that, in contrast to all previously discussed methods, neither of these two techniques provides a mechanism for unconditional generation.
Kulkarni et al. (2015) assume that the training data is generated by an interpretable, compact graphics code and aim to recover this code from the data using a VAE. Specifically, they consider data sets of rendered object images for which the underlying graphics code consists of extrinsic latent variables—object rotation and light source position—and intrinsic latent variables, modeling, e.g., object identity and shape. Assuming supervision in terms of which latent factors are active (relative to some reference value), a representation disentangling intrinsic and the different extrinsic latent variables is learned by optimizing $L_{\mathrm{VAE}}$ on different types of mini-batches (which can be seen as implicit regularization): mini-batches containing images for which all but one of the extrinsic factors are fixed, and mini-batches containing images with fixed extrinsic factors but varying intrinsic factors. During the forward pass, the latent variables predicted by the encoder that correspond to fixed factors are replaced with the mini-batch average, to force the decoder to explain all the variance in the mini-batch through the varying latent variables. In the backward step, gradients are passed through the latent space ignoring the averaging operation. This procedure makes it possible to learn disentangled representations for rendered 3D faces and chairs that allow control of the extrinsic factors similarly to a rendering engine. The models generalize to unseen object identities.
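The mini-batch averaging (clamping) step in the forward pass can be sketched as follows (an illustration with our own names, ignoring the gradient bookkeeping handled by the backward pass):

```python
def clamp_inactive_latents(z_batch, active_idx):
    """In a mini-batch where only one generative factor varies, replace
    every latent dimension except the active one by its batch mean, forcing
    the decoder to explain all in-batch variation via the active latent."""
    n, d = len(z_batch), len(z_batch[0])
    means = [sum(row[j] for row in z_batch) / n for j in range(d)]
    return [[row[j] if j == active_idx else means[j] for j in range(d)]
            for row in z_batch]

# Dimension 0 stays per-sample; dimension 1 is averaged across the batch.
print(clamp_inactive_latents([[1.0, 2.0], [3.0, 4.0]], 0))  # [[1.0, 3.0], [3.0, 3.0]]
```

In the backward pass, gradients are routed as if the averaging had not taken place (a straight-through-style choice), as described above.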
4 Factorizing the encoding and decoding distributions
Figure (a) shows an example VAE with hierarchical encoding distribution and PixelCNN decoding distribution. Figure (b) gives an overview of the factorizations used by different models. We indicate the structure of the encoding (ENC) and decoding (DEC) distributions as follows: (H) hierarchical, (A) autoregressive, (default) fully connected or convolutional feed-forward neural network. We indicate the prior distribution as follows: (N) multivariate standard normal, (C) categorical, (M) mixture distribution, (G) graphical model, (L) learned prior. The last column (Y) indicates whether supervision is used.
Besides regularization, another popular way to impose a meta-prior is to factorize the encoding and/or decoding distribution in a certain way (see Figure 5 for an overview). This translates directly or indirectly into a particular choice of the model class/network architecture underlying these distributions. Concrete examples are hierarchical architectures and architectures with constrained receptive fields. This can be seen as imposing hard constraints on the learning problem, rather than regularization as discussed in the previous section. While this is not often done in the literature, one could obviously combine a specific structured model architecture with some regularizer, for example to learn a disentangled hierarchical representation. Choosing a certain model class/architecture is not only interesting from a representation point of view, but also from a generative modeling perspective. Indeed, certain model classes/architectures allow one to better optimize $L_{\mathrm{VAE}}$, ultimately leading to a better generative model.
Kingma et al. (2014) harness the VAE framework for semi-supervised learning. Specifically, in the "M2 model", the latent code is divided into two parts $z_1$ and $y$, where $y$ is (typically discrete) label information observed for a subset of the training data. More specifically, the inference model takes the form $q_\phi(z_1, y|x) = q_\phi(z_1|x, y)\, q_\phi(y|x)$, i.e., there is a hierarchy between $z_1$ and $y$. During training, for samples for which a label is available, the inference model is conditioned on $y$ and $L_{\mathrm{VAE}}$ is adapted accordingly, and for samples without a label, $y$ is inferred from $q_\phi(y|x)$. This model hence effectively disentangles the latent code into the two parts $z_1$ and $y$, and allows for semi-supervised classification as well as controlled generation by holding one of the factors fixed and sampling the other. It can optionally be combined with an additional model learned in unsupervised fashion to obtain an additional level of hierarchy (termed the "M1 + M2 model" in Kingma et al. (2014)).
Analyzing the VAE framework through the lens of Bits-Back coding Hinton and Van Camp (1993); Honkela and Valpola (2004), Chen et al. (2017) identify the so-called information preference property: The second term in (1) encourages the latent code to only store the information that cannot be modeled locally (i.e., unconditionally, without using the latent code) by the decoding distribution. As a consequence, when the decoding distribution is a powerful autoregressive model such as conditional PixelRNN van den Oord et al. (2016) or PixelCNN van den Oord et al. (2016), the latent code will not be used to encode any information and will perfectly match the prior, as previously observed by many authors. While this is not necessarily an issue in the context of generative modeling (where the goal is to maximize test log-likelihood), it is problematic from a representation learning point of view as one wants the latent code to store meaningful information. To overcome this issue, Chen et al. (2017) propose to adapt the structure of the decoding distribution such that it cannot model the information one would like to store, and term the resulting model variational lossy autoencoder (VLAE). For example, to encourage the latent code to capture global high-level information, while letting the decoding distribution model local information such as texture, one can use an autoregressive decoding distribution whose receptive field for each pixel is restricted to a small window centered at that pixel, so that it cannot model long-range spatial dependencies. Besides the implications of the information preference property for representation learning, Chen et al. (2017) also explore the orthogonal direction of using a learned prior based on autoregressive flow Kingma et al. (2016) to improve the generative modeling capabilities of VLAE.
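The restricted receptive field can be made concrete with a small mask-construction sketch (the function name and window size are ours): each output pixel of the autoregressive decoder may only look at previously generated pixels, in raster order, inside a small local window, so any long-range structure must be carried by the latent code.

```python
import numpy as np

# Sketch of a "lossy" local autoregressive receptive field (illustrative):
# a k x k convolution mask that permits looking only at pixels strictly
# before the center pixel in raster-scan order within the window.
def causal_local_mask(k):
    """Return a k x k 0/1 mask: 1 where the kernel may attend (rows above
    the center, plus same-row pixels to the left); the center itself and
    everything after it are masked out."""
    mask = np.zeros((k, k))
    c = k // 2
    mask[:c, :] = 1.0   # all rows above the center row
    mask[c, :c] = 1.0   # same row, strictly left of center
    return mask

mask = causal_local_mask(5)
```

Stacking convolutions with such masks keeps the decoder's view local and causal, which is the structural constraint the VLAE argument relies on.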
PixelVAEs Gulrajani et al. (2017) use a VAE with feed-forward convolutional encoder and decoder, combining the decoder with a (shallow) conditional PixelCNN van den Oord et al. (2016) to predict the output probabilities. Furthermore, they employ a hierarchical encoder and decoder structure with multiple levels of latent variables. In more detail, the encoding and decoding distributions are factorized as $q(z|x) = \prod_{l=1}^{L} q(z_l|x)$ and $p(x, z) = p(x|z_1)\, p(z_L) \prod_{l=1}^{L-1} p(z_l|z_{l+1})$. Here, the $z_l$ are groups of latent variables (rather than individual entries of $z$), the $q(z_l|x)$ are parametric distributions (typically Gaussian with diagonal covariance matrix) whose parameters are predicted from different layers of the same CNN (with layer index increasing in $l$), $p(x|z_1)$ is a conditional PixelCNN, and the factors $p(z_l|z_{l+1})$ are realized by feed-forward convolutional networks. From a representation learning perspective, this approach leads to the extraction of high- and low-level features, on one hand allowing for controlled generation of local and global structure, and on the other hand resulting in better clustering of the codes according to classes in the case of multi-class data. From a generative modeling perspective, this approach obtains test likelihoods competitive with or better than those of computationally more complex (purely autoregressive) PixelCNN and PixelRNN models. Only two stochastic layers are explored experimentally.
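A toy sketch of the PixelVAE-style encoder factorization may help (layer sizes and names are our own assumptions): one shared feed-forward trunk is evaluated once, and the parameters of each diagonal-Gaussian factor $q(z_l|x)$ are read off from a different depth of that trunk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hierarchical encoder: each level l predicts (mu_l, log_var_l)
# of q(z_l | x) from layer l of a single shared network, so deeper latents
# are computed from progressively more abstract features.
def hierarchical_encoder(x, n_levels=3, hidden=8, z_dim=2):
    h, params = x, []
    for l in range(n_levels):
        W = rng.normal(size=(h.size, hidden)) / np.sqrt(h.size)
        h = np.tanh(h @ W)                 # shared trunk, layer l
        Wp = rng.normal(size=(hidden, 2 * z_dim))
        mu, log_var = np.split(h @ Wp, 2)  # per-level Gaussian parameters
        params.append((mu, log_var))
    return params                          # one (mu, log_var) per q(z_l | x)

params = hierarchical_encoder(rng.normal(size=4))
```

Since every factor conditions only on $x$ (through the shared trunk), inference over all levels is a single bottom-up pass, in contrast to the top-down scheme of Ladder VAEs discussed next.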
In contrast to PixelVAEs, Ladder VAEs (LVAEs) Sønderby et al. (2016) perform top-down inference, i.e., the encoding distribution is factorized as $q(z|x) = q(z_L|x) \prod_{l=1}^{L-1} q(z_l|z_{l+1}, x)$, while using the same factorization for $p(x, z)$ as PixelVAE (although employing a simple factorized Gaussian distribution for $p(x|z_1)$ instead of a PixelCNN). The $q(z_l|z_{l+1}, x)$ are parametrized Gaussian distributions whose parameters are inferred top-down using a precision-weighted combination of (i) bottom-up predictions from different layers of the same feed-forward encoder CNN (similarly as in PixelVAE) with (ii) top-down predictions obtained by sampling from the hierarchical distribution $p(z)$ (see (Sønderby et al., 2016, Figure 1b) for the corresponding graphical model representation). When trained with a suitable warm-up procedure, LVAEs are capable of effectively learning deep hierarchical latent representations, as opposed to hierarchical VAEs with bottom-up inference models, which usually fail to learn meaningful representations with more than two levels (see (Sønderby et al., 2016, Section 3.2)).
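The precision-weighted combination of the bottom-up and top-down Gaussian estimates is the standard product-of-Gaussians update; a minimal sketch (variable names are ours):

```python
import numpy as np

# Precision-weighted fusion as used in Ladder VAE inference: combine a
# bottom-up estimate (mu_b, var_b) and a top-down estimate (mu_t, var_t)
# of the same latent into a single Gaussian, weighting each by its
# precision (inverse variance).
def precision_weighted(mu_b, var_b, mu_t, var_t):
    prec_b, prec_t = 1.0 / var_b, 1.0 / var_t
    var = 1.0 / (prec_b + prec_t)
    mu = var * (mu_b * prec_b + mu_t * prec_t)
    return mu, var

# Equal precisions: the fused mean is the average, the variance is halved.
mu, var = precision_weighted(np.array([0.0]), np.array([1.0]),
                             np.array([2.0]), np.array([1.0]))
```

The update naturally lets the more confident (higher-precision) path dominate: a very uncertain top-down prediction barely shifts the bottom-up estimate, and vice versa.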
Variational Ladder AutoEncoders.
Yet another approach is taken by Variational Ladder autoencoders (VLaAEs) Zhao et al. (2017b): While no explicit hierarchical factorization of $p(z)$ in terms of the $z_l$ is assumed, $p(x|z)$ is implemented as a feed-forward neural network that implicitly defines a top-down hierarchy among the $z_l$ by taking each $z_l$ as input at a different layer, with the layer index proportional to $l$. $p(x|z)$ is set to a fixed-variance factored Gaussian whose mean vector is predicted from $z$. For the encoding distribution, the same factorization and a similar implementation as that of PixelVAE is used. Implicitly encoding a hierarchy into $p(x|z)$, rather than explicitly into $p(z)$ as done by PixelVAE and LVAE, avoids the difficulties described by Sønderby et al. (2016) involved with training hierarchical models with more than two levels of latent variables. Furthermore, Zhao et al. (2017b) demonstrate that this approach leads to a disentangled hierarchical representation, for instance separating stroke width, digit width and tilt, and digit class when applied to MNIST.
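The implicit hierarchy can be sketched as a decoder that injects each latent group at a different depth (all sizes and names below are our own toy choices): the deepest injection point receives the most abstract latent, and no factorized prior over the $z_l$ is ever specified.

```python
import numpy as np

rng = np.random.default_rng(0)

# VLaAE-style decoder sketch (illustrative): z_1, ..., z_L are concatenated
# into the hidden state at successive layers, deepest latent first, which
# implicitly orders them from abstract to low-level factors.
def decoder(zs, hidden=8, x_dim=4):
    h = np.zeros(hidden)
    for z in reversed(zs):                  # deepest latent enters first
        W = rng.normal(size=(hidden + z.size, hidden))
        h = np.tanh(np.concatenate([h, z]) @ W)
    W_out = rng.normal(size=(hidden, x_dim))
    return h @ W_out                        # mean of a fixed-variance Gaussian

zs = [rng.normal(size=2) for _ in range(3)]  # z_1, z_2, z_3
x_mean = decoder(zs)
```

Because the hierarchy lives in the wiring of $p(x|z)$ rather than in a chain of conditional priors, there is no deep stack of stochastic layers to optimize through, which is the training difficulty this design sidesteps.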
5 Structured prior distribution
Instead of structuring the encoding or decoding distribution, one can also encourage certain meta-priors by directly choosing the prior distribution $p(z)$ of the generative model. For example, relying on a prior involving both discrete and continuous random variables encourages them to model different types of factors, such as the digit class and the writing style, respectively, in the MNIST data set, which can be seen as a form of clustering. This is arguably the most explicit way to shape a representation, as the prior directly acts on its distribution.
5.1 Graphical model prior
One of the first attempts to learn latent variable models with structured prior distributions using the VAE framework is Johnson et al. (2016). Concretely, a latent distribution $p(z)$ with general graphical model structure can capture discrete mixture models such as Gaussian mixture models, linear dynamical systems, and switching linear dynamical systems, among others. Unlike many other VAE-based works, Johnson et al. (2016) rely on a fully Bayesian framework including hyperpriors for the likelihood/decoding distribution and the structured latent distribution. While such a structured $p(z)$ allows for efficient inference (e.g., using message passing algorithms) when the likelihood is an exponential family distribution, inference becomes intractable when the decoding distribution is parametrized through a neural network, as commonly done in the VAE framework, which is why the latter includes an approximate posterior/encoding distribution. To combine the tractability of conjugate graphical model inference with the flexibility of VAEs, Johnson et al. (2016) employ inference models that output conjugate graphical model potentials Wainwright and Jordan (2008) instead of the parameters of the approximate posterior distribution. In particular, these potentials are chosen such that they have a form conjugate to the exponential family, hence allowing for efficient inference when combined with the structured $p(z)$. The resulting algorithm is termed structured VAE (SVAE). Experiments show that SVAE with a Gaussian mixture prior learns a generative model whose latent mixture components reflect clusters in the data, and that SVAE with a switching linear dynamical system prior learns a representation that reflects behavioral state transitions in motion recordings of mice.
Narayanaswamy et al. (2017) consider latent distributions with graphical model structure similar to Johnson et al. (2016), but they also incorporate partial supervision for some of the latent variables as in Kingma et al. (2014). However, unlike Kingma et al. (2014), which assumes a posterior of the form $q(y, z|x) = q(z|x, y)q(y|x)$, they do not assume a specific factorization between the partially observed latent variables $y$ and the unobserved ones $z$ (neither for $q(y, z|x)$ nor for the marginals $q(y|x)$ and $q(z|x)$), and no particular distributional form of $q(y|x)$ and $q(z|x)$. To perform inference for $q(y, z|x)$ with arbitrary dependence structure, Narayanaswamy et al. (2017) derive a new Monte Carlo estimator. The proposed approach is able to disentangle digit index and writing style on MNIST with partial supervision of the digit index (similar to Kingma et al. (2014)). Furthermore, this approach can disentangle identity and lighting direction of face images with partial supervision, assuming a product of categorical and continuous distributions, respectively, for the prior (using the Gumbel-Softmax estimator Jang et al. (2017); Maddison et al. (2016) to model the categorical part of the approximate posterior).
5.2 Discrete latent variables
JointVAE Dupont (2018) equips the $\beta$-VAE framework with heterogeneous latent variable distributions by concatenating continuous latent variables $z$ with discrete ones $c$ for improved disentanglement of different types of latent factors. The corresponding approximate posterior is factorized as $q(z, c|x) = q(z|x)q(c|x)$, and the Gumbel-Softmax estimator Jang et al. (2017); Maddison et al. (2016) is used to obtain a differentiable relaxation of the categorical distribution $q(c|x)$. The regularization strength in (a constrained variant of) the $\beta$-VAE objective (6) is gradually increased during training, possibly assigning different weights to the regularization terms corresponding to the discrete and continuous random variables (the regularization term in (6) decomposes as $D_{KL}(q(z|x)\|p(z)) + D_{KL}(q(c|x)\|p(c))$). Numerical results (based on visual inspection) show that the discrete latent variables naturally model discrete factors of variation, such as digit class in MNIST or garment type in Fashion-MNIST, and hence disentangle such factors better than models with continuous latent variables only.
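The Gumbel-Softmax relaxation at the heart of JointVAE's discrete branch is short enough to sketch directly (a minimal numpy version; the temperature value is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal Gumbel-Softmax sketch (Jang et al., 2017; Maddison et al., 2016):
# returns a differentiable, approximately one-hot sample from a categorical
# distribution with the given logits; tau is the relaxation temperature
# (smaller tau -> samples closer to exact one-hot vectors).
def gumbel_softmax(logits, tau=0.5):
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

sample = gumbel_softmax(np.log(np.array([0.7, 0.2, 0.1])))
```

Because the output is a smooth function of the logits, gradients flow through the sampling step, which is what makes the categorical factor $q(c|x)$ trainable inside the VAE objective.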
van den Oord et al. (2017) realize a VAE with discrete latent space structure using vector quantization, termed VQ-VAE. Each latent variable is taken to be a categorical random variable with $K$ categories, and the approximate posterior is assumed deterministic. Each category is associated with an embedding vector $e_j \in \mathbb{R}^D$. The embedding operation induces an additional latent space dimension of size $D$. For example, if the latent representation is an $m \times m$ feature map, the embedded latent representation is an $m \times m \times D$ feature map. The distribution $q(z|x)$ is implemented using a deterministic encoder network with $D$-dimensional output, quantized w.r.t. the embedding vectors $e_j$. In summary, we have