Improving Disentangled Representation Learning with the Beta Bernoulli Process

09/03/2019 ∙ by Prashnna Kumar Gyawali, et al. ∙ 0

To improve the ability of VAE to disentangle in the latent space, existing works mostly focus on enforcing independence among the learned latent factors. However, the ability of these models to disentangle often decreases as the complexity of the generative factors increases. In this paper, we investigate the little-explored effect of the modeling capacity of a posterior density on the disentangling ability of the VAE. We note that the independence within and the complexity of the latent density are two different properties we constrain when regularizing the posterior density: while the former promotes the disentangling ability of VAE, the latter -- if overly limited -- creates an unnecessary competition with the data reconstruction objective in VAE. Therefore, if we preserve the independence but allow richer modeling capacity in the posterior density, we will lift this competition and thereby allow improved independence and data reconstruction at the same time. We investigate this theoretical intuition with a VAE that utilizes a non-parametric latent factor model, the Indian Buffet Process (IBP), as a latent density that is able to grow with the complexity of the data. Across three widely-used benchmark data sets and two clinical data sets little explored for disentangled learning, we qualitatively and quantitatively demonstrated the improved disentangling performance of IBP-VAE over the state of the art. In the latter two clinical data sets riddled with complex factors of variations, we further demonstrated that unsupervised disentangling of nuisance factors via IBP-VAE -- when combined with a supervised objective -- can not only improve task accuracy in comparison to relevant supervised deep architectures but also facilitate knowledge discovery related to task decision-making. A shorter version of this work will appear in the ICDM 2019 conference proceedings.




I Introduction

An inherent goal in deep learning is to distill task-relevant latent representations that are invariant to other nuisance factors in the data. State-of-the-art deep neural networks achieve this by careful engineering of the network architecture, along with supervised training with a large number of task labels. However, the effectiveness of supervised training relies heavily on data quantity and label quality, especially in data with a wide range of data-specific factors of variation. Moreover, interpreting the results of these networks, which is important in areas such as clinical tasks, remains challenging.


Unsupervised disentangled representation learning provides a task-agnostic approach to learn latent generative factors that are semantically interpretable and mutually invariant [17, 9, 20, 26, 10]. Many of the recent successes in this area are based on variational autoencoders (VAE), which modernize variational inference by using neural networks to parameterize both the likelihood of data $x$ given latent variable $z$, $p_\theta(x|z)$, and the approximated posterior density of $z$, $q_\phi(z|x)$ [23, 35]. The objective of VAE training is thus to maximize the variational evidence lower bound (ELBO) of the marginal data likelihood:

$$\log p(x) \geq \mathcal{L} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) \quad (1)$$

where the first term can be interpreted as data reconstruction, while the second penalty term constrains the approximated posterior density to be similar to a prior $p(z)$ by minimizing their Kullback-Leibler (KL) divergence.
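The two terms of (1) can be sketched numerically as follows. This is a minimal illustration rather than the implementation used in this work: it assumes a diagonal Gaussian posterior, a standard normal prior, and a Bernoulli likelihood, and it evaluates the reconstruction term with a single decoder output. The function names and the `beta` weight (which mimics the heavier KL penalty of $\beta$-VAE [17]) are chosen here for illustration only.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def bernoulli_log_likelihood(x, x_prob, eps=1e-7):
    """log p(x|z) for binary data, summed over data dimensions."""
    x_prob = np.clip(x_prob, eps, 1.0 - eps)
    return np.sum(x * np.log(x_prob) + (1.0 - x) * np.log(1.0 - x_prob), axis=-1)

def elbo(x, x_prob, mu, log_var, beta=1.0):
    """Reconstruction term minus the (optionally re-weighted) KL penalty."""
    return bernoulli_log_likelihood(x, x_prob) - beta * gaussian_kl(mu, log_var)
```

With `beta=1.0` this is the single-sample estimate of (1); a larger `beta` trades reconstruction quality for a posterior closer to the independent prior.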

To improve disentangled learning in VAE, the primary focus has been on enforcing independence among the learned latent factors, achieved by more heavily penalizing the distance from $q_\phi(z|x)$ [17] or its marginal density $q_\phi(z)$ [26] to a prior $p(z)$ that is independent among dimensions. This may be strengthened by an explicit independence penalty on $q_\phi(z)$, either added to the ELBO [20] or isolated from the ELBO through total-correlation decomposition [9]. These investigations, however, are carried out in the context of a Gaussian approximation of the posterior density, limiting its ability to model generative factors with increased complexity [17].

In parallel, enabling richer posterior approximations has been an active topic of interest for improving data reconstructions in VAE [40, 24]. This is often achieved by designing more complex densities $p(z)$ and/or $q_\phi(z|x)$ with increased modeling power. Their effect on the disentangling ability of the VAE, however, has not been considered.

In this work, we investigate the little-explored relationship between the modeling capacity of a posterior density and the disentangling ability of the VAE. Following [9], we note that when constraining a (marginal) posterior density to an independent prior, we enforce two effects: the independence among the latent factors, and the complexity of the density. The former has to do with disentangling, while the latter affects the modeling capacity of VAE: when enforcing an independent density with limited modeling capacity, the latter creates an unnecessary tension with the reconstructing objective (data likelihood). Therefore, alternative to directly reinforcing the independence, we rationalize that a richer modeling capacity will indirectly improve disentangling by reducing this tension.

Formally, we hypothesize that an independent latent factor model with increased modeling capacity will improve disentangled learning of generative factors with increased complexity. We investigate this theoretical intuition with a VAE model that utilizes a non-parametric Bayesian latent factor model, the Beta-Bernoulli process implemented via the Indian Buffet process (IBP) [15], to model an unbounded number of mutually independent latent factors. We first evaluate this IBP-VAE on three benchmark data sets (color-augmented MNIST [27], 3D Chairs [3], and dSprites [32]), where we qualitatively and quantitatively demonstrate its improved ability to disentangle a variety of discrete and continuous generative factors in comparison to the state of the art [17, 9]. Furthermore, supporting our theoretical intuition, we show that IBP-VAE was able 1) to achieve improved data reconstruction as well as improved independence within the learned posterior densities compared to the use of an independent Gaussian density, and 2) to achieve better disentanglement compared to the use of a complex density that does not consider independence.

We then further demonstrate in two distinct clinical data sets that, when combined with task labels, unsupervised learning of nuisance factors can help improve the extraction of task-relevant representations while facilitating the discovery of knowledge related to task decision-making. We considered a skin lesion image data set where the primary task is to classify malignant skin lesions (melanoma) from benign lesions, challenged by the need to extract subtle features relevant to melanoma detection (e.g., color and shape asymmetry) from a large variety of lesion features [2]. We also considered a clinical electrocardiogram (ECG) data set [36] where the primary task is to localize the origin of arrhythmic beats in the heart from the morphology of 12-lead ECG signals, challenged by an unknown number of nuisance factors including patient demographics, geometry, and pathology that affect ECG morphology through complex physiological processes. These challenges were evident from the limited performance of relevant supervised deep architectures on each data set, which we show could be improved by adding unsupervised disentangling of nuisance factors via IBP-VAE. Note that the effectiveness of disentangled representation learning, either fully unsupervised or combined with supervised tasks, has been little investigated in this type of clinical data set.

To summarize, the main contributions of this work include:

  • Departing from current focus on the independence of latent factors for improving disentangled representation learning, we theoretically rationalize that a richer posterior approximation, with preserved independence, will improve disentangling of generative factors by indirectly reducing the tension between the disentangling and reconstructing capacity of VAE.

  • Via an IBP-VAE with an infinite latent factor posterior approximation, we qualitatively and quantitatively verify our hypothesis on widely-used benchmark data sets.

  • We further demonstrate – for the first time on clinical data sets little explored for disentangled representation learning – that unsupervised disentangling of nuisance factors will improve supervised tasks and facilitate discovery of semantic factors relevant to task decision-making.

Overall, while significant progress in disentangled representation learning has been demonstrated on visual benchmarks with relatively well-known generative factors, its feasibility is little known in real-world data sets, such as clinical images and signals, where there is a large and often unknown number of generative factors with a complex relationship with the data. We hope this work will contribute to bringing unsupervised disentanglement learning towards this direction.

II Related Works

Recent developments of unsupervised disentangled representation learning are primarily considered in the context of deep generative models, such as VAE [23, 35] and generative adversarial networks (GAN) [13]. In $\beta$-VAE [17], it was demonstrated that unsupervised disentanglement can be achieved by constraining the posterior density of the latent representation to be similar to an isotropic Gaussian prior with independence among each latent dimension. Following this line of rationale, better enforcing the independence among latent dimensions has been a main approach to improving disentangled learning in VAE. Examples include adding to the ELBO a penalty constraining the marginal posterior density $q_\phi(z)$ to be similar to an independent prior [26], or directly penalizing the dependence within $q_\phi(z)$ through a total-correlation term, $\mathrm{KL}(q_\phi(z) \,\|\, \prod_j q_\phi(z_j))$, either isolated from the ELBO [9] or added to the ELBO objective [20]. In the context of GAN, it was also shown that maximizing the mutual information between the latent representation and data can help learning disentangled representations [10]. These disentangling-focused networks, however, do not consider the modeling capacity of the latent densities: on the contrary, using a common choice of independent Gaussian densities, the disentangling ability of these networks generally decreases as the number of generative factors in the data increases [17].

In parallel, it has been widely discussed that a Gaussian assumption for the posterior density may underestimate the required complexity of the marginal posterior of the latent representation [18, 1, 5]. There has been an increased interest in enabling richer posterior approximations in VAE, including means to accommodate Gaussian mixture models, autoregressive models [11], flow-based models [24] and Bayesian nonparametric models in VAE [33, 14, 7, 37]. Specifically, the non-parametric IBP has been previously considered in VAE [7, 37]. However, while demonstrating an improvement in data reconstruction and posterior approximations, these works do not consider the role of richer posterior densities in learning disentangled representations.

The presented work can be seen as an attempt to bridge the above two lines of works. We theoretically rationalize that the independence within and the modeling capacity of the latent density are two separate effects when we regularize the posterior density: the former affects disentangling, while the latter affects data reconstruction. Alternative to directly manipulating independence, we bring a new perspective that richer posterior approximations, with preserved independence, will indirectly facilitate disentangling by reducing its competition with the reconstruction objective in ELBO. This is to our knowledge the first investigation of the role of posterior modeling capacity in disentangled representation learning.

Regarding the separation of nuisance factors for learning task-related representations, this work is marginally related to fair representation learning [43, 30] and its applications in confounder filtering [41]. The notion of learning fair representations was introduced in [43] and later extended to VAE in [30] to obfuscate observed confounding attributes, such as age or sex [43], from the learned representation. In an application to the health-care domain, an approach was presented in [41] to remove the effect of confounding factors by identifying network weights that are associated with confounding factors in the pre-trained model. All of these works, however, focus on removing a small number of observed nuisance factors, while the presented work considers disentangling an unknown number of unobserved nuisance factors.

III Methodology

III-A Preliminaries: Beta-Bernoulli Process

The Beta-Bernoulli process is a stochastic process that defines a probability distribution over a sparse binary matrix indicating feature activation for $K$ features. The generative Beta-Bernoulli process taking the limit $K \to \infty$ is also referred to as the IBP [15]. The infinite binary sparse matrix $Z$ represents latent feature allocation, where $z_{n,k}$ is 1 if feature $k$ is active for the $n$-th sample and 0 otherwise. For practical implementations, the stick-breaking construction [39] is considered, where the samples are drawn as:

$$\nu_i \sim \mathrm{Beta}(\alpha, 1), \qquad \pi_k = \prod_{i=1}^{k} \nu_i, \qquad z_{n,k} \sim \mathrm{Bernoulli}(\pi_k) \quad (2)$$

where the hyperparameter $\alpha$ represents the expected number of features in the data.
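The stick-breaking construction can be sketched in a few lines of NumPy. This is an illustrative sampler under a finite truncation, not the code used in this work; the function name, the truncation argument, and the seed are assumptions made for the example.

```python
import numpy as np

def sample_ibp_stick_breaking(n_samples, truncation, alpha, rng):
    """Draw a truncated binary feature matrix via stick-breaking."""
    nu = rng.beta(alpha, 1.0, size=truncation)   # nu_i ~ Beta(alpha, 1)
    pi = np.cumprod(nu)                          # pi_k = prod_{i <= k} nu_i
    # z_{n,k} ~ Bernoulli(pi_k), drawn independently for every sample and feature
    z = (rng.random((n_samples, truncation)) < pi).astype(np.int64)
    return z, pi

rng = np.random.default_rng(0)
z, pi = sample_ibp_stick_breaking(n_samples=500, truncation=20, alpha=3.0, rng=rng)
print(z.sum(axis=1).mean())  # average number of active features per sample
```

Because each $\nu_i$ lies in (0, 1), the activation probabilities $\pi_k$ decay with $k$, so later features are used by fewer samples and the matrix stays sparse even for a large truncation.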

III-B Theoretical Intuition

As introduced earlier, the ELBO objective (1) consists of data reconstruction regularized by some constraints on the posterior density. Independence of the posterior density has been one constraint shown to be effective in improving disentangling [17, 8, 20]. To examine the role of other properties of the posterior pdf in disentangling, we delve further into ELBO following the decomposition in [18, 9]:

$$\mathbb{E}_{p(x)}[\mathcal{L}] = \mathbb{E}_{p(x)}\mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - \mathrm{KL}\Big(q(z) \,\Big\|\, \prod_j q(z_j)\Big) - I_q(x; z) - \sum_j \mathrm{KL}\left(q(z_j) \,\|\, p(z_j)\right) \quad (3)$$

where $q(z) = \mathbb{E}_{p(x)}[q(z|x)]$ is the marginal posterior density and $I_q(x; z)$ is the mutual information between data and latent variables under $q$. As shown, when minimizing the KL-divergence between the posterior and an independent prior density in (1), two constraints take effect: we not only promote the independence within $q(z)$ (total-correlation term in (3)), but also constrain the shape and complexity of $q(z)$ (the 3rd and 4th terms in (3)). While the former promotes the disentangling ability of VAE, the latter, if overly limited, creates an unnecessary competition with the data reconstruction objective in ELBO (the 1st term in (3)). Therefore, if we preserve the independence but allow richer modeling capacity in the posterior density, we will lift this competition and thereby allow improved independence and data reconstruction at the same time. This is the theoretical basis of the presented hypothesis that an independent latent factor model with increased modeling capacity will improve disentangled representation learning in VAE. Below, we investigate this hypothesis with an IBP-VAE where the complexity of the posterior density is able to grow with the complexity of the data.
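To make the total-correlation term concrete, the sketch below evaluates it in the one case where it has a simple closed form: a multivariate Gaussian density with covariance $\Sigma$, for which the KL divergence to the product of its marginals is $\frac{1}{2}(\sum_j \log \Sigma_{jj} - \log\det\Sigma)$. This is only an illustration of what the term measures; the marginal posterior of a trained VAE is generally not Gaussian, and the function name is hypothetical.

```python
import numpy as np

def gaussian_total_correlation(cov):
    """Total correlation of N(0, cov): KL between the joint and the product of its marginals."""
    cov = np.asarray(cov, dtype=float)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

print(gaussian_total_correlation(np.eye(3)))                   # independent dimensions
print(gaussian_total_correlation([[1.0, 0.9], [0.9, 1.0]]))    # strongly correlated dimensions
```

The term is zero exactly when the dimensions are independent and grows as they become more correlated, which is why penalizing it promotes disentangling without, by itself, restricting the shape of each marginal.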

III-C Disentangling IBP-VAE

III-C1 Generative model

We assume that data $X$ is generated by latent representations $Z \odot A$ that follow a non-parametric IBP prior:

$$Z, \nu \sim \mathrm{IBP}(\alpha); \qquad A_n \sim \mathcal{N}(0, I); \qquad x_n \sim p_\theta(x_n \mid Z_n \odot A_n) \quad (4)$$

where $n = 1, \dots, N$, $\odot$ is the element-wise product, and $N$ is the number of data samples. This representation essentially allows the model to infer which latent features captured by $A_n$ are active for the observed data $x_n$. As the active factors for each data point are inferred and not fixed, this non-parametric model is able to grow with the complexity of the data.

As defined in (4), the IBP assumes that each data point possesses feature $k$ with independently-generated probability $\pi_k$. Each $Z_n$ is also modeled as a product of $K$ mutually independent Bernoulli distributions. Furthermore, each $A_n$ is also modeled with independent dimensions via an isotropic Gaussian density. The latent representation $Z_n \odot A_n$, as an element-wise product between $Z_n$ and $A_n$, is therefore also independent among each feature dimension. This provides a latent factor model that is independent among dimensions but with a high modeling capacity. We then model the likelihood $p_\theta(x_n \mid Z_n \odot A_n)$ to be Gaussian (real-valued observations) or Bernoulli (binary observations) parameterized by neural networks.

III-C2 Inference model

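We introduce a variational approximation of the posterior density $p(Z, A, \nu \mid X)$:

$$q(Z, A, \nu \mid X) = \prod_{k=1}^{K} q(\nu_k) \prod_{n=1}^{N} q(A_n \mid x_n) \prod_{k=1}^{K} q(z_{n,k} \mid \nu, x_n) \quad (5)$$

with $q(\nu_k) = \mathrm{Kumaraswamy}(a_k, b_k)$, $q(A_n \mid x_n) = \mathcal{N}(\mu(x_n), \mathrm{diag}(\sigma^2(x_n)))$, and $q(z_{n,k} \mid \nu, x_n)$ a Bernoulli density whose activation probability depends on $\pi_k$ and a network output $d(x_n)$, where we use the Concrete distribution [19, 31] to approximate the Bernoulli distribution, and use the Kumaraswamy distribution [33] to approximate the Beta distribution (more details in Appendix A-A and Appendix A-B). The $\nu_k$ are parameterized by $a_k$ and $b_k$, and $d$, $\mu$ and $\sigma$ are parameterized by neural networks. This gives rise to the presented IBP-VAE architecture as illustrated in Fig. 1.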
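Both approximations are used because they admit differentiable sampling. The sketch below shows the two reparameterized samplers in NumPy, using the inverse CDF of the Kumaraswamy distribution and the logistic-noise form of the binary Concrete relaxation. It is a minimal illustration under assumed names and an assumed temperature value, not the authors' implementation (the details of which are in their Appendix).

```python
import numpy as np

def sample_kumaraswamy(a, b, rng, eps=1e-7):
    """Inverse-CDF sample: the CDF is 1 - (1 - x**a)**b on (0, 1)."""
    u = rng.uniform(eps, 1.0 - eps, size=np.shape(a))
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def sample_binary_concrete(prob, temperature, rng, eps=1e-7):
    """Relaxed Bernoulli sample in (0, 1); approaches a hard 0/1 draw as temperature -> 0."""
    prob = np.clip(prob, eps, 1.0 - eps)
    u = rng.uniform(eps, 1.0 - eps, size=np.shape(prob))
    logits = np.log(prob) - np.log1p(-prob)
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))

rng = np.random.default_rng(0)
nu = sample_kumaraswamy(np.full(5, 3.0), np.ones(5), rng)   # stick proportions
pi = np.cumprod(nu)                                          # activation probabilities
z_relaxed = sample_binary_concrete(pi, temperature=0.5, rng=rng)
```

In both samplers the randomness enters only through the uniform draw `u`, so gradients with respect to `a`, `b`, and `prob` can flow through the sample when the same formulas are written in an automatic-differentiation framework.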

Fig. 1: Outline of the presented IBP-VAE (blue) for unsupervised disentangled representation learning, and cIBP-VAE (blue and orange) for the combination with supervised task learning.

III-C3 Variational inference

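We derive the ELBO, obtained by minimizing the KL divergence between the true posterior and the approximated posterior, for IBP-VAE as:

$$\mathcal{L} = \mathbb{E}_{q}\left[\log p_\theta(X \mid Z \odot A)\right] - \mathrm{KL}\left(q(\nu) \,\|\, p(\nu)\right) - \mathbb{E}_{q(\nu)}\left[\mathrm{KL}\left(q(Z \mid \nu, X) \,\|\, p(Z \mid \nu)\right)\right] - \mathrm{KL}\left(q(A \mid X) \,\|\, p(A)\right) \quad (6)$$

where $\mathcal{L}$ is optimized with respect to the network weights as well as parameters $a$ and $b$. This objective function can be interpreted as minimizing a reconstruction error along with minimizing the KL divergence between the variational posteriors and the corresponding priors in the remaining terms.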
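The structure of this objective, a reconstruction term minus one KL penalty per latent component, can be sketched as below. The sketch is an assumption-laden illustration: it uses exact Bernoulli and Gaussian KL terms, takes the Kumaraswamy-to-Beta KL as a precomputed input `kl_nu` (its approximation is not reproduced here), and adds a hypothetical `beta` weight on the penalties. All names are chosen for the example.

```python
import numpy as np

def bernoulli_kl(q, p, eps=1e-7):
    """KL(Bernoulli(q) || Bernoulli(p)), summed over features."""
    q = np.clip(q, eps, 1.0 - eps)
    p = np.clip(p, eps, 1.0 - eps)
    return np.sum(q * np.log(q / p) + (1.0 - q) * np.log((1.0 - q) / (1.0 - p)), axis=-1)

def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over features."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def negative_elbo(recon_log_lik, q_z, prior_pi, mu, log_var, kl_nu, beta=1.0):
    """Loss to minimize: -(reconstruction - beta * sum of KL penalties)."""
    kl_total = bernoulli_kl(q_z, prior_pi) + gaussian_kl(mu, log_var) + kl_nu
    return -(recon_log_lik - beta * kl_total)
```

When every posterior matches its prior the penalties vanish and the loss reduces to the negative reconstruction log-likelihood.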

III-D Learning task representations

We further consider the use of disentangled representation learning in supervised learning of a task with labels. As illustrated in Fig. 1, we split the latent representation of data $x$ into a nuisance representation and a task representation. The former represents the nuisance factors that will be modeled with the IBP density and learned in an unsupervised manner, while the latter is the task-related representation that will be supervised with the task label $y$. The likelihood function is now conditioned on both representations and, as before, is parameterized by the decoder network. We encode the nuisance factors through the stochastic encoder as described earlier, and the task representation with a deterministic encoder with its own network parameters. We utilize the task label by extending the unsupervised objective in equation (6) with a supervised classification loss on the task representation:

$$\mathcal{L}_{c} = \mathcal{L} + \zeta \, \mathbb{E}_{(x, y)}\left[\log q(y \mid x)\right] \quad (7)$$

where $\mathcal{L}$ is the objective in (6) with the likelihood conditioned on both representations, the hyper-parameter $\zeta$ controls the relative weight between the generative and discriminative learning, and $q(y \mid x)$ is the label predictive distribution [22] approximated by the deterministic encoder. We refer to this extension as cIBP-VAE throughout this paper.
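The way the supervised term is attached to the unsupervised objective can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that the deterministic encoder outputs unnormalized class scores, that the label log-probability is obtained with a softmax, and that per-sample ELBO values are already available; the names `label_log_prob`, `combined_objective`, and `weight` are hypothetical.

```python
import numpy as np

def label_log_prob(class_scores, labels):
    """log q(y|x) from unnormalized class scores via a numerically stable log-softmax."""
    scores = class_scores - class_scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(labels)), labels]

def combined_objective(elbo_values, class_scores, labels, weight):
    """Objective to maximize: mean ELBO plus a weighted mean label log-likelihood."""
    return float(np.mean(elbo_values) + weight * np.mean(label_log_prob(class_scores, labels)))
```

A larger `weight` pushes training toward classification accuracy, while a smaller one emphasizes the generative terms that shape the nuisance representation.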

Fig. 2: [Best viewed in color] (a)-(e): Images generated by traversal along a single latent unit (over a range of [-3, 3]) on the latent representation encoded from a random sample (each row). (f): Triggering capacity of the IBP-VAE: column one: original images; column two: reconstructed images; column three: reconstructed images after deactivating the triggering unit. The schematic boxes illustrate active (green), de-activated (red), and inactive (grey) units of $Z$.
Fig. 3: (Top) Learned latent variables using $\beta$-VAE and $\beta$-IBP-VAE for the traversal range of (-1, 1). (Bottom) Triggering capacity of the $\beta$-IBP-VAE where the three rows give examples of the original images, reconstructed images, and reconstructed images with the triggering unit for leg styles de-activated.
Fig. 4: [Best viewed in color] Disentangling performance of $\beta$-IBP-VAE compared with $\beta$-VAE and $\beta$-VampPrior at $\beta$ = 5. The best result reported in literature (rightmost plot, reprinted from $\beta$-TCVAE [9] with permission) is also presented for comparison. Each plot shows the relationship between each learned latent dimension (row) and each ground-truth factor (column): color in column one encodes high to low values from blue to red; colored lines in columns two and three represent different object shapes. The MIG score (the higher the better) is given for each model.

IV Experiments

We performed three sets of experiments in five distinct data sets. This includes three widely-used benchmark data sets for unsupervised disentangled representation learning, and two real-world clinical data sets with their respective tasks of interest. Across all data sets, we evaluated the disentangling performance of the presented IBP-VAE in comparison to VAE using a standard isotropic Gaussian prior, varying the regularization parameter $\beta$ for the KL penalty term in both settings (i.e., similar to $\beta$-VAE [17], we use the term $\beta$-IBP-VAE when $\beta$ is used with IBP-VAE). In the quantitative analysis of disentanglement, we further included comparisons to VAE that uses a complex prior in the form of VampPrior [40]. In the two clinical data sets, we also evaluated the performance of cIBP-VAE in the respective clinical task in comparison to supervised deep networks as well as c-VAE (similar to cIBP-VAE except the nuisance factor follows a Gaussian prior).

Given the diversity of data sets being considered, we leave data and implementation details to each subsection. In all experiments, implementations of VAE (or c-VAE) adopted the same parameters and architecture as IBP-VAE (or cIBP-VAE) whenever possible. Architecture details of other baseline models are included in Appendix B. In all experiments except dSprites, a validation set was used to select the hyper-parameters. In dSprites, we followed standard practice [9] and quantified disentanglement in the training data. All networks were implemented with PyTorch and optimized with Adam [21]. All statistical tests were based on paired Student's t-tests.

IV-A Qualitative benchmarks

IV-A1 Colored MNIST

We augmented the binary MNIST data set [27] by adding red, green and blue color to 3/4 of the white characters, resulting in 4 types of colors in the data set with an input size of 2352 (3*28*28). This added a discrete nuisance factor to the inherent style variations in the original data set. We focused on the ability of IBP-VAE to disentangle color and other style variations in comparison to VAE. Both the encoder and decoder consisted of two hidden layers of 500 units, each with ReLU activation.

and in (5) were further obtained with one hidden layer, with the truncation number set to and parameter set to for the Beta distribution. For optimization, we used a learning rate of 1e-4.
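The color augmentation described above can be sketched as follows. The exact assignment procedure is not specified in the text, so this sketch assumes that each image independently receives one of the four colors (white, red, green, blue) uniformly at random, which colors about 3/4 of the digits; the palette, function name, and random stand-in images are illustrative only.

```python
import numpy as np

# Rows: white, red, green, blue as RGB multipliers.
PALETTE = np.array([[1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=np.float32)

def colorize(images, rng):
    """Turn (N, 28, 28) binary images into (N, 3, 28, 28) colored images."""
    color_ids = rng.integers(0, len(PALETTE), size=len(images))
    colored = images[:, None, :, :] * PALETTE[color_ids][:, :, None, None]
    return colored.astype(np.float32), color_ids

rng = np.random.default_rng(0)
fake_digits = (rng.random((16, 28, 28)) > 0.8).astype(np.float32)  # stand-in for MNIST
colored, color_ids = colorize(fake_digits, rng)
flat_inputs = colored.reshape(len(colored), -1)  # 3 * 28 * 28 = 2352 values per image
```

The returned `color_ids` act as the ground-truth value of the added discrete nuisance factor and can be used to check whether a single latent unit captures color.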

Fig. 2 gives examples of latent space traversal of the trained IBP-VAE and VAE. As shown, IBP-VAE disentangled semantically meaningful factors such as rotation (a), digit type (c), stroke width (e), and color (f). Specifically, IBP-VAE learned to encode the presence of font color by the activation of a specific latent unit: de-activation of this unit could independently remove the font color, as demonstrated in Fig. 2(f). We refer to this as a triggering unit in the rest of the paper. In comparison, VAE was not able to disentangle color from either rotation (b) or digit type (d), nor was it able to extract the generative factor of stroke width.
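The two manipulations used in this analysis, traversing one latent unit and de-activating a triggering unit, can be sketched on the latent code alone. The sketch assumes the latent representation is the element-wise product of a binary activation vector and a real-valued vector, and that the resulting codes would then be passed to a trained decoder (not shown); the function names are hypothetical.

```python
import numpy as np

def traverse_unit(z, a, unit, values):
    """Copies of the latent code z * a with one unit's real value swept over `values`."""
    latents = np.tile(z * a, (len(values), 1))
    latents[:, unit] = z[unit] * np.asarray(values)
    return latents

def deactivate_unit(z, a, unit):
    """Latent code with the binary activation of one unit switched off."""
    z_off = z.copy()
    z_off[unit] = 0
    return z_off * a

z = np.array([1.0, 0.0, 1.0])    # binary activations for one encoded sample
a = np.array([0.5, -1.2, 2.0])   # real-valued feature values
sweep = traverse_unit(z, a, unit=2, values=np.linspace(-3, 3, 7))
switched_off = deactivate_unit(z, a, unit=2)
```

Decoding each row of `sweep` would give one row of a traversal figure, and decoding `switched_off` would give the reconstruction with the triggering unit removed.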

IV-A2 3D Chairs

The data set of 3D chairs [3], extensively considered for qualitative demonstration of disentangled representation learning, comprises factors of variation such as rotation, width, and leg style of the chairs. Here, we compared the disentangling ability of $\beta$-IBP-VAE to $\beta$-VAE using the same experimental setup and network architecture as [17].

Fig. 3 shows the results of latent space traversal of $\beta$-IBP-VAE and $\beta$-VAE at the value of $\beta$ at which we obtained the best results for $\beta$-VAE. Similar to what was shown in [17], $\beta$-VAE captured three factors of variation including azimuth, width, and leg style. In comparison, $\beta$-IBP-VAE was able to disentangle the same three factors, along with an additional generative factor: the height of the chair. Moreover, $\beta$-IBP-VAE seemed to have found a binary triggering unit that swaps between two different leg styles (Fig. 3 bottom panel).

IV-B Quantitative benchmark

We quantitatively evaluated IBP-VAE in two aspects.

We first considered quantitative metrics recently proposed to measure disentanglement against available ground-truth factors of variation. In particular, we considered the metric of mutual information gap (MIG) [9] that measures the normalized gap in mutual information between the top two dimensions in the latent vector that are most sensitive for each ground-truth factor. It is considered to address some of the limitations of previous metrics, including that it is unbiased to hyperparameter settings and applicable to any latent distributions.
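The definition of MIG can be sketched with a simple histogram-based estimator. This is a simplified illustration and not the estimator of [9]: it assumes point-valued latent codes that are discretized into equal-width bins and ground-truth factors given as non-negative integer labels, and it needs at least two latent dimensions. The function names and the default number of bins are choices made for the example.

```python
import numpy as np

def entropy(labels):
    """Empirical entropy (in nats) of a discrete array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(a, b):
    """I(a; b) = H(a) + H(b) - H(a, b) for non-negative integer arrays."""
    joint = a * (b.max() + 1) + b          # unique id for every (a, b) pair
    return entropy(a) + entropy(b) - entropy(joint)

def mig(latents, factors, n_bins=20):
    """Mean normalized gap between the two most informative latents per factor."""
    binned = np.stack(
        [np.digitize(z, np.histogram_bin_edges(z, bins=n_bins)[1:-1]) for z in latents.T],
        axis=1,
    )
    gaps = []
    for k in range(factors.shape[1]):
        v = factors[:, k]
        mi = np.sort([mutual_information(binned[:, j], v) for j in range(binned.shape[1])])[::-1]
        gaps.append((mi[0] - mi[1]) / entropy(v))
    return float(np.mean(gaps))
```

A score near 1 means each factor is captured by exactly one latent dimension, whereas a score near 0 means at least two dimensions carry the same information about a factor.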


The second analysis was inspired by the rate-distortion (RD) analysis introduced in [1], which characterizes the competition between the first reconstruction term (distortion) and the second KL-divergence term (rate) in the ELBO objective (1). Here, we further narrowed down the RD analysis to focus on the competition between the reconstruction and disentangling ability of the VAE. To do so, we singled out the total-correlation (TC) term from the rate term as shown in (3), which measures the independence within the learned latent factors. We then contrasted it with the distortion term similar to the R-D analysis: we term this as TC-D analysis.

Both quantitative analyses were carried out on the dSprites [32] data set that consists of 737,280 synthetic images (64×64) of 2D shapes with five known generative factors: scale, rotation, x-position, y-position and shape. Besides VAE with a regular Gaussian prior, we also compared IBP-VAE with VAE with a complex prior in the form of VampPrior [40]. For all models, we adopted the CNN encoder-decoder architecture from [9] with a latent dimension of 10 (details in Appendix B), and we varied the value of $\beta$ for the penalizing KL terms. A learning rate of 5e-4 and value of 10 were used. Other hyper-parameters required for VampPrior followed the standard implementation provided by [40].

β     β-VAE     β-VampPrior   β-IBP-VAE
1     0.1890    0.1305        0.4174
5     0.4786    0.4848        0.5477
10    0.4661    0.4676        0.485
TABLE I: The disentanglement score given by mutual information gap (MIG) from $\beta$-VAE, $\beta$-VampPrior and $\beta$-IBP-VAE.
Fig. 5: TC-D analyses for $\beta$-VAE, $\beta$-IBP-VAE, and $\beta$-VampPrior on dSprites. The (TC, D) value obtained from each model is plotted for three different $\beta$ values.

IV-B1 MIG scores

Table I compares the MIG disentanglement scores of $\beta$-VAE, $\beta$-VampPrior, and $\beta$-IBP-VAE at different values of $\beta$. Fig. 4 visualizes the disentanglement performance of these models at $\beta$ = 5, along with the best results adopted from [9]. Each plot shows the relationship between the learned latent dimension (row) and the ground-truth factors (column): a successfully disentangled latent dimension should vary with only one ground-truth factor.

As is evident both visually and quantitatively, $\beta$-IBP-VAE achieved better disentanglement than the other two models across the values of $\beta$ in Table I. For instance, at $\beta$ = 5 (Fig. 4), $\beta$-IBP-VAE was able to clearly separate the rotation, scale, and position. In comparison, $\beta$-VAE heavily entangled rotation with position, while $\beta$-VampPrior captured the factor of rotation in two separate dimensions. Notably, at $\beta$ = 5, the MIG score obtained by $\beta$-IBP-VAE surpassed the best result reported in $\beta$-TCVAE [9], a state-of-the-art disentangling VAE that improves over $\beta$-VAE by penalizing only the TC term.

IV-B2 TC-D analysis

In Fig. 5, we present the TC-D analysis for the three models considered. As we increased the $\beta$ value, the distortion (D) for all the models increased or, in other words, the ability of the model to reconstruct decreased. Meanwhile, total correlation (TC) decreased, improving the independence among latent factors and hence helping in disentanglement. In comparison to $\beta$-VAE with a regular Gaussian density, $\beta$-IBP-VAE was able to achieve lower distortion (better reconstruction) as well as lower or comparable TC (better or comparable independence) across all values of $\beta$. This verified our hypothesis that enabling richer yet independent posterior approximations was able to reduce the competition between the reconstructing and disentangling ability of VAE, allowing simultaneous improvement in both terms. VAE with the VampPrior, as expected, obtained the best reconstructions throughout all values of $\beta$ due to the use of a complex density. Without explicitly considering independence in the density, however, it resulted in decreased disentanglement compared to IBP-VAE, as measured by both the higher TC values (Fig. 5) and the lower MIG values (Table I) across all values of $\beta$.

Model           AC      AUC    MSE
CNN (AlexNet)   82.59   0.75   -
c-VAE           81.79   0.73   1521.05
cIBP-VAE        83.11   0.79   1096.55
TABLE II: Lesion classification accuracy (AC), area under the ROC curve (AUC), and reconstruction mean square error (MSE) of cIBP-VAE, c-VAE, and baseline AlexNet.
Fig. 6: ROC curves of cIBP-VAE in comparison to alternative models for classification of melanoma and benign lesions.
Fig. 7: [Best viewed in color] (a) Reconstruction examples of cIBP-VAE and c-VAE, along with MSE values. (b) Column one: original images; column two: difference in reconstruction after switching the lesion label from malignant to benign (or vice versa); column three: overlay of reconstruction difference (green) with original images. (c) Visual and quantitative reconstruction difference before (row two) and after (row three) de-activating the triggering unit. (d) Images generated by traversing a single latent unit over the [-5,5] range.

IV-C Real-world clinical dataset

IV-C1 Skin lesion analysis

ISIC 2016 [16] is a public benchmark challenge data set consisting of dermoscopic images of skin diseases released to support the development of melanoma diagnosis algorithms. Here, we considered the task of classification of dermoscopic images into melanoma (malignant) and benign categories. The challenge of this task lies in the need to extract subtle features relevant to melanoma detection, such as color and shape asymmetry [38], from a large variety of lesion features. To be able to interpret semantically what factors did and did not contribute to the classification, therefore, is also important for the diagnosis decision.

We used the given training and test sets of 900 and 379 images, respectively. We further split off a random 20% of the training set for validation. Following the pre-processing in [28], we cropped the center portion of the dermoscopic images and proportionally resized the cropped area to 256×256. We used AlexNet [25], pretrained on the ImageNet data set, as the supervised baseline in this data set. The encoder in cIBP-VAE and c-VAE used AlexNet to extract features, which were then factorized into the nuisance and task representations via two hidden layers on each branch. For the nuisance representation, both hidden layers used a size of 4096 and the truncation number was set to 50. For the task representation, the two hidden layers used sizes of 100 and 2 (representing class scores). For the decoder, we used a deep convolutional architecture (details in Appendix B). The value of in (7) was set to 5 with a warm-up of 100 for 300 epochs [4], and the learning rate was set to 1e-4.

Task accuracy: Table II summarizes the lesion classification performance of cIBP-VAE in comparison to the baseline discriminative AlexNet and c-VAE, using the two metrics recommended by ISIC for this task. The receiver operating characteristic (ROC) curves for all the models are also presented in Fig. 6. As shown, while c-VAE decreased the task performance in comparison to AlexNet, cIBP-VAE significantly improved the lesion classification accuracy ( 0.04) and improved the ROC-AUC score from 0.75 (AlexNet) to 0.79. This suggests that unsupervised disentangling of nuisance factors could improve task accuracy if the nuisance factors are properly learned, and that VAE with a regular Gaussian density may have a limited ability to disentangle this data set given the complex factors of variation.
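For reference, the two reported metrics can be computed from model outputs as sketched below. This is a generic illustration, not the evaluation script used for Table II: the AUC is computed as the fraction of (melanoma, benign) pairs ranked correctly, with ties counted as half, and the 0.5 decision threshold for accuracy is an assumption.

```python
import numpy as np

def accuracy(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the binary label."""
    predictions = np.asarray(scores, dtype=float) >= threshold
    return float(np.mean(predictions == np.asarray(labels, dtype=bool)))

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise ranking of positives against negatives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

scores = [0.1, 0.4, 0.35, 0.8]   # predicted melanoma probabilities
labels = [0, 0, 1, 1]            # 1 = melanoma, 0 = benign
print(accuracy(scores, labels), roc_auc(scores, labels))
```

Unlike accuracy, the AUC does not depend on a decision threshold, which is useful when the two classes are imbalanced.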

Uncovering and disentangling latent factors: To first compare the number of factors of variation that could be captured by cIBP-VAE vs. c-VAE, we compared the reconstruction accuracy of both models on test data. Table II (third column) shows that the reconstruction error of cIBP-VAE was significantly lower (). Examples in Fig. 7(a) show that cIBP-VAE was particularly better at preserving the detailed color distribution in the skin lesion, which is known to be important for melanoma detection [38].

To interpret the task-relevant representation learned by cIBP-VAE, we took the nuisance representation encoded by cIBP-VAE from a test image and combined it with the opposite image label for reconstruction. We expected the difference between the original and reconstructed images to explain what had contributed to the classification. Fig. 7(b) gives three such examples. Interestingly, after switching the label of a melanoma image, the reconstruction difference primarily focused on regions with asymmetric color or an atypical network within the lesion, providing visual support for the subtle characteristics that justified the melanoma classification.

Finally, to interpret the nuisance factors learned by cIBP-VAE, we analyzed images generated by traversing along continuous factors and de-activating binary factors. In Fig. 7(c), we show that cIBP-VAE has learned a triggering unit whose activation controls local lesion color, as highlighted by the red circle and the change in reconstruction error. In Fig. 7(d), we show images generated by traversing along two different latent dimensions learned by cIBP-VAE over a wide range of [-5, 5]. The results demonstrate that cIBP-VAE has discovered and disentangled factors such as the size and location of the lesion that are generally irrelevant to the task of melanoma detection.

IV-C2 Clinical 12-lead ECG

The ECG data set described in [36] was collected during invasive electrical stimulation in the hearts of 39 post-infarction patients: on each patient, 15-second 12-lead ECG recordings resulting from different stimulation locations were collected. Following the pre-processing in [8], the data set consists of 16848 ECG beats (12 × 100; 12 = number of leads, 100 = temporal samples), each with a labeled site of electrical stimulation in the form of one of the ten anatomical segments of the left ventricle. This data set was collected for the purpose of learning to localize the origin of ventricular activation from 12-lead ECG morphology, which can be useful for predicting the origin of abnormal rhythm in the heart and thus guiding treatment.

Model       Seg. classification (in %)   Seg. classification with artifacts (in %)
CNN         53.89                        52.44
c-VAE       55.97                        53.95
cIBP-VAE    57.53                        56.97
TABLE III: Segment classification accuracy (with and without artifacts) of CNN, c-VAE, and cIBP-VAE on the test set.

This task is challenged by significant inter-subject variations in a wide range of factors such as heart and thorax anatomy, pathological remodeling of the heart, and surface electrode positioning, all of which affect ECG morphology [34]. Unlike the visual factors disentangled in the last three data sets, these factors are not directly visible in the data but are related to it through a complex physics-based process. To add a visual factor and to test the ability of cIBP-VAE to grow with the complexity of the data, we further augmented this data set by adding an artifact (of size 10 for each lead), in the form of an artificial pacing stimulus, to a randomly selected 50% of the ECG data. The entire data set was split into training, validation, and test sets such that no two sets shared data from the same patient.
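A patient-wise split of this kind can be sketched as below. This is a minimal illustration rather than the procedure used in the experiments: the split fractions, the random seed, and the function name `split_by_patient` are all assumptions, since the text only states that no set shares data from the same patient.

```python
import random


def split_by_patient(patient_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Return (train, val, test) index lists whose patients do not overlap.

    `patient_ids[i]` is the patient that sample i came from.
    The fractions are hypothetical; they are applied to patients, not beats.
    """
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)

    n_test = max(1, round(len(patients) * test_frac))
    n_val = max(1, round(len(patients) * val_frac))
    test_patients = set(patients[:n_test])
    val_patients = set(patients[n_test:n_test + n_val])

    train, val, test = [], [], []
    for index, pid in enumerate(patient_ids):
        if pid in test_patients:
            test.append(index)
        elif pid in val_patients:
            val.append(index)
        else:
            train.append(index)
    return train, val, test


# Toy example: 39 hypothetical patients with 10 beats each.
ids = [patient for patient in range(39) for _ in range(10)]
train_idx, val_idx, test_idx = split_by_patient(ids)
```

Splitting on patient identifiers rather than on individual beats is what keeps inter-subject variation from leaking between the training and evaluation sets.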

The network architecture of IBP-VAE was identical to that used on colored-MNIST. For the task representation, a hidden layer of 10 units representing class scores was used. For the nuisance factors, batch normalization was added after the encoded representations, with the Beta distribution parameter set to 20 and the truncation number set to 50. A learning rate of 1e-3 was used. For the weight hyperparameter in equation (7), several values were tried to find the best model.

We compared cIBP-VAE to: 1) a supervised CNN with three-layered convolution blocks (dropout, 2d convolution, batch normalization, ReLU, and max-pool layer) followed by two fully connected layers, and 2) c-VAE with the same parameters and architecture of cIBP-VAE. The design choice of the supervised CNN was inspired by


Fig. 8: ROC curves of cIBP-VAE in comparison to alternative models on the clinical ECG data set.

Task accuracy with increasing data complexity: Table III compares the classification accuracy on the test set obtained by the three models. The limited performance of the CNN shows the significant challenge introduced by inter-subject variations in this data set. By adding unsupervised disentangling of nuisance factors, both c-VAE and cIBP-VAE achieved higher classification accuracy, and cIBP-VAE significantly outperformed c-VAE both with and without the signal artifact. This improvement in performance is also summarized in the ROC curves in Fig. 8, along with the value of the area under the macro-average ROC curve.

It is also noteworthy that, while all models showed a decrease in classification accuracy when pacing artifacts were introduced to the data, cIBP-VAE exhibited the smallest accuracy loss (0.56 percentage points, per Table III) in comparison to c-VAE (2.02) and CNN (1.45), further demonstrating the advantage of IBP-VAE in growing with the complexity of the factors of variation in the data.
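For readers unfamiliar with the macro-average summary, the sketch below computes a macro-averaged one-vs-rest AUC on toy scores. It is an illustration only: it assumes the macro average is the unweighted mean of per-class AUCs, uses the rank-based (pairwise comparison) form of the AUC, and the function names and toy data are hypothetical.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def macro_auc(score_rows, true_classes, n_classes):
    """Unweighted mean of one-vs-rest AUCs over all classes."""
    aucs = []
    for c in range(n_classes):
        scores = [row[c] for row in score_rows]
        labels = [1 if t == c else 0 for t in true_classes]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / n_classes


# Toy class scores for four samples and three classes (made-up numbers).
toy_scores = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
toy_labels = [0, 1, 2, 0]
toy_macro_auc = macro_auc(toy_scores, toy_labels, n_classes=3)
```

With ten anatomical segments as classes, the same routine would average ten one-vs-rest AUCs, giving each segment equal weight regardless of how many beats it contains.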

Model       All signal   Artifact segment
                         all     non-stimulus   stimulus
c-VAE       2293.23      3.20    3.91           2.49
cIBP-VAE    2273.65      0.45    0.19           0.72
TABLE IV: Reconstruction errors of cIBP-VAE vs. c-VAE for the entire signals (column 2) and for the artifact segment only (columns 3-5). The latter is calculated, respectively, for all samples (all), samples with no pacing artifact (non-stimulus), and samples with pacing artifacts (stimulus).
Fig. 9: 12-lead ECG traces where the pacing artifact is highlighted to the left side of the dotted line. (a) An original signal without the pacing artifact. (b) The reconstructed signal using cIBP-VAE. (c) The reconstructed signal using c-VAE.

Uncovering & disentangling latent factors: Because factors of inter-subject variations in the ECG data set cannot be labeled or directly visualized, here we focus on the ability of cIBP-VAE vs. c-VAE in uncovering and disentangling the binary factor of pacing artifacts in the augmented data set.

Fig. 10: [Best viewed in color] (a): Swapping between the task and nuisance representation from two samples (left) transferred the presence and absence of pacing artifact in the reconstructed signals (right). (b)-(d): Reconstructions before (c) and after (d) de-activation of the triggering unit, in comparison to original signals (b).

As shown in Table IV, while cIBP-VAE and c-VAE showed a similar accuracy in reconstructing the ECG signals, their accuracy in reconstructing the small artifact segment differed significantly. Fig. 9 shows an example where cIBP-VAE was able to reconstruct the absence of a pacing artifact while c-VAE was not. This shows that cIBP-VAE was able to capture more generative factors in a data set already riddled with a wide variety of factors of variation.

To demonstrate the disentanglement of task and nuisance representations, we show in Fig. 10(a) that, when the encoded nuisance factors between a pair of signals were swapped, the absence and presence of the pacing artifacts were transferred as well. Furthermore, similar to previous data sets, cIBP-VAE has learned a triggering unit to encode the absence or presence of the signal artifact in ECG data. Fig. 10(b)-(d) show two examples, where de-activation of this triggering unit added a pacing artifact to the reconstructed signal. This showed that cIBP-VAE was able to disentangle the specific nuisance factor of signal artifact, not only from the task representation but also from other nuisance factors.

V Conclusion and Future Work

We presented a VAE model with a non-parametric independent latent factor model for unsupervised learning of disentangled representations. Departing from the current focus on independence, we showed how an increased modeling capacity in the latent density improves the disentangling ability of VAE, especially as the complexity of the generative factors in the data increases. We further showed how unsupervised disentangling of nuisance factors can improve supervised extraction of task representations as well as facilitate interpretability of the learned representations. These were demonstrated through state-of-the-art qualitative and quantitative results on widely-used benchmark data sets, as well as improved performance over supervised deep networks on clinical data sets that have been little explored for the effectiveness of disentangled representation learning. Immediate future work includes extending the current work to more benchmark data sets with a large number of variations and performing the quantitative comparison using additional disentanglement models and metrics [29].

Appendix A Model

A-A Concrete distribution

During training of our presented IBP-VAE, we approximate Bernoulli random variables with the Concrete distribution [31], which has a convenient parametrization in terms of a temperature parameter and samples from the Gumbel(0,1) distribution. The Gumbel(0,1) distribution in turn has a convenient sampling strategy using Uniform(0,1).
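As an illustration of this relaxation, the sketch below draws a relaxed Bernoulli sample from Gumbel noise. It is a sketch under stated assumptions, not the paper's implementation: it uses the two-class Gumbel-softmax form, which reduces to a sigmoid of the logit plus the difference of two Gumbel(0,1) samples, divided by the temperature. The names `prob`, `temperature`, and the function names are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Gumbel(0,1) sample via the inverse CDF: g = -log(-log(u)), u ~ Uniform(0,1)."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logarithms finite
    return -math.log(-math.log(u))


def stable_sigmoid(x):
    """Sigmoid that avoids overflow for large negative or positive inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_relaxed_bernoulli(prob, temperature, rng):
    """Continuous relaxation of Bernoulli(prob); requires 0 < prob < 1.

    Lower temperatures push samples toward 0 or 1.
    """
    logit = math.log(prob) - math.log1p(-prob)
    noise = sample_gumbel(rng) - sample_gumbel(rng)  # difference of Gumbels
    return stable_sigmoid((logit + noise) / temperature)


rng = random.Random(0)
samples = [sample_relaxed_bernoulli(0.3, 0.5, rng) for _ in range(20000)]
```

Because each sample is a differentiable function of `prob` for fixed noise, gradients can pass through the sampling step during training. Thresholding the samples at 0.5 recovers draws whose activation frequency is close to `prob`.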

A-B Kumaraswamy distribution

For the posterior involving the Beta distribution, we approximate it using the Kumaraswamy distribution, as shown in [33], where samples are drawn as:

$$x = \left(1 - u^{1/b}\right)^{1/a},$$

where $u \sim \mathrm{Uniform}(0,1)$ and $a, b > 0$. This distribution is equivalent to the Beta distribution when $a = 1$ or $b = 1$ or both.
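Inverse-CDF sampling from a Kumaraswamy distribution can be sketched as follows. The shape parameter names `a` and `b` and the function name are assumptions made for illustration. The check on the sample mean relies on the standard fact that Kumaraswamy(1, b) coincides with Beta(1, b), whose mean is 1/(1 + b).

```python
import random


def sample_kumaraswamy(a, b, rng):
    """Draw x in [0, 1] from Kumaraswamy(a, b), with a > 0 and b > 0.

    Uses x = (1 - u**(1/b))**(1/a) with u ~ Uniform(0,1).
    """
    u = rng.random()
    return (1.0 - u ** (1.0 / b)) ** (1.0 / a)


rng = random.Random(0)
draws = [sample_kumaraswamy(1.0, 3.0, rng) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)  # should be close to 1 / (1 + 3) = 0.25
```

Since the sample is a closed-form, differentiable function of `a` and `b` given `u`, it allows the reparameterization trick where a Beta posterior would not.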

Appendix B Architecture Details

CNN Encoder for dSprites
Layer         Info
CNN           2dConv(1, 32, (4,4), 2, pad=1), ReLU
CNN           2dConv(32, 32, (4,4), 2, pad=1), ReLU
FC            Linear layer(256), ReLU
FC            Linear layer(256), ReLU
FC            Linear layer(500), ReLU
, , diag()    3 * Linear layer(10)

CNN Decoder for dSprites
Layer         Info
FC            Linear layer(256), ReLU
FC            Linear layer(4 * 4 * 32), ReLU
CNN           2d Transpose Conv(32, 32, (4,4), 2, pad=1), ReLU
CNN           2d Transpose Conv(32, 1, (4,4), 2, pad=1), Sigmoid

CNN Decoder for Clinical Skin Analysis data set
Layer         Info
CNN           2d Transpose Conv(1, 64 * 8, (4,4), 1), batchNorm, ReLU
CNN           2d Transpose Conv(64 * 8, 64 * 4, (4,4), 2, pad=1), batchNorm, ReLU
CNN           2d Transpose Conv(64 * 4, 64 * 2, (4,4), 4), batchNorm, ReLU
CNN           2d Transpose Conv(64 * 2, 64, (4,4), 4), batchNorm, ReLU
CNN           2d Transpose Conv(64, 3, (4,4), 2, pad=1)
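The spatial sizes produced by the skin-lesion decoder can be checked with a few lines of arithmetic. The sketch below makes several assumptions: the layer arguments are read as (in channels, out channels, kernel, stride, pad) with padding defaulting to 0, the decoder input has spatial size 1 × 1, and the output size follows the usual transposed-convolution rule without dilation or output padding. The function name is hypothetical.

```python
def conv_transpose_out(size, kernel, stride, pad=0):
    """Output length of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * pad + kernel


# (kernel, stride, pad) for the five decoder layers listed above.
skin_decoder_layers = [(4, 1, 0), (4, 2, 1), (4, 4, 0), (4, 4, 0), (4, 2, 1)]

size = 1  # assumed 1 x 1 spatial input
sizes = []
for kernel, stride, pad in skin_decoder_layers:
    size = conv_transpose_out(size, kernel, stride, pad)
    sizes.append(size)

print(sizes)  # [4, 8, 32, 128, 256]
```

Under these assumptions the final layer yields a 256 × 256 output, which agrees with the image size used for the skin lesion data.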


References

  • [1] A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy (2018) Fixing a broken elbo. In International Conference on Machine Learning, pp. 159–168.
  • [2] G. Argenziano, G. Fabbrocini, P. Carli, V. De Giorgi, E. Sammarco, and M. Delfino (1998) Epiluminescence microscopy for the diagnosis of doubtful melanocytic skin lesions: comparison of the abcd rule of dermatoscopy and a new 7-point checklist based on pattern analysis. Archives of dermatology 134 (12), pp. 1563–1570.
  • [3] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic (2014) Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3762–3769.
  • [4] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2016) Generating sentences from a continuous space. Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL.
  • [5] Y. Burda, R. Grosse, and R. Salakhutdinov (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
  • [6] S. Chakraborty, R. Tomsett, R. Raghavendra, D. Harborne, M. Alzantot, F. Cerutti, M. Srivastava, A. Preece, S. Julier, R. M. Rao, et al. (2017) Interpretability of deep learning models: a survey of results. In IEEE Smart World Congress 2017 Workshop: DAIS.
  • [7] S. P. Chatzis (2014) Indian buffet process deep generative models. arXiv preprint arXiv:1402.3427.
  • [8] S. Chen, P. K. Gyawali, H. Liu, B. M. Horacek, J. L. Sapp, and L. Wang (2017) Disentangling inter-subject variations: automatic localization of ventricular tachycardia origin from 12-lead electrocardiograms. In Biomedical Imaging (ISBI 2017), 2017 IEEE 14th International Symposium on, pp. 616–619.
  • [9] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems 31.
  • [10] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
  • [11] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel (2016) Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.
  • [12] N. Dilokthanakul, P. A. Mediano, M. Garnelo, M. C. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan (2016) Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
  • [14] P. Goyal, Z. Hu, X. Liang, C. Wang, E. P. Xing, and C. Mellon (2017) Nonparametric variational auto-encoders for hierarchical representation learning. In ICCV, pp. 5104–5112.
  • [15] T. L. Griffiths and Z. Ghahramani (2011) The indian buffet process: an introduction and review. Journal of Machine Learning Research 12 (Apr), pp. 1185–1224.
  • [16] D. Gutman, N. C. Codella, E. Celebi, B. Helba, M. Marchetti, N. Mishra, and A. Halpern (2016) Skin lesion analysis toward melanoma detection: a challenge at the international symposium on biomedical imaging (isbi) 2016, hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1605.01397.
  • [17] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2016) Beta-vae: learning basic visual concepts with a constrained variational framework.
  • [18] M. D. Hoffman and M. J. Johnson (2016) Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS.
  • [19] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
  • [20] H. Kim and A. Mnih (2018) Disentangling by factorising. In International Conference on Machine Learning, pp. 2654–2663.
  • [21] D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [22] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589.
  • [23] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).
  • [24] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751.
  • [25] A. Krizhevsky (2014) One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.
  • [26] A. Kumar, P. Sattigeri, and A. Balakrishnan (2017) Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848.
  • [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • [28] Y. Li and L. Shen (2018) Skin lesion analysis towards melanoma detection using deep learning network. Sensors 18 (2), pp. 556.
  • [29] F. Locatello, S. Bauer, M. Lucic, S. Gelly, B. Schölkopf, and O. Bachem (2018) Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359.
  • [30] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel (2015) The variational fair autoencoder. arXiv preprint arXiv:1511.00830.
  • [31] C. J. Maddison, A. Mnih, and Y. W. Teh (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
  • [32] L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner (2017) DSprites: disentanglement testing sprites dataset.
  • [33] E. Nalisnick and P. Smyth (2016) Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197.
  • [34] R. Plonsey (1969) Bioelectric phenomena. Wiley Online Library.
  • [35] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1278–1286.
  • [36] J. L. Sapp, M. Bar-Tal, A. J. Howes, J. E. Toma, A. El-Damaty, J. W. Warren, P. J. MacInnis, S. Zhou, and B. M. Horáček (2017) Real-time localization of ventricular tachycardia origin from the 12-lead electrocardiogram. JACC: Clinical Electrophysiology 3 (7), pp. 687–699.
  • [37] R. Singh, J. Ling, and F. Doshi-Velez (2017) Structured variational autoencoders for the beta-bernoulli process. NIPS 2017 Workshop on Advances in Approximate Bayesian Inference.
  • [38] H. P. Soyer, G. Argenziano, I. Zalaudek, R. Corona, F. Sera, R. Talamini, F. Barbato, A. Baroni, L. Cicale, A. Di Stefani, et al. (2004) Three-point checklist of dermoscopy. Dermatology 208 (1), pp. 27–31.
  • [39] Y. W. Teh, D. Görür, and Z. Ghahramani (2007) Stick-breaking construction for the indian buffet process. In Artificial Intelligence and Statistics, pp. 556–563.
  • [40] J. M. Tomczak and M. Welling (2018) VAE with a vampprior. In International Conference on Artificial Intelligence and Statistics, AISTATS.
  • [41] Z. Wu, H. Wang, M. Cao, Y. Chen, and E. P. Xing (2018) Fair deep learning prediction for healthcare applications with confounder filtering. arXiv preprint arXiv:1803.07276.
  • [42] T. Yang, L. Yu, Q. Jin, L. Wu, and B. He (2018) Localization of origins of premature ventricular contraction by means of convolutional neural network from 12-lead ecg. IEEE Transactions on Biomedical Engineering 65 (7), pp. 1662–1671.
  • [43] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013) Learning fair representations. In International Conference on Machine Learning.