I Introduction
An inherent goal in deep learning is to distill task-relevant latent representations that are invariant to other nuisance factors in the data. State-of-the-art deep neural networks achieve this by careful engineering of the network architecture, along with supervised training on a large number of task labels. However, the effectiveness of supervised training relies heavily on data quantity and label quality, especially in data with a wide range of data-specific factors of variation. Moreover, interpreting the results of these networks – important in areas such as clinical tasks – remains challenging [6].

Unsupervised disentangled representation learning provides a task-agnostic approach to learning latent generative factors that are semantically interpretable and mutually invariant
[17, 9, 20, 26, 10]. Many recent successes in this area are based on variational autoencoders (VAE), which modernize variational inference by using neural networks to parameterize both the likelihood of the data x given the latent variable z, p_θ(x|z), and the approximated posterior density of z, q_φ(z|x) [23, 35]. The objective of VAE training is thus to maximize the variational evidence lower bound (ELBO) of the marginal data likelihood:

L(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) ),    (1)
where the first term can be interpreted as data reconstruction, while the second penalty term constrains the approximated posterior density to be similar to a prior p(z) by minimizing their Kullback-Leibler (KL) divergence.
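As a concrete illustration (not the paper's implementation; all function and variable names here are our own), the two terms of (1) for a Bernoulli decoder and a diagonal-Gaussian posterior can be computed as follows:

```python
import numpy as np

def gaussian_vae_elbo(x, x_probs, mu, logvar):
    """One-sample Monte Carlo estimate of the ELBO in (1) for a
    Bernoulli decoder and a diagonal-Gaussian posterior
    q(z|x) = N(mu, diag(exp(logvar)))."""
    eps = 1e-8
    # Reconstruction term: E_q[log p(x|z)], Bernoulli log-likelihood
    recon = np.sum(
        x * np.log(x_probs + eps) + (1 - x) * np.log(1 - x_probs + eps),
        axis=1)
    # Penalty term: KL(q(z|x) || N(0, I)) in closed form
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1, axis=1)
    return float(np.mean(recon - kl))
```

Note that the KL penalty vanishes exactly when the posterior matches the standard-normal prior, so any deviation of q from the prior trades off against reconstruction.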
To improve disentangled learning in VAE, the primary focus has been on enforcing independence among the learned latent factors, achieved by more heavily penalizing the distance from the posterior density [17] or its marginal density [26] to a prior that is independent among dimensions. This may be strengthened by an explicit independence penalty on the marginal posterior, e.g., either added to the ELBO [20] or isolated from the ELBO through a total-correlation decomposition [9]. These investigations, however, are carried out in the context of a Gaussian approximation of the posterior density, limiting the ability to model generative factors with increased complexity [17].
In parallel, enabling richer posterior approximations has been an active topic of interest for improving data reconstruction in VAE [40, 24]. This is often achieved by designing more complex posterior densities and/or priors with increased modeling power. Their effect on the disentangling ability of the VAE, however, has not been considered.
In this work, we investigate the little-explored relationship between the modeling capacity of the posterior density and the disentangling ability of the VAE. Following [9], we note that constraining a (marginal) posterior density to an independent prior enforces two effects: independence among the latent factors, and a limit on the complexity of the density. The former has to do with disentangling, while the latter affects the modeling capacity of the VAE: when enforcing an independent density with limited modeling capacity, the latter creates unnecessary tension with the reconstruction objective (the data likelihood). Therefore, as an alternative to directly reinforcing independence, we reason that a richer modeling capacity will indirectly improve disentangling by reducing this tension.
Formally, we hypothesize that an independent latent factor model with increased modeling capacity will improve disentangled learning of generative factors with increased complexity. We investigate this theoretical intuition with a VAE model that utilizes a nonparametric Bayesian latent factor model – the Beta-Bernoulli process implemented via the Indian Buffet Process (IBP) [15] – to model an unbounded number of mutually independent latent factors. We first evaluate this IBP-VAE on three benchmark data sets (color-augmented MNIST [27], 3D Chairs [3], and dSprites [32]), where we qualitatively and quantitatively demonstrate its improved ability to disentangle a variety of discrete and continuous generative factors in comparison to the state of the art [17, 9]. Furthermore, supporting our theoretical intuition, we show that IBP-VAE was able 1) to achieve improved data reconstruction as well as improved independence within the learned posterior densities compared to the use of an independent Gaussian density, and 2) to achieve better disentanglement compared to the use of a complex density that does not consider independence.
We then further demonstrate on two distinct clinical data sets that – when combined with task labels – unsupervised learning of nuisance factors can help improve the extraction of task-relevant representations while facilitating the discovery of knowledge related to task decision-making. We considered a skin lesion image data set [16], where the primary task is to classify malignant skin lesions (melanoma) from benign lesions, challenged by the need to extract subtle features relevant to melanoma detection (e.g., color and shape asymmetry) from a large variety of lesion features [2]. We also considered a clinical electrocardiogram (ECG) data set [36], where the primary task is to localize the origin of an arrhythmic beat in the heart from the morphology of 12-lead ECG signals, challenged by an unknown number of nuisance factors, including patient demographics, geometry, and pathology, that affect ECG morphology through complex physiological processes. These challenges were evident from the limited performance of relevant supervised deep architectures on each data set, which we show could be improved by adding unsupervised disentangling of nuisance factors via IBP-VAE. Note that the effectiveness of disentangled representation learning, either fully unsupervised or combined with supervised tasks, has been little investigated on this type of clinical data set.

To summarize, the main contributions of this work include:

• Departing from the current focus on the independence of latent factors for improving disentangled representation learning, we theoretically rationalize that a richer posterior approximation, with preserved independence, will improve disentangling of generative factors by indirectly reducing the tension between the disentangling and reconstructing capacity of the VAE.

• Via an IBP-VAE with an infinite latent factor posterior approximation, we qualitatively and quantitatively verify our hypothesis on widely-used benchmark data sets.

• We further demonstrate – for the first time on clinical data sets little explored for disentangled representation learning – that unsupervised disentangling of nuisance factors improves supervised tasks and facilitates the discovery of semantic factors relevant to task decision-making.
Overall, while significant progress in disentangled representation learning has been demonstrated on visual benchmarks with relatively well-known generative factors, its feasibility is little known on real-world data sets – such as clinical images and signals – where there is a large and often unknown number of generative factors with a complex relationship to the data. We hope this work contributes to bringing unsupervised disentanglement learning in this direction.
II Related Works
Recent developments in unsupervised disentangled representation learning are primarily considered in the context of deep generative models, such as the VAE [23, 35] and generative adversarial networks (GAN) [13]. In β-VAE [17], it was demonstrated that unsupervised disentanglement can be achieved by constraining the posterior density of the latent representation to be similar to an isotropic Gaussian prior with independence among latent dimensions. Following this line of rationale, better enforcing independence among latent dimensions has been a main approach to improving disentangled learning in VAE. Examples include adding to the ELBO a penalty constraining the marginal posterior density q(z) to be similar to an independent prior [26], or directly penalizing the dependence within q(z) through a total-correlation term, either isolated from the ELBO [9] or added to the ELBO objective [20]. In the context of GAN, it was also shown that maximizing the mutual information between the latent representation and the data can help learn disentangled representations [10]. These disentangling-focused networks, however, do not consider the modeling capacity of the latent densities; on the contrary, with the common choice of independent Gaussian densities, the disentangling ability of these networks generally decreases as the number of generative factors in the data increases [17].
In parallel, it has been widely discussed that a Gaussian assumption for the posterior density may underestimate the required complexity of the marginal posterior of the latent representation [18, 1, 5]. There has been increased interest in enabling richer posterior approximations in VAE, including means to accommodate Gaussian mixture models [12, 11], flow-based models [24], and Bayesian nonparametric models [33, 14, 7, 37]. In particular, the nonparametric IBP has previously been considered in VAE [7, 37]. However, while demonstrating improvements in data reconstruction and posterior approximation, these works do not consider the role of richer posterior densities in learning disentangled representations.

The presented work can be seen as an attempt to bridge the above two lines of work. We theoretically rationalize that the independence within, and the modeling capacity of, the latent density are two separate effects of regularizing the posterior density: the former affects disentangling, while the latter affects data reconstruction. As an alternative to directly manipulating independence, we bring a new perspective: richer posterior approximations, with preserved independence, indirectly facilitate disentangling by reducing the competition with the reconstruction objective in the ELBO. This is, to our knowledge, the first investigation of the role of posterior modeling capacity in disentangled representation learning.
Regarding the separation of nuisance factors for learning task-related representations, this work is marginally related to fair representation learning [43, 30] and its applications in confounder filtering [41]. The notion of learning fair representations was introduced in [43] and later extended to VAE in [30] to obfuscate observed confounding attributes, such as age or sex [43], from the learned representation. In application to the healthcare domain, an approach was presented in [41] to remove the effect of confounding factors by identifying network weights associated with those factors in a pretrained model. All of these works, however, focus on removing a small number of observed nuisance factors, while the presented work considers disentangling an unknown number of unobserved nuisance factors.
III Methodology
III-A Preliminaries: Beta-Bernoulli Process
The Beta-Bernoulli process is a stochastic process that defines a probability distribution over a sparse binary matrix indicating feature activation for K features. The generative Beta-Bernoulli process taking the limit K → ∞ is also referred to as the IBP [15]. The infinite binary sparse matrix Z represents latent feature allocation, where z_{nk} is 1 if feature k is active for the n-th sample and 0 otherwise. For practical implementations, the stick-breaking construction [39] is considered, where samples are drawn as:

v_k ∼ Beta(α, 1),  π_k = ∏_{j=1}^{k} v_j,  z_{nk} ∼ Bernoulli(π_k),    (2)

where the hyperparameter α represents the expected number of features in the data.

III-B Theoretical Intuition
As introduced earlier, the ELBO objective (1) consists of data reconstruction regularized by constraints on the posterior density. Independence of the posterior density is one constraint shown to be effective in improving disentangling [17, 8, 20]. To examine the role of other properties of the posterior density in disentangling, we delve further into the ELBO, following the decomposition in [18, 9]:

E_{p(x)}[ L(θ, φ; x) ] = E_{p(x)} E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q(z) ‖ ∏_j q(z_j) ) − I_q(x; z) − ∑_j KL( q(z_j) ‖ p(z_j) ),    (3)

where q(z) = E_{p(x)}[ q_φ(z|x) ] is the marginal (aggregate) posterior. As shown, when minimizing the KL divergence between the posterior and an independent prior density in (1), two constraints take effect: we not only promote independence within q(z) (the total-correlation term in (3)) but also constrain the shape and complexity of q(z) (the 3rd and 4th terms in (3)). While the former promotes the disentangling ability of the VAE, the latter – if overly limited – creates unnecessary competition with the data reconstruction objective in the ELBO (the 1st term in (3)). Therefore, if we preserve independence but allow a richer modeling capacity in the posterior density, we lift this competition and thereby allow improved independence and data reconstruction at the same time. This is the theoretical basis of the presented hypothesis that an independent latent factor model with increased modeling capacity will improve disentangled representation learning in VAE. Below, we investigate this hypothesis with an IBP-VAE, where the complexity of the posterior density is able to grow with the complexity of the data.
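To make the total-correlation term in (3) concrete: for a zero-mean Gaussian aggregate posterior, TC has a closed form (sum of marginal entropies minus joint entropy), and the following small sketch (names are our own) shows that the term vanishes exactly when the dimensions are independent:

```python
import numpy as np

def gaussian_total_correlation(cov):
    """TC(z) = KL( q(z) || prod_j q(z_j) ) for a zero-mean Gaussian
    with covariance `cov`; this reduces to
    0.5 * (sum_j log var_j - log det cov)."""
    var = np.diag(cov)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(var)) - logdet)
```

For a diagonal covariance the value is zero; for a 2-D Gaussian with correlation ρ it equals −0.5 log(1 − ρ²), growing without bound as the dimensions become fully dependent.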
III-C Disentangling IBP-VAE

III-C1 Generative model
We assume that data x_n are generated by a latent representation z_n that follows a nonparametric IBP prior:

v_k ∼ Beta(α, 1),  π_k = ∏_{j=1}^{k} v_j,  b_{nk} ∼ Bernoulli(π_k),  a_{nk} ∼ N(0, 1),
z_n = b_n ⊙ a_n,  x_n ∼ p_θ(x_n | z_n),  n = 1, …, N,    (4)

where ⊙ is the element-wise product and N is the number of data samples. This representation essentially allows the model to infer which of the latent features, captured by a_n, are active for the observed data via the binary mask b_n. As the active factors for each data point are inferred rather than fixed, this nonparametric model is able to grow with the complexity of the data.
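The generative process in (4) can be sketched with ancestral sampling under a finite truncation (variable names follow our reconstruction of (2) and (4); the truncation level K is an assumption of the sketch, not part of the infinite model):

```python
import numpy as np

def sample_ibp_latents(n, K, alpha, rng=None):
    """Draw z = b (elementwise-times) a under a truncated stick-breaking
    IBP prior: v_k ~ Beta(alpha, 1), pi_k = prod_{j<=k} v_j,
    b_nk ~ Bernoulli(pi_k), a_nk ~ N(0, 1)."""
    rng = np.random.default_rng(rng)
    v = rng.beta(alpha, 1.0, size=K)   # stick-breaking proportions
    pi = np.cumprod(v)                 # feature activation probabilities
    b = rng.random((n, K)) < pi        # sparse binary feature allocation
    a = rng.normal(size=(n, K))        # continuous feature values
    return b * a, pi
```

Because pi_k decays geometrically in k, only a finite, data-dependent number of features are active in practice, which is what lets the model grow with the complexity of the data.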
As defined in (4), the IBP assumes that each data point possesses feature k with independently generated probability π_k. Each b_n is thus modeled as a product of mutually independent Bernoulli distributions. Furthermore, each a_n is modeled with independent dimensions via an isotropic Gaussian density. The latent representation z_n, as an element-wise product between b_n and a_n, is therefore also independent among feature dimensions. This provides a latent factor model that is independent among dimensions yet has a high modeling capacity. We then model the likelihood p_θ(x|z) as Gaussian (real-valued observations) or Bernoulli (binary observations), parameterized by neural networks.

III-C2 Inference model
We introduce a variational approximation of the posterior density, q_φ(v, b, a | x):

q_φ(v, b, a | x) = ∏_{k=1}^{K} [ q(v_k) ∏_{n=1}^{N} q_φ(b_{nk} | x_n) q_φ(a_{nk} | x_n) ],    (5)

where we use the Concrete distribution [19, 31] to approximate the Bernoulli distribution and the Kumaraswamy distribution [33] to approximate the Beta distribution (more details in Appendix A-A and Appendix A-B). Each q(v_k) is parameterized by variational parameters c_k and d_k, while q_φ(b|x) and q_φ(a|x) are parameterized by neural networks. This gives rise to the presented IBP-VAE architecture as illustrated in Fig. 1.

III-C3 Variational inference
We derive the ELBO for IBP-VAE, obtained by minimizing the KL divergence between the approximated and the true posterior, as:

L = E_{q_φ(v,b,a|x)}[ log p_θ(x | b ⊙ a) ] − KL( q(v) ‖ p(v) ) − E_{q(v)}[ KL( q_φ(b|x) ‖ p(b|v) ) ] − KL( q_φ(a|x) ‖ p(a) ),    (6)

where L is optimized with respect to the network weights as well as the parameters c_k and d_k. This objective function can be interpreted as minimizing a reconstruction error (the first term) along with minimizing the KL divergences between the variational posteriors and the corresponding priors in the remaining terms.
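The Kumaraswamy stand-in for the Beta variables in (5) is what makes the stick-breaking terms of this objective reparameterizable: its CDF, F(x) = 1 − (1 − x^a)^b, inverts in closed form. A minimal sketch (function and parameter names are ours):

```python
import numpy as np

def sample_kumaraswamy(a, b, size, rng=None):
    """Reparameterized draws from Kumaraswamy(a, b) by inverting its
    closed-form CDF F(x) = 1 - (1 - x^a)^b at u ~ Uniform(0, 1)."""
    rng = np.random.default_rng(rng)
    # clip u away from {0, 1} for numerical stability
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=size)
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
```

Gradients can flow through a and b because the draw is a deterministic transform of u, exactly as in the Gaussian reparameterization trick; Kumaraswamy(1, 1) recovers the uniform distribution.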
III-D Learning task representations
We further consider the use of disentangled representation learning in supervised learning of a task with labels y. As illustrated in Fig. 1, we split the latent representation of the data x into a nuisance representation z_u and a task representation z_t. The former will be modeled with the IBP density and learned in an unsupervised manner, while the latter will be supervised with the task label. The likelihood function is now expressed as p_θ(x | z_u, z_t) and, as before, is parameterized by the decoder network. We encode the nuisance factors through the stochastic encoder as described earlier, and the task representation with a deterministic encoder parameterized by ψ. We utilize the task label by extending the unsupervised objective in (6) with a supervised classification loss on the task representation:

L_c = L + λ E_{p(x,y)}[ log q_ψ(y | x) ],    (7)

where the hyperparameter λ controls the relative weight between generative and discriminative learning, and q_ψ(y|x) is the label predictive distribution [22] approximated by the deterministic encoder. We refer to this extension as cIBP-VAE throughout this paper.
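When trained by gradient descent, (7) amounts to adding a λ-weighted cross-entropy to the negated ELBO; a minimal per-batch sketch (names of our choosing, not the paper's code):

```python
import numpy as np

def cibp_vae_loss(neg_elbo, class_log_probs, labels, lam):
    """Supervised objective of (7) as a minimization target:
    negative ELBO plus lambda-weighted cross-entropy of the
    deterministic task encoder q(y|x)."""
    n = len(labels)
    # cross-entropy = -mean log q(y|x) over the labeled batch
    ce = -np.mean(class_log_probs[np.arange(n), labels])
    return neg_elbo + lam * ce
```

Setting lam = 0 recovers the purely unsupervised objective, while large lam shifts the balance toward discriminative learning.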
IV Experiments
We performed three sets of experiments on five distinct data sets: three widely-used benchmark data sets for unsupervised disentangled representation learning, and two real-world clinical data sets with their respective tasks of interest. Across all data sets, we evaluated the disentangling performance of the presented IBP-VAE in comparison to VAE with a standard isotropic Gaussian prior, varying the regularization parameter β on the KL penalty term in both settings (i.e., similar to β-VAE [17], we use the term β-IBP-VAE when β > 1 is used with IBP-VAE). In the quantitative analysis of disentanglement, we further included comparisons to a VAE that uses a complex prior in the form of the VampPrior [40]. On the two clinical data sets, we also evaluated the performance of cIBP-VAE on the respective clinical task in comparison to supervised deep networks as well as cVAE (identical to cIBP-VAE except that the nuisance factors follow a Gaussian prior).
Given the diversity of the data sets considered, we leave data and implementation details to each subsection. In all experiments, implementations of VAE (or cVAE) adopted the same parameters and architecture as IBP-VAE (or cIBP-VAE) whenever possible. Architecture details of the other baseline models are included in Appendix B. In all experiments except dSprites, a validation set was used to select the hyperparameters. On dSprites, we followed standard practice [9] and quantified disentanglement on the training data. All networks were implemented in PyTorch and optimized with Adam [21]. All statistical tests were based on paired Student's t-tests.

IV-A Qualitative benchmarks
IV-A1 Colored MNIST
We augmented the binary MNIST data set [27] by adding red, green, or blue color to 3/4 of the white characters, resulting in four colors in the data set and an input size of 2352 (3 × 28 × 28). This added a discrete nuisance factor to the inherent style variations in the original data set. We focused on the ability of IBP-VAE to disentangle color and other style variations in comparison to β-VAE. Both the encoder and decoder consisted of two hidden layers of 500 units, each with ReLU activation. q_φ(b|x) and q_φ(a|x) in (5) were further obtained with one additional hidden layer each, with fixed values for the truncation number and the α parameter of the Beta distribution. For optimization, we used a learning rate of 1e-4.

Fig. 2 gives examples of latent-space traversals of the trained IBP-VAE and β-VAE. As shown, IBP-VAE disentangled semantically meaningful factors such as rotation (a), digit type (c), stroke width (e), and color (f). In particular, IBP-VAE learned to encode the presence of font color by the activation of a specific latent unit: deactivation of this unit could independently remove the font color, as demonstrated in Fig. 2(f). We refer to this as a triggering unit in the rest of the paper. In comparison, β-VAE was not able to disentangle color from either rotation (b) or digit type (d), nor was it able to extract the generative factor of stroke width.
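The color augmentation described at the start of this subsection can be sketched as follows (the channel layout and the uniform color assignment are our assumptions):

```python
import numpy as np

def colorize_digits(images, rng=None):
    """Turn (n, 28, 28) grayscale digits into (n, 3, 28, 28) colored
    ones: each image is kept white or tinted red, green, or blue with
    equal probability, so ~3/4 of the characters receive a color."""
    rng = np.random.default_rng(rng)
    colors = np.array([[1, 1, 1],          # white (unchanged)
                       [1, 0, 0],          # red
                       [0, 1, 0],          # green
                       [0, 0, 1]], float)  # blue
    choice = colors[rng.integers(0, 4, size=images.shape[0])]
    # broadcast one RGB triple over each image's spatial dimensions
    return images[:, None, :, :] * choice[:, :, None, None]
```

Flattened, each sample then has 3 × 28 × 28 = 2352 inputs, matching the dimensionality quoted above.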
IV-A2 3D Chairs
The 3D Chairs data set [3], extensively considered for qualitative demonstration of disentangled representation learning, comprises factors of variation such as rotation, width, and leg style of the chairs. Here, we compared the disentangling ability of IBP-VAE to β-VAE using the same experimental setup and network architecture as [17].

Fig. 3 shows the results of latent-space traversal of IBP-VAE and β-VAE at the β value for which we obtained the best results with β-VAE. Similar to what was shown in [17], β-VAE captured three factors of variation: azimuth, width, and leg style. In comparison, IBP-VAE was able to disentangle the same three factors, along with an additional generative factor: the height of the chair. Moreover, IBP-VAE appeared to have found a binary triggering unit that swaps between two different leg styles (Fig. 3, bottom panel).
IV-B Quantitative benchmark
We quantitatively evaluated IBP-VAE in two respects.

We first considered quantitative metrics recently proposed to measure disentanglement against available ground-truth factors of variation. In particular, we considered the mutual information gap (MIG) [9], which measures the normalized gap in mutual information between the two latent dimensions that are most sensitive to each ground-truth factor. It is considered to address some of the limitations of previous metrics, being unbiased with respect to hyperparameter settings and applicable to any latent distribution [9].

The second analysis was inspired by the rate-distortion (RD) analysis introduced in [1], which characterizes the competition between the first reconstruction term (distortion) and the second KL-divergence term (rate) in the ELBO objective (1). Here, we further narrowed the RD analysis to focus on the competition between the reconstruction and disentangling abilities of the VAE. To do so, we singled out the total-correlation (TC) term from the rate term, as shown in (3), which measures the independence within the learned latent factors. We then contrasted it with the distortion term, in the spirit of the RD analysis; we term this the TC-D analysis.
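A simplified histogram-based estimate of MIG (our own sketch; the estimator in [9] instead works with the learned densities) can be written as:

```python
import numpy as np

def _mi(a, f, bins):
    """Discrete mutual information between a binned latent code `a`
    and a discrete ground-truth factor `f`."""
    joint, _, _ = np.histogram2d(a, f, bins=(bins, len(np.unique(f))))
    p = joint / joint.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz]))

def mutual_info_gap(latents, factors, bins=20):
    """Average over factors of (top MI - second MI) / factor entropy."""
    gaps = []
    for j in range(factors.shape[1]):
        f = factors[:, j]
        mis = sorted(_mi(latents[:, i], f, bins)
                     for i in range(latents.shape[1]))
        _, counts = np.unique(f, return_counts=True)
        pf = counts / counts.sum()
        gaps.append((mis[-1] - mis[-2]) / -np.sum(pf * np.log(pf)))
    return float(np.mean(gaps))
```

The score approaches 1 when exactly one latent dimension captures each factor, and approaches 0 when the information about a factor is duplicated across dimensions.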
Both quantitative analyses were carried out on the dSprites [32] data set, which consists of 737,280 synthetic images (64 × 64) of 2D shapes with five known generative factors: scale, rotation, x-position, y-position, and shape. Besides VAE with a regular Gaussian prior, we also compared IBP-VAE with a VAE using a complex prior in the form of the VampPrior [40]. For all models, we adopted the CNN encoder-decoder architecture from [9] with a latent dimension of 10 (details in Appendix B), and we varied the value of β on the penalized KL terms. A learning rate of 5e-4 and an α value of 10 were used. The other hyperparameters required for the VampPrior followed the standard implementation provided by [40].
TABLE I: MIG scores on dSprites at different values of β.

β | VAE | VampPrior | IBP-VAE
1 | 0.1890 | 0.1305 | 0.4174
5 | 0.4786 | 0.4848 | 0.5477
10 | 0.4661 | 0.4676 | 0.485
IV-B1 MIG scores
Table I compares the MIG disentanglement scores of VAE, VampPrior, and IBP-VAE at different values of β. Fig. 4 visualizes the disentanglement performance of these models, along with the best results adopted from [9]. Each plot shows the relationship between the learned latent dimensions (rows) and the ground-truth factors (columns): a successfully disentangled latent dimension should vary with only one ground-truth factor.

As is evident both visually and quantitatively, across values of β, IBP-VAE achieved better disentanglement than the other two models. For instance, in Fig. 4, IBP-VAE was able to clearly separate rotation, scale, and position. In comparison, VAE heavily entangled rotation with position, while VampPrior captured the factor of rotation in two separate dimensions. Notably, at β = 5, the MIG score of IBP-VAE surpassed the best result reported for β-TC-VAE [9], a state-of-the-art disentangling VAE that improves over β-VAE by penalizing only the TC term.
IV-B2 TC-D analysis
In Fig. 5, we present the TC-D analysis for the three models considered. As we increased the value of β, the distortion (D) for all models increased; in other words, the ability of the model to reconstruct decreased. Meanwhile, the total correlation (TC) decreased, improving the independence among latent factors and hence helping disentanglement. In comparison to VAE with a regular Gaussian density, IBP-VAE was able to achieve lower distortion (better reconstruction) as well as lower or comparable TC (better or comparable independence) across all values of β. This verified our hypothesis that enabling richer yet independent posterior approximations reduces the competition between the reconstructing and disentangling abilities of the VAE, allowing simultaneous improvement in both. VAE with the VampPrior, as expected, obtained the best reconstructions across all values of β due to the use of a complex density. Without explicitly considering independence in the density, however, it showed decreased disentanglement compared to IBP-VAE, as measured by both the higher TC values (Fig. 5) and the lower MIG values (Table I) across all values of β.
TABLE II: Lesion classification accuracy (AC, in %), ROC-AUC, and test reconstruction error (MSE).

Model | AC | AUC | MSE
CNN (AlexNet) | 82.59 | 0.75 | –
cVAE | 81.79 | 0.73 | 1521.05
cIBP-VAE | 83.11 | 0.79 | 1096.55
IV-C Real-world clinical data sets

IV-C1 Skin lesion analysis
ISIC 2016 [16] is a public benchmark challenge data set of dermoscopic images of skin diseases, released to support the development of melanoma diagnosis algorithms. Here, we considered the task of classifying dermoscopic images into melanoma (malignant) and benign categories. The challenge of this task lies in the need to extract subtle features relevant to melanoma detection, such as color and shape asymmetry [38], from a large variety of lesion features. Being able to interpret semantically what factors did and did not contribute to the classification is therefore also important for the diagnostic decision.
We used the given training and test sets of 900 and 379 images, respectively, and further split a random 20% of the training set for validation. Following the preprocessing in [28], we cropped the center portion of each dermoscopic image and proportionally resized the cropped area to 256 × 256. We used AlexNet [25], pretrained on ImageNet, as the supervised baseline on this data set. The encoders in cIBP-VAE and cVAE used AlexNet to extract features, which were then factorized into the nuisance representation z_u and the task representation z_t via two hidden layers on each branch. For z_u, both hidden layers used a size of 4096, and the truncation number was set to 50. For z_t, the two hidden layers used sizes of 100 and 2 (representing class scores). For the decoder, we used a deep convolutional architecture (details in Appendix B). The value of λ in (7) was set to 5, with a warm-up [4] of 100 of the 300 training epochs, and the learning rate was set to 1e-4.

Task accuracy: Table II summarizes the lesion classification performance of cIBP-VAE in comparison to the baseline discriminative AlexNet and cVAE, using the two metrics recommended by ISIC for this task. The receiver operating characteristic (ROC) curves for all models are also presented in Fig. 6. As shown, while cVAE decreased the task performance in comparison to AlexNet, cIBP-VAE significantly improved the lesion classification accuracy (p < 0.04) and improved the ROC-AUC score from 0.75 (AlexNet) to 0.79. This suggests that unsupervised disentangling of nuisance factors can improve task accuracy if the nuisance factors are properly learned, and that a VAE with a regular Gaussian density may have a limited ability to disentangle this data set given its complex factors of variation.
Uncovering and disentangling latent factors: To first compare the factors of variation that could be captured by cIBP-VAE vs. cVAE, we compared the reconstruction accuracy of both models on the test data. Table II (last column) shows that the reconstruction error of cIBP-VAE was significantly lower. Examples in Fig. 7(a) show that cIBP-VAE was particularly better at preserving the detailed color distribution in the skin lesion, which is known to be important for melanoma detection [38].
To interpret the task-relevant representation learned by cIBP-VAE, we took the nuisance representation encoded by cIBP-VAE from a test image and combined it with the opposite image label for reconstruction. We expected the difference between the original and reconstructed images to explain what had contributed to the classification. Fig. 7(b) gives three such examples. Interestingly, after switching the label of a melanoma image, the reconstruction difference primarily focused on regions with asymmetric color or an atypical network within the lesion, providing visual support for the subtle characteristics that justified the melanoma classification.
Finally, to interpret the nuisance factors learned by cIBP-VAE, we analyzed images generated by traversing along continuous factors and deactivating binary factors. In Fig. 7(c), we show that cIBP-VAE learned a triggering unit whose activation controls local lesion color, as highlighted by the red circle and the change in reconstruction error. In Fig. 7(d), we show images generated by traversing along two different latent dimensions learned by cIBP-VAE over a wide range of [−5, 5]. The results demonstrate that cIBP-VAE discovered and disentangled factors, such as the size and location of the lesion, that are generally irrelevant to the task of melanoma detection.
IV-C2 Clinical 12-lead ECG
The ECG data set described in [36] was collected during invasive electrical stimulation in the hearts of 39 post-infarction patients: for each patient, 15-second 12-lead ECG recordings resulting from different stimulation locations were collected. Following the preprocessing in [8], the data set consists of 16,848 ECG beats (12 × 100; 12 = number of leads, 100 = temporal samples), each with a labeled site of electrical stimulation in the form of one of ten anatomical segments of the left ventricle. This data set was collected for the purpose of learning to localize the origin of ventricular activation from 12-lead ECG morphology, which can be useful for predicting the origin of abnormal rhythms in the heart and thus guiding treatment.
TABLE III: Segment classification accuracy with and without signal artifacts.

Model | Seg. classification (in %) | Seg. classification with artifacts (in %)
CNN | 53.89 | 52.44
cVAE | 55.97 | 53.95
cIBP-VAE | 57.53 | 56.97
This task is challenged by significant inter-subject variations in a wide range of factors, such as heart and thorax anatomy, pathological remodeling of the heart, and surface electrode positioning, all of which affect ECG morphology [34]. Unlike the visual disentangling in the previous three data sets, these factors are not directly visible in the data but are related to it through a complex physics-based process. To add a visual factor, and to test the ability of cIBP-VAE to grow with the complexity of the data, we further augmented this data set with an artifact (of 10 temporal samples in each lead) – in the form of an artificial pacing stimulus – added to 50% of randomly selected ECG data. The entire data set was split into training, validation, and test sets, with no set sharing data from the same patient.
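The artifact augmentation described above can be sketched as follows (the pulse position and amplitude are our assumptions; only the 10-sample width and the 50% fraction come from the text):

```python
import numpy as np

def add_pacing_artifact(ecg, frac=0.5, width=10, amplitude=1.0, rng=None):
    """Insert a square pacing-stimulus pulse of `width` samples at the
    start of every lead for a random `frac` of beats in an
    (n_beats, n_leads, n_samples) array; returns the augmented copy
    and the boolean mask of affected beats."""
    rng = np.random.default_rng(rng)
    out = ecg.copy()
    mask = rng.random(ecg.shape[0]) < frac
    out[mask, :, :width] += amplitude
    return out, mask
```

Keeping the mask makes it possible to check afterwards whether the model's latent space isolates the artifact as a single binary factor.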
The network architecture was identical to that used on colored MNIST. For the task representation z_t, a hidden layer of 10 units representing class scores was used. For the nuisance factors z_u, batch normalization was added after the encoded representations, with α = 20 for the Beta distribution and the truncation number set to 50. A learning rate of 1e-3 was used. For the weight hyperparameter λ in (7), a range of values was searched to find the best model. We compared cIBP-VAE to: 1) a supervised CNN with three-layered convolution blocks (dropout, 2D convolution, batch normalization, ReLU, and max-pooling) followed by two fully connected layers, and 2) cVAE with the same parameters and architecture as cIBP-VAE. The design choice of the supervised CNN was inspired by [42].

Task accuracy with increasing data complexity: Table III compares the classification accuracy on the test set obtained by the three models. The limited performance of the CNN shows the significant challenge introduced by inter-subject variations in this data set. By adding unsupervised disentangling of nuisance factors, both cVAE and cIBP-VAE achieved higher classification accuracy, with cIBP-VAE significantly outperforming cVAE both with and without the signal artifact. This improvement is also summarized in the ROC curves in Fig. 8, along with the area under the macro-average ROC curve.
It is also noteworthy that, while all models showed a decrease in classification accuracy when pacing artifacts were introduced to the data, cIBP-VAE exhibited the smallest margin of accuracy loss in comparison to cVAE and the CNN, further demonstrating the advantage of IBP-VAE in growing with the complexity of the factors of variation in the data.
TABLE IV: Reconstruction errors of cVAE and cIBP-VAE on the whole signal vs. the artifact segment.

Model | All signal | Artifact segment: all | non-stimulus | stimulus
cVAE | 2293.23 | 3.20 | 3.91 | 2.49
cIBP-VAE | 2273.65 | 0.45 | 0.19 | 0.72
Uncovering & disentangling latent factors: Because the factors of inter-subject variation in the ECG data set cannot be labeled or directly visualized, we focus here on the ability of cIBP-VAE vs. cVAE to uncover and disentangle the binary factor of pacing artifacts in the augmented data set.
As shown in Table IV, while cIBP-VAE and cVAE showed similar accuracy in reconstructing the ECG signals, their accuracy in reconstructing the small artifact segment differed significantly. Fig. 9 shows an example in which cIBP-VAE was able to reconstruct the absence of a pacing artifact while cVAE was not. This shows that cIBP-VAE was able to capture more generative factors in a data set already riddled with a wide variety of factors of variation.
To demonstrate the disentanglement of task and nuisance representations, we show in Fig. 10(a) that, when the encoded nuisance factors of a pair of signals were swapped, the absence and presence of the pacing artifacts were transferred as well. Furthermore, similar to the previous data sets, cIBP-VAE learned a triggering unit that encodes the absence or presence of the signal artifact in the ECG data. Fig. 10(b)-(d) show two examples in which deactivation of this triggering unit added a pacing artifact to the reconstructed signal. This shows that cIBP-VAE was able to disentangle the specific nuisance factor of the signal artifact not only from the task representation but also from the other nuisance factors.
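Both manipulations above operate purely in latent space before decoding. A minimal sketch of the two operations, assuming a latent code stored as a flat list and an IBP-style binary mask gating each continuous unit; the function names and the list-based representation are illustrative, not the authors' code:

```python
def swap_nuisance(latent_a, latent_b, nuisance_idx):
    """Exchange only the nuisance dimensions of two latent codes,
    leaving the task-relevant dimensions untouched."""
    a, b = list(latent_a), list(latent_b)
    for i in nuisance_idx:
        a[i], b[i] = latent_b[i], latent_a[i]
    return a, b

def deactivate_unit(z_cont, z_mask, unit):
    """Zero one entry of the binary IBP mask, silencing that latent unit
    (e.g., the 'triggering' unit) before the code is decoded."""
    mask = list(z_mask)
    mask[unit] = 0.0
    return [c * m for c, m in zip(z_cont, mask)]
```

Decoding the swapped or masked codes then yields the transfer and triggering effects described above.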
V Conclusion and Future Work
We presented a VAE model with a nonparametric independent latent factor model for unsupervised learning of disentangled representations. Departing from the current focus on independence, we showed how increased modeling capacity in the latent density improves the disentangling ability of VAE, especially as the complexity of the generative factors in the data increases. We further showed how unsupervised disentangling of nuisance factors can improve supervised extraction of task representations as well as facilitate interpretability of the learned representations. These were demonstrated through state-of-the-art qualitative and quantitative results on widely-used benchmark data sets, as well as improved performance over supervised deep networks on clinical data sets that have been little explored for the effectiveness of disentangled representation learning. Immediate future work includes extending the current approach to more benchmark data sets with a larger number of variations and performing quantitative comparisons with additional disentanglement models and metrics [29].
Appendix A Model
A-A Concrete distribution
During training of our presented IBP-VAE, we approximate the Bernoulli random variables with the Concrete distribution [31], which has a convenient parametrization:
z = \sigma\!\left(\frac{\log \pi - \log(1-\pi) + g_1 - g_2}{\lambda}\right), \qquad g_1, g_2 \sim \mathrm{Gumbel}(0,1), \quad (8)
where \lambda > 0 is the temperature parameter and each Gumbel(0,1) sample in turn has a convenient sampling strategy, g = -\log(-\log u) with u \sim \mathrm{Uniform}(0,1).
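This sampling strategy is easy to sketch directly (a standard binary Concrete relaxation; the log-odds parametrization, the helper name, and the optional `u1`, `u2` arguments for reproducible draws are assumptions of this sketch, not the paper's exact notation):

```python
import math
import random

def sample_bin_concrete(log_alpha, temperature, u1=None, u2=None):
    """Relaxed Bernoulli sample from the binary Concrete distribution.

    log_alpha is the Bernoulli log-odds. The Logistic(0,1) noise is formed
    as the difference of two Gumbel(0,1) samples, each obtained from a
    uniform draw as g = -log(-log(u))."""
    if u1 is None:
        u1 = random.random()
    if u2 is None:
        u2 = random.random()
    g1 = -math.log(-math.log(u1))
    g2 = -math.log(-math.log(u2))
    # sigmoid of the temperature-scaled perturbed log-odds
    return 1.0 / (1.0 + math.exp(-(log_alpha + g1 - g2) / temperature))
```

As the temperature approaches 0, the samples concentrate near {0, 1}, recovering (near-)discrete Bernoulli behavior while remaining differentiable in `log_alpha`.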
A-B Kumaraswamy distribution
For the posterior involving the Beta distribution, we approximate it using the Kumaraswamy distribution, as shown by [33], with samples drawn as:
x = \left(1 - u^{1/b}\right)^{1/a}, \quad (9)
where u \sim \mathrm{Uniform}(0,1) and a, b > 0. The Kumaraswamy(a, b) distribution is equivalent to the Beta(a, b) distribution when a = 1 or b = 1, or both.
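The inverse-CDF sampler in equation (9) can be written in a few lines (the optional `u` argument is added here only to make individual draws reproducible; it is not part of the paper's formulation):

```python
import random

def sample_kumaraswamy(a, b, u=None):
    """Inverse-CDF sample from Kumaraswamy(a, b):
    x = (1 - u**(1/b))**(1/a), with u ~ Uniform(0,1) and a, b > 0."""
    if u is None:
        u = random.random()
    return (1.0 - u ** (1.0 / b)) ** (1.0 / a)
```

Unlike the Beta distribution, the Kumaraswamy inverse CDF is available in closed form, which is what makes this reparameterized sampling convenient inside a VAE.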
Appendix B Architecture Details
CNN Encoder for dSprites:

Layer | Info
CNN | 2D Conv(1, 32, (4,4), stride 2, pad=1), ReLU
CNN | 2D Conv(32, 32, (4,4), stride 2, pad=1), ReLU
FC | Linear(256), ReLU
FC | Linear(256), ReLU
FC | Linear(500), ReLU
, , diag() | 3 × Linear(10)

CNN Decoder for dSprites:

Layer | Info
FC | Linear(256), ReLU
FC | Linear(4 × 4 × 32), ReLU
CNN | 2D Transpose Conv(32, 32, (4,4), stride 2, pad=1), ReLU
CNN | 2D Transpose Conv(32, 1, (4,4), stride 2, pad=1), Sigmoid

CNN Decoder for Clinical Skin Analysis dataset:

Layer | Info
CNN | 2D Transpose Conv(1, 64×8, (4,4), stride 1), BatchNorm, ReLU
CNN | 2D Transpose Conv(64×8, 64×4, (4,4), stride 2, pad=1), BatchNorm, ReLU
CNN | 2D Transpose Conv(64×4, 64×2, (4,4), stride 4), BatchNorm, ReLU
CNN | 2D Transpose Conv(64×2, 64, (4,4), stride 4), BatchNorm, ReLU
CNN | 2D Transpose Conv(64, 3, (4,4), stride 2, pad=1)
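The spatial sizes implied by the tables above can be sanity-checked with the standard size arithmetic for convolutions and transpose convolutions. This is only a shape calculator, not model code:

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size of a convolution:
    floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

def tconv_out(size, kernel, stride, pad=0):
    """Spatial output size of a transpose convolution:
    (size - 1)*stride - 2*pad + kernel."""
    return (size - 1) * stride - 2 * pad + kernel
```

For example, a (4,4) kernel with stride 2 and pad 1 halves a 64-pixel dSprites input to 32 in the encoder, and the same transpose-convolution configuration doubles a 4-pixel feature map to 8 in the decoder.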
References

[1] (2018) Fixing a broken ELBO. In International Conference on Machine Learning, pp. 159–168.
[2] (1998) Epiluminescence microscopy for the diagnosis of doubtful melanocytic skin lesions: comparison of the ABCD rule of dermatoscopy and a new 7-point checklist based on pattern analysis. Archives of Dermatology 134 (12), pp. 1563–1570.
[3] (2014) Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3762–3769.
[4] (2016) Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL).
[5] (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
[6] (2017) Interpretability of deep learning models: a survey of results. In IEEE Smart World Congress 2017 Workshop: DAIS.
[7] (2014) Indian buffet process deep generative models. arXiv preprint arXiv:1402.3427.
[8] (2017) Disentangling inter-subject variations: automatic localization of ventricular tachycardia origin from 12-lead electrocardiograms. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 616–619.
[9] (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems 31.
[10] (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
[11] (2016) Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.
[12] (2016) Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
[13] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[14] (2017) Nonparametric variational autoencoders for hierarchical representation learning. In ICCV, pp. 5104–5112.
[15] (2011) The Indian buffet process: an introduction and review. Journal of Machine Learning Research 12 (Apr), pp. 1185–1224.
[16] (2016) Skin lesion analysis toward melanoma detection: a challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC). arXiv preprint arXiv:1605.01397.
[17] (2016) beta-VAE: learning basic visual concepts with a constrained variational framework.
[18] (2016) ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS.
[19] (2016) Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
[20] (2018) Disentangling by factorising. In International Conference on Machine Learning, pp. 2654–2663.
[21] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[22] (2014) Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589.
[23] (2013) Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).
[24] (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
[25] (2014) One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.
[26] (2017) Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848.
[27] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[28] (2018) Skin lesion analysis towards melanoma detection using deep learning network. Sensors 18 (2), pp. 556.
[29] (2018) Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359.
[30] (2015) The variational fair autoencoder. arXiv preprint arXiv:1511.00830.
[31] (2016) The Concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
[32] (2017) dSprites: disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/
[33] (2016) Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197.
[34] (1969) Bioelectric phenomena. Wiley Online Library.
[35] (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, pp. 1278–1286.
[36] (2017) Real-time localization of ventricular tachycardia origin from the 12-lead electrocardiogram. JACC: Clinical Electrophysiology 3 (7), pp. 687–699.
[37] (2017) Structured variational autoencoders for the beta-Bernoulli process. NIPS 2017 Workshop on Advances in Approximate Bayesian Inference.
[38] (2004) Three-point checklist of dermoscopy. Dermatology 208 (1), pp. 27–31.
[39] (2007) Stick-breaking construction for the Indian buffet process. In Artificial Intelligence and Statistics, pp. 556–563.
[40] (2018) VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics (AISTATS).
[41] (2018) Fair deep learning prediction for healthcare applications with confounder filtering. arXiv preprint arXiv:1803.07276.
[42] (2018) Localization of origins of premature ventricular contraction by means of convolutional neural network from 12-lead ECG. IEEE Transactions on Biomedical Engineering 65 (7), pp. 1662–1671.
[43] (2013) Learning fair representations. In International Conference on Machine Learning.