1 Introduction
Developing generative models for complex data is an important task in modern machine learning. Deep generative models address this problem by using deep neural networks to parameterize the distribution of the data. These methods have seen great success in a variety of settings shrivastava2017learning ; kingma2013auto. In this paper, we focus on multi-aspect data, consisting of multiple views, modalities, or other definitions of grouped observations. For example, imagine we have text in multiple languages; datasets with text, images, and categorical responses; or data collected from multiple, potentially heterogeneous sensors. Building deep generative models for such multi-aspect data still presents significant challenges.
We tackle this problem within the context of variational autoencoders (VAEs) kingma2013auto. These methods define complex encoder and decoder networks that map high-dimensional observations to a low-dimensional space, and then back again. Learning these complex maps requires a tremendous amount of data. Additionally, each observation must be fully observed at both training and test time. Often, our large (or even small) datasets have significant missingness. For our multi-aspect data, we assume missingness in the form of individual datapoints missing certain aspects (view, modality, etc.). For example, a sensor drops out, a translation is unavailable, or a modality is not present. The standard VAE is not directly applicable in such situations, as it treats the collection of aspects as one large input. Instead, previous work has focused on imputing the missing values a priori vedantam2017generative ; however, two-stage approaches are statistically suboptimal, and especially problematic in the structured missing setting we focus on.
We aim to leverage explicit correlations between the aspects to extend VAEs to handle multi-aspect data, and also allow us to utilize any available data. In particular, we propose a factorization of the encoder and decoder mappings over the aspects, and refer to the resulting model as factVAE. Our specification encourages each aspect to manifest itself on a sparse subset of the latent dimensions. The two critical insights are (i) a means of coherently combining aspect-specific encoders and (ii) ensuring consistent support on the latent space of each aspect-specific encoder/decoder pair. For the former, we leverage a product of experts framework for combining encoders, inspired by the method of vedantam2017generative. The result is a variational distribution on the latent space whose entropy decreases as more aspects are included. For the latter, we incorporate shared sparsity on the weights defining the output of the encoder and input of the decoder; the sparsity is encouraged through a group-sparse prior xu2015bayesian.
Side benefits of our sparsity include being able to better handle limited amounts of data and inferring interpretable relationships between latent space activations and aspects, as in ainsworth2018interpretable . Additionally, the resulting sparsity pattern can be used to inform us of which aspects are most informative in capturing dimensions of variability in our data. Fundamentally, the proposed framework yields a disentangled latent space where each aspect is linked to a distinct set of latent components. Correlations between aspects are captured by the overlapping supports of the latent representations.
We demonstrate factVAE on a variety of datasets. First, we analyze motion capture sequences, where aspects correspond to groups of joints forming limbs and the core. Next, we explore an image dataset consisting of multiple views of people’s heads, with each view corresponding to an aspect. Both datasets have a limited number of observations, a challenge for traditional VAE methods. Additionally, we simulate significant missing data by removing aspects (limbs for mocap and views for the image dataset), making traditional VAEs ill-suited. (The image dataset has naturally occurring missingness, as well.) Our method provides state-of-the-art reconstruction performance in these settings while also yielding interpretable latent spaces. The resulting model provides a unified and robust means for handling many types of multi-aspect data, even in the presence of potential missingness.
2 factVAE: Disentangling VAEs with factorized mappings
We start with the standard VAE formulation of an observation $x \in \mathbb{R}^D$ embedded into a $K$-dimensional latent representation $z \in \mathbb{R}^K$ kingma2013auto. The VAE consists of two components, an encoder referred to as the inference network and a decoder referred to as the generator. The inference network provides a mapping from a given observation $x$ to a variational distribution
(1) $q_\phi(z \mid x) = \mathcal{N}\big(z;\ \mu_\phi(x),\ \mathrm{diag}(\sigma^2_\phi(x))\big)$
on the latent space. Here, both the mean $\mu_\phi(x)$ and (diagonal) covariance $\mathrm{diag}(\sigma^2_\phi(x))$ are defined using deep neural networks with parameters $\phi$. The generator provides a mapping from a latent code, $z$, to a distribution on observations defined as:
(2) $z \sim \mathcal{N}(0, I)$
(3) $x \mid z \sim \mathcal{N}\big(\mu_\theta(z),\ \Sigma_\theta(z)\big)$
Here, the mean and (diagonal) covariance of the generator’s distribution on observations, $\mu_\theta(z)$ and $\Sigma_\theta(z)$, respectively, are specified via deep neural networks with parameters $\theta$.
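To make the two VAE ingredients above concrete, the following is a minimal sketch (not the authors' implementation) of a reparameterized draw from a diagonal-Gaussian variational distribution and of its KL divergence to a standard normal prior. The function names, the log-variance parameterization, and the example encoder outputs are assumptions introduced only for illustration.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Reparameterized draw z = mean + std * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


rng = random.Random(0)
mean, log_var = [0.5, -1.0], [0.0, -2.0]  # hypothetical encoder outputs
z = sample_latent(mean, log_var, rng)     # one latent sample
kl = kl_to_standard_normal(mean, log_var) # regularization term of the ELBO
```

Both quantities are available in closed form only because the variational distribution and the prior are normal, which is the same property the product of experts construction relies on later.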
When we are faced with multi-aspect data (which could be a collection of multimodal data sources, multiple views, or data naturally decomposing into groups of observations), the VAE treats all dimensions jointly, attempting to learn a complex inference network and generator oblivious to the underlying structure. Recently, the output interpretable VAE (oiVAE) ainsworth2018interpretable was proposed to handle such grouped data and leverage within-group correlation structure and between-group sparse dependencies. One of its key goals is also to uncover interpretable relationships between the dimensions of the latent code and the observation groups. In particular, each latent dimension generates a sparse subset of the observation groups.
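Formally, the oiVAE is specified as follows. Write $x$ as $[x^{(1)}, \ldots, x^{(G)}]$ for some $G$ groups. The oiVAE defines group-specific generators $f^{(g)}_{\theta_g}$ with diagonal covariances $D_g$ as follows:
(4) $z \sim \mathcal{N}(0, I)$
(5) $x^{(g)} \mid z \sim \mathcal{N}\big(f^{(g)}_{\theta_g}(W^{(g)} z),\ D_g\big), \quad g = 1, \ldots, G$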
Here, $W^{(g)}$ is a group-specific linear transformation between the latent representation $z$ and the group generator $f^{(g)}_{\theta_g}$. Critically, the latent representation $z$ is shared over all the group-specific generators. A benefit of this formulation, beyond supporting group-specific generators, is that one can interpret the relationships between group-specific activations through the latent representation, just as in a standard linear latent factor model. To further aid in interpretability, and to better handle limited data scenarios (a situation that typically plagues standard VAEs), the oiVAE specification places a sparsity-inducing prior Kyung:2010 on the columns of the latent-to-group matrix $W^{(g)}$. When the $j$th column of the weight matrix, $W^{(g)}_{\cdot,j}$, is all zeros, the $j$th latent dimension, $z_j$, has no influence on group $g$. In order to avoid learning small latent-to-group weights only to be re-amplified by downstream network layers, a standard normal prior is also placed on the parameters of each generative network, $\theta_g$.
Although the oiVAE focuses the generator on group-structured observations, providing both interpretability and an ability to handle more limited data scenarios, the framework cannot directly handle multimodal data sources or missing groups of observations. In particular, the inference network is the same as in the standard VAE, treating all observations jointly. For multimodal data, one could imagine leveraging architectures deployed in other neural network situations, such as combining modality-specific features extracted with an appropriate neural network model ngiam2011multimodal ; srivastava2012multimodal. However, this entangles all of the groups into all of the latent dimensions, making it hard to distinguish which dimensions encode which modalities. Furthermore, this approach still cannot handle missing groups of observations. (Note that the oiVAE generator could straightforwardly handle multimodal data by defining different likelihoods on different groups.)
We propose a fully factorized VAE (factVAE) that considers a group-wise factorization of the inference network, as well. FactVAE fully supports inference with missing aspects by utilizing an inference network that flexibly aggregates approximate posteriors only over the available groups. The most efficient information flow occurs when a given observation group only influences the dimensions of the variational distribution corresponding to those responsible for generating that observed group. (We assume the latter is sparse, following the oiVAE generator specification.) See Fig. 1. Two critical questions remain:

1. How do we combine these group-specific inference networks into a coherent variational distribution?
2. How do we encourage consistent sparsity patterns to appear on the inference and generator sides?
These two questions must be answered jointly, since any design choice addressing the first question must be careful not to violate the sparse group-latent component relationship constraints of the second. It is worth noting that averaging parameters (or, equivalently, layer activations) does not satisfy these criteria. Even if there are sparse connections between the groups and the parameters, the sparsity introduces zeros into the average, and those zeros still exert some influence on latent components that should be independent of a group, violating the second requirement. For factVAE, we apply a product of experts (PoE) formulation in which each group $g$ assumes its own inference network, outputting its own approximate posterior $q(z \mid x^{(g)})$. These individual variational distributions are aggregated by taking their product,
(6) $q(z \mid x) \propto \prod_{g} q(z \mid x^{(g)})$
See Fig. 1 (right). In particular, we choose normal distributions for the individual approximate posteriors, allowing us to compute their product in closed form.^1 A product of normal experts has the convenient property that its entropy only decreases as more observations are included, meaning that our uncertainty monotonically decreases the more we observe. This is in contrast to other approaches which promote the opposite behavior. Our choice here is related to that made in vedantam2017generative. We further compare and contrast our formulation with alternatives in Section 3.
^1 A product of normal densities with means $\mu_g$ and precision matrices $\Lambda_g$ is again normal, with precision $\Lambda = \sum_g \Lambda_g$ and mean $\mu = \Lambda^{-1} \sum_g \Lambda_g \mu_g$.
Sparsity is respected by parameterizing each group's approximate posterior by a mean and a diagonal precision matrix. By enforcing sparsity in the precision, latent components that are independent of the group have infinite variance in that group's approximate posterior. That is, this group is completely uninformative of that latent dimension. We define where and . To have consistent encoder/decoder sparsity, we enforce the column sparsity pattern of the generator's latent-to-group matrix to be the same as the row sparsity of the corresponding inference-network weights. To this end, we stack the two matrices for each group and then apply a group-sparsity prior on the columns of the stacked matrix. This allows us to simultaneously learn the sparsity pattern across both the inference and generative networks, allowing the model to learn which groups and latent components should interact throughout training.
Many Bayesian sparsity-inducing priors exist throughout the literature (xu2017bayesian; carvalho2009handling). A popular class of them is the global-local shrinkage priors. Despite the flexibility of many of these priors, they are not amenable to fast variational inference and do not recover exact zeros. Instead we use a hierarchical Bayesian group-lasso prior on the columns of the stacked matrix in order to encourage entire columns to be shrunk to zero. The prior takes the following form Kyung:2010 :
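(7) $\gamma_{gj}^2 \sim \mathrm{Gamma}\left(\frac{m+1}{2},\ \frac{\lambda^2}{2}\right)$
(8) $\widetilde{W}^{(g)}_{\cdot,j} \mid \gamma_{gj}^2 \sim \mathcal{N}\big(0,\ \gamma_{gj}^2 I\big)$
where $\widetilde{W}^{(g)}$ denotes the stacked weight matrix for group $g$, Gamma($\cdot,\cdot$) is defined by shape and rate, and $m$ denotes the number of rows in each $\widetilde{W}^{(g)}$. The rate parameter $\lambda$ defines the amount of sparsity, with larger $\lambda$ implying more sparsity in the relationships between latent components and groups. Marginalizing over $\gamma_{gj}^2$ induces group sparsity over the columns of $\widetilde{W}^{(g)}$; the MAP of the resulting posterior is equivalent to a group lasso penalized objective.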
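The product of experts aggregation in Eq. (6) can be illustrated with a short sketch, assuming diagonal-Gaussian experts parameterized by means and precisions. The function names are hypothetical, and including a standard normal prior as one of the experts is an assumption made here so that every latent dimension has positive total precision; a precision of zero encodes the "infinite variance" case described above.

```python
import math


def product_of_experts(means, precisions):
    """Combine diagonal-Gaussian experts into their renormalized product.

    means[g][k] and precisions[g][k] are the mean and precision of expert g on
    latent dimension k. A precision of 0.0 marks an uninformative expert.
    """
    num_dims = len(means[0])
    total_prec = [sum(p[k] for p in precisions) for k in range(num_dims)]
    if any(t <= 0.0 for t in total_prec):
        raise ValueError("each latent dimension needs an informative expert")
    mean = [sum(p[k] * m[k] for m, p in zip(means, precisions)) / total_prec[k]
            for k in range(num_dims)]
    return mean, total_prec


def gaussian_entropy(precision):
    """Differential entropy of a diagonal Gaussian with the given precisions."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e / p) for p in precision)


# Hypothetical experts as (mean, precision) pairs over two latent dimensions.
prior = ([0.0, 0.0], [1.0, 1.0])     # assumed standard normal prior expert
expert_a = ([1.0, 0.0], [4.0, 0.0])  # says nothing about dimension 1
expert_b = ([3.0, 2.0], [5.0, 1.0])

mean_a, prec_a = product_of_experts(
    [prior[0], expert_a[0]], [prior[1], expert_a[1]])
mean_ab, prec_ab = product_of_experts(
    [prior[0], expert_a[0], expert_b[0]], [prior[1], expert_a[1], expert_b[1]])
```

In this example, expert A leaves dimension 1 at its prior value because its precision there is zero, and adding expert B increases every total precision, so the entropy of the aggregate decreases.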
2.1 Optimizing the factVAE
We train the factVAE by adapting standard stochastic variational inference for VAEs to handle missing data. At training time we use the available groups to produce individual approximate posteriors, which are aggregated through the product of experts formulation. With the aggregated approximate posterior in hand, we can sample through the model, producing a likelihood. Finally, we can marginalize out groups that are unobserved, and evaluate the likelihood only on the observed groups.
More specifically, we deploy collapsed variational inference over the scale parameters of the sparsity prior,
(9) 
with the subtle difference that we have absorbed the inference networks’ parameters (previously ) into since the sparsity prior is shared across both parameters in inference and generative networks.
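Evaluating the likelihood only on the observed groups can be sketched as follows, assuming diagonal-Gaussian likelihoods that are conditionally independent across groups given the latent code, so that marginalizing out a missing group simply drops its term. The function names and the convention of marking a missing group with `None` are hypothetical.

```python
import math


def gaussian_log_density(x, mean, var):
    """Log density of a diagonal Gaussian evaluated at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def observed_log_likelihood(groups, recon_means, recon_vars):
    """Sum group log-likelihoods, skipping groups marked missing with None."""
    total = 0.0
    for x_g, mean_g, var_g in zip(groups, recon_means, recon_vars):
        if x_g is None:  # missing group: marginalized out, contributes nothing
            continue
        total += gaussian_log_density(x_g, mean_g, var_g)
    return total


# Two groups; the second is missing, so its reconstruction is ignored.
ll = observed_log_likelihood(
    groups=[[0.0], None],
    recon_means=[[0.0], [5.0]],
    recon_vars=[[1.0], [1.0]],
)
```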
To optimize the objective, we alternate between stochastic gradient steps on the network parameters and proximal gradient steps on the sparse weight matrices, since our hierarchical Bayesian prior admits an efficient proximal operator (ainsworth2018interpretable; Parikh:2013). See Algorithm 1 for pseudocode.
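For a group lasso penalty on columns, the proximal operator is the standard group soft-thresholding map. The sketch below is an illustration under assumptions, not the paper's Algorithm 1: the function name is hypothetical, and `threshold` is assumed to equal the step size times the sparsity parameter.

```python
import math


def prox_group_lasso_columns(W, threshold):
    """Group soft-thresholding applied to each column of W (a list of rows).

    Each column w is mapped to w * max(0, 1 - threshold / ||w||_2), so columns
    whose norm is at most `threshold` become exactly zero.
    """
    num_rows, num_cols = len(W), len(W[0])
    out = [row[:] for row in W]
    for j in range(num_cols):
        norm = math.sqrt(sum(W[i][j] ** 2 for i in range(num_rows)))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0.0 else 0.0
        for i in range(num_rows):
            out[i][j] = scale * W[i][j]
    return out


W = [[3.0, 0.1],
     [4.0, 0.2]]
W_new = prox_group_lasso_columns(W, threshold=1.0)
# Column 0 (norm 5) is shrunk by a factor of 0.8; column 1 is set to zero.
```

Unlike a plain gradient step on the penalty, this update produces exact zeros, which is what allows the sparsity pattern to be read off the learned weights.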
In our training of factVAE, for all of our experiments, we also handicap the inference network by randomly dropping out groups to promote robustness in the presence of missing data even when the dataset itself is complete. This has the added benefit of encouraging the model to learn to sample across groups, as opposed to simply performing joint reconstruction.^2
^2 We will release code upon acceptance.
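One way to realize this random group dropout is sketched below. The function name, the per-group drop probability, and the rule that at least one observed group is always kept are assumptions made for illustration; the text does not specify these details.

```python
import random


def drop_groups(observed, drop_prob, rng):
    """Randomly hide observed groups from the inference network.

    `observed` is a list of booleans (True = group present in the data).
    At least one observed group is always kept so the encoder has input.
    """
    present = [g for g, is_obs in enumerate(observed) if is_obs]
    if not present:
        raise ValueError("need at least one observed group")
    kept = [g for g in present if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(present)]
    keep_mask = [False] * len(observed)
    for g in kept:
        keep_mask[g] = True
    return keep_mask


rng = random.Random(0)
# Group 2 is missing in the data, so it can never be fed to the encoder.
mask = drop_groups([True, True, False, True], drop_prob=0.5, rng=rng)
```

Under this sketch the mask only restricts which groups enter the encoder; the likelihood can still be evaluated on every group that is actually observed, which is what pushes the model toward cross-group generation.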
3 Related work
Although there are some previous works investigating deep generative models for multi-aspect data, to our knowledge none learn a disentangled representation through sparsity or offer any results with more than a couple of groups (typically two). The conditional VAE (CVAE) (sohn2015learning) considered extending the VAE framework to handle conditional distributions between fixed input and output domains. As such, the CVAE only supports conditional sampling from the input to the output domains. However, recent work has begun to tackle the issue of modeling the joint distribution.
The joint multimodal VAE (JMVAE) (suzuki2016joint) effectively endows each group with its own inference network and combines them through a mixture distribution with uniform weights. The central idea is to learn two separate VAEs that share a latent representation and then use their individual inference networks at test time based on whichever group is present. The authors mention that extending the model to more than two groups is possible, although we found that it does require some adjustment. In particular, it is not clear how to reconstruct when a subset of groups is available, as opposed to just one. The Appendix of (vedantam2017generative) notes that the JMVAE model is equivalent to a uniform mixture distribution over the individual approximate posteriors: $q(z \mid x) = \frac{1}{G} \sum_{g} q(z \mid x^{(g)})$. We refer to this slight reinterpretation of the model as “JMVAE+”. We also investigated learning the mixture weights, but did not find that it produced any meaningful benefit. The JMVAE+ interpretation indicates an approach to handling inference with multiple available groups, but it brings along its own issues.
First, a mixture of Gaussians can only increase in entropy as more components are added, meaning that the JMVAE becomes less certain as it receives more data. This is in contrast with factVAE, which only ever decreases its uncertainty with the addition of new information. Furthermore, the mixture distribution indicates that a reconstruction from the model must be based on information from only one input modality. Therefore, JMVAE is incapable of fusing information across groups. Finally, there is no closed form for the Kullback–Leibler divergence that appears in the corresponding ELBO, requiring an additional lower bound.
A clever approach was taken in (vedantam2017generative) for learning visual models of both images $x$ and their binary features $y$, e.g., wearing_hat. That work also applied a product of experts (PoE) posterior approximation, but only for the binary features. In total it had three distinct inference networks: $q(z \mid x)$, $q(z \mid y)$, and $q(z \mid x, y)$. While tractable for two groups, this means that extending the work to handle more than two groups would require an exponential blowup in the number of inference networks. For our Faceback example in Section 4.3, with its 7 views, that would necessitate $2^7 - 1 = 127$ networks! We take inspiration from their application of PoE for aggregating approximate posteriors from the inference networks, but we apply it across all modalities, obviating the need for a combinatorial number of inference networks. In addition, we support missingness and emphasize sparsity in the learned representation.
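The count of 127 follows from needing one inference network per non-empty subset of groups. A small sketch (with a hypothetical helper name) enumerates these subsets directly and contrasts the count with the one-network-per-group design used here.

```python
from itertools import combinations


def nonempty_subsets(num_groups):
    """List every non-empty subset of group indices 0..num_groups-1."""
    groups = range(num_groups)
    return [subset
            for size in range(1, num_groups + 1)
            for subset in combinations(groups, size)]


subsets = nonempty_subsets(7)
print(len(subsets))  # 127, i.e. 2**7 - 1 subset-specific networks
print(7)             # versus one network per group with PoE aggregation
```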
An orthogonal direction was considered with the output interpretable VAE (oiVAE) (ainsworth2018interpretable). oiVAE aimed to model grouped observations with sparse, interpretable group-latent component interactions but without considering missing or multimodal data. Specifically, oiVAE introduced the concept of disentanglement via structured sparsity between the latent codes and the group generators. However, oiVAE only factorizes the model on the generative side, leaving the inference network oblivious to the multi-aspect nature of the data. This makes handling missing data impossible, and leads to a less elegant approach when faced with different modalities that require their own inference network architectures.
Beyond neural networks, modeling multi-aspect and multimodal data has been considered in a number of previous works. For instance, manifold relevance determination (MRD) (damianou2012manifold) attacks a similar problem, but in the context of Gaussian process latent variable models (lawrence2004gaussian). Specifically, MRD learns group-specific (“private”) subspaces of the latent space and a global (“shared”) subspace. However, no prior is placed on the actual weights associating these subspaces to each of the generative models, meaning that although this “private”/“shared” behavior may arise, it is not directly encouraged and will not exactly prune dimensions for each group. Though the framework could theoretically extend beyond two groups, the authors only considered two in their work. There are a number of important decisions to be made in extending this framework to more groups. We leave that to future work and focus on comparisons with VAE-based frameworks.
4 Experiments
4.1 Bars simulated data
To assess the ability of our model to handle simple structured data with well-understood correlations, we created synthetic images with rows of pixels randomly activated in each (see Figure 2 (left)). The image is split into its four quadrants, each one becoming a group. Since all the activity is horizontal, we should expect there to be latent components shared between the top two quadrants and between the bottom two quadrants, but no shared components between the left two quadrants and the right two quadrants. In Figure 2 (right) we can see that this is exactly the case when the group-sparsity prior is active; in contrast, when it is switched off (akin to a VAE with factored networks), the sparsity pattern disappears completely. We can also see that the model has learned to successfully reconstruct the right half of images when presented with only the left half, based on its learned structure across groups in the model.
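A dataset of this kind can be generated with a few lines of code. The sketch below is an illustration only: the image size, the row activation probability, and the function names are assumptions, since the text does not state them.

```python
import random


def make_bars_image(size, row_prob, rng):
    """Binary size x size image in which each row is fully on with prob row_prob."""
    return [[1.0 if on else 0.0] * size
            for on in (rng.random() < row_prob for _ in range(size))]


def split_quadrants(image):
    """Return flat quadrants: top-left, top-right, bottom-left, bottom-right."""
    half = len(image) // 2
    first, second = range(0, half), range(half, len(image))

    def block(rows, cols):
        return [image[r][c] for r in rows for c in cols]

    return [block(first, first), block(first, second),
            block(second, first), block(second, second)]


rng = random.Random(0)
image = make_bars_image(size=8, row_prob=0.5, rng=rng)  # hypothetical settings
groups = split_quadrants(image)  # four groups, one per quadrant
```

Because every active row spans the full width, the left and right quadrant in the same half of the image are identical, which is the horizontal correlation the model is expected to recover.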
4.2 Motion capture
In order to assess factVAE’s ability to recover interpretable sparsity in real-world data, we built and trained a factVAE model on motion capture data from CMU’s motion capture database. The results of this model are presented in Figures 3 and 4. We divided the skeleton into 5 major groups: the back and head (“core”), right arm, left arm, right leg, and left leg.
In Figure 4 we can clearly see that factVAE has successfully learned a sparse, disentangled representation for the data in which each latent component interacts with only a few of the groups. The exceptions to this are components 1 and 4, which we found encode the position in the overall stride of the subject’s walk. We quantitatively see the benefits of this learned representation in Table 1, where we compute held-out log likelihood of a walking sequence, comparing to our model without sparsity, as well as to the JMVAE+ model described in Section 3. The benefits of our framework are clear, especially in limited training data scenarios.
In Figure 3 we show reconstructions conditioning only on joints in the core, including the head and back. Although only subtle motions can be seen visually, factVAE has successfully learned that rotations in the spine and head are correlated with the remaining groups and can be used to produce very accurate reconstructions of the overall pose.
4.3 Faceback^3
^3 “You take a picture of anybody’s face, and it’ll show you what the back of his head looks like. Faceback!” – Will Ferrell and Eva Mendes in “The Other Guys”
In order to further test factVAE’s ability to handle rich, complex data across many different views, we trained and evaluated factVAE on a face reconstruction task. We gathered images from the CVL Face Database (http://www.lrv.fri.unilj.si/facedb.html) and treated each view as a group, and each subject as an example. The dataset contains images of 114 subjects presented in 7 different poses: 5 rotational, and 2 smiling. This dataset is especially interesting since it actually contains missing data; not all subjects have images for all 7 poses. We use the first 100 subjects for training and the remainder for testing. We used an adaptation of the DCGAN (radford2015unsupervised, ) architecture for the inference and generative networks. See the Supplement for more details. Results are shown in Figure 5.
We evaluated the model by reconstructing each of the 7 views based on the other available views. All of the results presented are reconstructed images. As we can see, the factVAE is able to clearly reconstruct the training images (blue box) with no trouble, and it successfully picks up on salient features even on completely unseen inputs (orange box). For example, the final row shows the smiling-with-teeth group, and we can see clearly that the model is able to generate smiling-with-teeth images even for completely unseen subjects. We also include reconstructions for some subjects where the open-mouth smiling image was missing in the CVL Face Database (orange box). Though they are noisy, the model clearly captures smiling with teeth.^5
^5 The faces in this dataset were not centered in the images, leading to another source of reconstruction error.
5 Discussion
We proposed factVAE, a deep generative model that can handle multi-aspect data and is robust to training and reconstructing with missing data. Traditional deep generative models either cannot handle missing data in the first place or only partially address the issue of reconstruction with arbitrary conditioning. On the other hand, factVAE addresses the problem from the ground up, factorizing both the inference and generative networks across groups and coherently aggregating information through a product of experts approximate posterior. We also incorporate shared sparsity in order to disentangle latent components and groups. As a consequence, we can apply factVAE successfully even in limited data scenarios and extract interpretable information.
We demonstrated that factVAE is able to recover true sparsity in our synthetic bars experiment. On real-world data, including motion capture sequences and pictures of faces, we found that factVAE quantitatively outperformed other techniques on reconstruction metrics and qualitatively produced realistic samples even when conditioning on as little as a single group. Investigating the learned sparse group-latent component relationships, we found that factVAE produces interpretable and meaningful modes of variation. Encouraged by these results, we are excited for further exploration of structure and sparsity in deep generative models, and for investigating their impact in other application domains.
References
[1] Samuel Ainsworth, Nicholas Foti, Adrian KC Lee, and Emily Fox. Interpretable VAEs for nonlinear group factor analysis. arXiv preprint arXiv:1802.06765, 2018.
[2] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, page 6, 2017.
[3] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[4] Ramakrishna Vedantam, Ian Fischer, Jonathan Huang, and Kevin Murphy. Generative models of visually grounded imagination. arXiv preprint arXiv:1705.10762, 2017.
[5] Xiaofan Xu, Malay Ghosh, et al. Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909–936, 2015.
[6] M. Kyung, J. Gill, M. Ghosh, and G. Casella. Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis, 5(2):369–411, 2010.
[7] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In International Conference on Machine Learning, pages 689–696, 2011.
[8] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2222–2230, 2012.
[9] Zemei Xu, Daniel F Schmidt, Enes Makalic, Guoqi Qian, and John L Hopper. Bayesian sparse global-local shrinkage regression for grouped variables. arXiv preprint arXiv:1709.04333, 2017.
[10] Carlos M Carvalho, Nicholas G Polson, and James G Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80, 2009.
[11] N. Parikh and S. Boyd. Proximal algorithms, 2013.
[12] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
[13] Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.
[14] Andreas Damianou, Carl Ek, Michalis Titsias, and Neil Lawrence. Manifold relevance determination. arXiv preprint arXiv:1206.4610, 2012.
[15] Neil D Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems, pages 329–336, 2004.
[16] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.