Traditionally, classification tasks can be divided into supervised learning, if we have fully labelled data, and semi-supervised learning, if that labelling is sparse; if there is no labelled data at all, we can instead perform clustering via unsupervised learning. Semi-supervised learning is common, as in many problem domains we do have some labelled data, and the amount of unlabelled data is much larger than the amount of labelled data.
However, in reality, within our sparsely-labelled dataset there may be classes of data that are entirely unlabelled. That is, there are no labelled exemplars of them, only unlabelled instances. This can be due to selection bias, where the labelled data is drawn from a biased sample of the overall data distribution, or because rare class categories are entirely unobserved in the labelled dataset, appearing only in the unlabelled data.
A hypothetical example of a dataset of this type is as follows. Consider a set of medical images, such as scans of tumours. We then obtain ground-truth labels giving the variety of tumour for some small proportion of all the scans we have. Imagine that we do not happen to capture all distinct types of tumours in this smaller labelled dataset. An unlabelled image could be from one of the varieties that is captured in the labelled dataset, or it could be of another variety. We are then not in the semi-supervised regime, but nor do we want to treat the problem as unsupervised and discard our limited label data.
Naïvely applying semi-supervised learning algorithms to this data will attribute all data in the test set merely to the classes represented in the labelled dataset. If we attempt to solve this problem by expanding the dimensionality of the discrete label-space, we find that for some deep probabilistic generative semi-supervised learning algorithms the model does not make good use of these additional components: classes of data found only in the unlabelled dataset are not separated out. This is because these models cannot perform clustering, even when clustering is a sub-problem. For those classes of data not found in the labelled training set, we must perform unsupervised learning.
We are interested in this case, where an unlabelled instance of data could be from one of the sparsely-labelled classes or from an entirely-unlabelled class. We call this semi-unsupervised learning. Here we are jointly performing semi-supervised learning on sparsely-labelled classes, and unsupervised learning on completely unlabelled classes. This requires a model that can learn successfully in both unsupervised and semi-supervised regimes. We build two new models out of two previous deep generative models proposed for semi-supervised learning. The new models can learn in the unsupervised case as well as in the semi-supervised case.
Semi-unsupervised learning has similarities to some varieties of zero-shot learning (ZSL), where deep generative models have been of interest Weiss et al. (2016), but in zero-shot learning one has access to auxiliary side information (commonly an ‘attribute vector’) for data at training time, which we do not. Our regime is thus related to transductive generalised ZSL, but with no side information Xian et al. (2018). It also has links to transfer learning Cook et al. (2013). We are solving two tasks simultaneously: learning our model semi-supervised over the sparsely-labelled classes, and performing clustering over the entirely unlabelled classes. The representations in the learnt continuous latent space are shared between these two tasks. However, we do not have the usual separation between ‘source’ and ‘target’ domains in our problem specification. Because an unlabelled data point could be from any class, be it one represented in the ultra-sparse labelled dataset or one not found there, we learn to perform both tasks jointly from all available data in a single training run.
We here extend two semi-supervised deep generative models to enable them to learn in the unsupervised case. We do this by enforcing a mixture model in their continuous latent space. Our models are called GM-DGM and AGM-DGM.
We then demonstrate them in both semi-unsupervised and unsupervised regimes. Our models show that we can learn in the semi-unsupervised case, with accuracy higher than if we treated the problem as either semi-supervised or unsupervised.
Further, one of our models achieves state-of-the-art clustering performance on MNIST among probabilistic deep generative models.
2 Related Work
DGMs have been widely used for both unsupervised and semi-supervised learning. Numerous models build on Variational Autoencoders (VAEs) Kingma & Welling (2013); Rezende et al. (2014), a class of deep generative model. For semi-supervised learning there are the models M2 Kingma et al. (2014) and ADGM Maaløe et al. (2016). Further, Maaløe et al. (2016) also propose the Skip Deep Generative Model (SDGM), which shows superior performance to the ADGM on some datasets.
For clustering, both VaDE Jiang et al. (2017) and GM-VAE Dilokthanakul et al. (2017) extend VAEs with some form of mixture model in their learnt, continuous latent space. VaDE has the same forward model as the first model we will propose, but it uses Bayes’ rule to define its classifier/variational posterior over labels, rather than having a separate network parameterising it. The GM-VAE has a mixture of Gaussians in one of its stochastic layers, where this mixture is conditioned on another stochastic variable. The Cluster-aware Generative Model (CaGeM) Maaløe et al. (2017) can, like our models, learn in both unsupervised and semi-supervised regimes. However, the model’s performance at clustering data into components corresponding to ground-truth classes is not given.
Generative Adversarial Networks (GANs) have also been used to approach semi-supervised learning and clustering. Categorical Generative Adversarial Networks (CatGANs) Springenberg (2016) can learn in both regimes. The adversarial autoencoder (AAE) Makhzani et al. (2016), combining a GAN with a probabilistic model, can also learn in both regimes.
2.1 Improvements to VAE-derived models
There are various interlinking avenues that can be taken to improve the performance of VAE-like models. We list them in Appendix A. We do not avail ourselves of that work for this paper, so as to clearly isolate the effect of changing the probabilistic structure of the generative model of our DGMs.
3 Deep Generative Models
There are different varieties of deep generative models: fully probabilistic deep generative models, which define a valid joint distribution over both observed data and latent variables; and other types that do not. We choose to pursue the first type for the tasks at hand. In these models, we have a probabilistic graphical model where the parameters of the distributions within that model are themselves parameterised by neural networks. Due to the coherency of probabilistic modelling, these models can handle missing observations in a principled way. Further, within this framework we can perform partial conditioning to obtain distributions of importance to us.
3.1 Variational Auto-Encoder
The simplest deep generative model of this type is a variational autoencoder Rezende et al. (2014); Kingma & Welling (2013), the deep version of factor analysis. Here there is a continuous unobserved latent variable z and observed data x. The joint probability is p_θ(x, z) = p_θ(x|z)p(z), with p(z) = N(z|0, I), where the parameters of the likelihood p_θ(x|z) are given by neural networks with parameters θ. As exact inference for p_θ(z|x) is intractable, it is standard to perform stochastic amortised variational inference to obtain an approximation to the true posterior Rezende et al. (2014).
To obtain a VAE, introduce a recognition network q_φ(z|x) = N(z|μ_φ(x), σ²_φ(x)), where μ_φ and σ_φ are neural networks with parameters φ. We then jointly optimise over {θ, φ} to maximise the evidence lower bound
L(x) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)).
For the expectation over z in L(x) we take Monte Carlo (MC) samples. To take derivatives through these samples w.r.t. φ we use the ‘reparameterisation trick’, rewriting a sample from a Gaussian as a deterministic function of a sample ε from N(0, I):
z = μ_φ(x) + σ_φ(x) ⊙ ε, ε ∼ N(0, I),
thus we can differentiate a sample w.r.t. φ, so we can differentiate our MC approximation w.r.t. φ.
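As a concrete sketch of the reparameterisation trick (an illustrative NumPy example, not the paper's implementation; all variable names and sizes are our own choices), a Gaussian sample becomes a deterministic, differentiable function of the network outputs μ and log σ², with the randomness pushed into an exogenous ε:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterise(mu, log_var, eps):
    # z = mu + sigma * eps, with eps ~ N(0, I) drawn outside the computation
    # graph, so gradients can flow through mu and log_var.
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical encoder outputs for one data point, broadcast over samples.
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, 0.0])  # i.e. sigma = 1 in each dimension
eps = rng.standard_normal((10000, 2))
z = reparameterise(mu, log_var, eps)
# Empirically, z has mean close to mu and unit standard deviation.
```

In an automatic-differentiation framework, the same expression lets the Monte Carlo estimate of the ELBO be differentiated with respect to φ.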
3.2 Semi-supervised Learning with Deep Generative Models
To perform semi-supervised classification with a deep generative model, it is necessary to introduce a sometimes-observed discrete class variable y into the generative model and into the recognition networks. For a semi-supervised dataset the evidence lower bound for the model consists of two terms, one for our unlabelled data, where y is a latent variable to be inferred:
U(x) = E_{q_φ(y,z|x)}[log p_θ(x, y, z) − log q_φ(y, z|x)],
and the other for our labelled data, where y is observed:
L(x, y) = E_{q_φ(z|x,y)}[log p_θ(x, y, z) − log q_φ(z|x, y)].
First let us consider the model M2 from Kingma et al. (2014). Here:
p_θ(x, y, z) = p_θ(x|y, z) p(y) p(z),
with p(z) = N(z|0, I), and p(y) = Cat(y|π) is the discrete prior on y. Via simple manipulation one can show that U(x) = Σ_y q_φ(y|x) (L(x, y) − log q_φ(y|x)).
Note that q_φ(y|x), which is to be our trained classifier at the end, only appears in U(x), so it would only be trained on unlabelled data. To remedy this for M2 in Kingma et al. (2014), motivated by considering a Dirichlet hyperprior on π, we add to the loss the cross entropy between the one-hot ground-truth label and q_φ(y|x), weighted by a factor α. So the overall objective with unlabelled data D_u and labelled data D_l is the sum of the evidence lower bounds for all data plus this classification loss:
J = Σ_{x ∈ D_u} U(x) + Σ_{(x,y) ∈ D_l} L(x, y) + α Σ_{(x,y) ∈ D_l} log q_φ(y|x).
Eq 8 is of particular importance to us as it is also the evidence lower bound for semi-unsupervised learning. There, however, unlabelled draws can have a corresponding class that is never observed in the labelled data.
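To make the shape of this combined objective concrete, here is a minimal NumPy sketch (illustrative only: the per-example lower bounds and classifier outputs are stand-in numbers, and `alpha` is the weighting factor from the text):

```python
import numpy as np

def m2_objective(elbo_labelled, elbo_unlabelled, q_y_labelled, y_onehot, alpha):
    """Sum of evidence lower bounds over all data, plus the alpha-weighted
    cross-entropy classification term on the labelled data only."""
    classification = np.sum(y_onehot * np.log(q_y_labelled))
    return elbo_labelled.sum() + elbo_unlabelled.sum() + alpha * classification

# Stand-in per-example bounds: 2 labelled and 3 unlabelled points, 2 classes.
elbo_l = np.array([-90.0, -95.0])
elbo_u = np.array([-100.0, -110.0, -105.0])
q_y = np.array([[0.8, 0.2], [0.1, 0.9]])   # classifier outputs q(y|x)
y = np.array([[1.0, 0.0], [0.0, 1.0]])     # one-hot ground-truth labels

J = m2_objective(elbo_l, elbo_u, q_y, y, alpha=0.1)
```

The classification term trains the classifier directly on labelled data, which the lower bounds alone would not do.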
3.3 Auxiliary Deep Generative Models
Agakov & Barber (2004) introduce a method to obtain a richer variational distribution. They add an additional latent variable a, which enters the joint distribution as a conditional distribution given all other variables of the model:
p(a, x, y, z) = p(a|x, y, z) p(x, y, z).
By construction the original model is obtained when a is marginalised out: ∫ p(a, x, y, z) da = p(x, y, z). This auxiliary latent variable enables us to define a more expressive set of variational distributions.
3.3.1 Motivation for Auxiliary Variables
By adding the auxiliary variable a we can now obtain a richer family of mappings between x, y and z in our variational posterior, due to their connection through a. Consider the approximate marginal distributions for the latent variables in this model, both for when y is latent:
q_φ(z|x) = ∫ Σ_y q_φ(z|a, y, x) q_φ(y|a, x) q_φ(a|x) da,
and for when y is observed:
q_φ(z|x, y) = ∫ q_φ(z|a, y, x) q_φ(a|x) da.
3.3.2 Combining Auxiliary Variables with DGMs
Inserting a into model M2 gives us the semi-supervised Auxiliary Deep Generative Model (ADGM) of Maaløe et al. (2016):
p_θ(a, x, y, z) = p_θ(a|x, y, z) p_θ(x|y, z) p(y) p(z).
Here p_θ(x|y, z) is Gaussian or Bernoulli depending on whether x is continuous or discrete. The inference model is:
q_φ(a, y, z|x) = q_φ(z|a, y, x) q_φ(y|a, x) q_φ(a|x).
As in Section 3.2 we have two forms for the evidence lower bound. When y is latent:
U(x) = E_{q_φ(a,y,z|x)}[log p_θ(a, x, y, z) − log q_φ(a, y, z|x)],
and when y is observed:
L(x, y) = E_{q_φ(a|x) q_φ(z|a,y,x)}[log p_θ(a, x, y, z) − log q_φ(a|x) − log q_φ(z|a, y, x)].
4 Posterior Collapse in Unsupervised and Semi-unsupervised Learning
For both models, M2 and ADGM, when there is no label data at all, that is, when we are just optimising Σ_x U(x), the model can fail to learn an informative distribution for q_φ(y|x) (this effect is also discussed in Dilokthanakul et al. (2017)). We have found that this collapse also takes place over the subspace of unlabelled classes when carrying out semi-unsupervised learning: often q_φ(y|x) collapses to the prior p(y).
The equivalent posterior collapse in z is well studied, see Burda et al. (2016); Kingma et al. (2016); Maaløe et al. (2017); Sønderby et al. (2016); Chen et al. (2016): the variational posterior for z closely matches the prior on z. This is most acute when the encoder and decoder networks are highly expressive, for instance a deep autoregressive model such as a PixelRNN/CNN van den Oord et al. (2016a, b).
To understand posterior collapse in q_φ(y|x), let us write out Eq 2 for M2 in an expanded form:
U(x) = E_{q_φ(y|x) q_φ(z|x,y)}[log p_θ(x|y, z)] − E_{q_φ(y|x)}[KL(q_φ(z|x, y) ‖ p(z))] − KL(q_φ(y|x) ‖ p(y)).
We can see that in maximising U(x) we are minimising the divergence between q_φ(y|x) and p(y). Commonly p(y) is taken to be uniform, and we find that q_φ(y|x) becomes uniform too during training. However, it seems plausible that the local minimum associated with having q_φ(y|x) = p(y) could be escaped by achieving better reconstruction (maximising the expected log-likelihood term) through an informative, non-degenerate q_φ(y|x).
Shu (2016) sheds further light on the posterior collapse of q_φ(y|x) for model M2. Reiterating the arguments, consider the variational posterior for z in M2:
q_φ(z|x) = Σ_y q_φ(y|x) q_φ(z|x, y).
If q_φ(y|x) is a confident classifier, that is, its entropy is low, then q_φ(z|x) is dominated by one Gaussian component; in the limit of a highly confident classifier it reduces to a single Gaussian. However, if q_φ(y|x) outputs a high-entropy distribution, then q_φ(z|x) is a mixture of Gaussians with as many components as the number of classes. The claim is that this richer variational distribution, a mixture obtained by having an uninformative classifier, enables M2 to achieve a better evidence lower bound than having separate Gaussians, with a particular one chosen for a given x by an informative classifier.
Shu (2016) also notes that the generative model for M2 is a mixture model ‘in disguise’. When implementing p_θ(x|y, z) in a neural network library, practitioners commonly choose for the neural network to have its input layer take the concatenation [z; y] of z and y. Consider W[z; y], the product of the weight matrix of the first layer of the network with its concatenated input. Writing the weight matrix as two blocks W = [W_z W_y]:
W[z; y] = W_z z + W_y y.
Recalling that y is a one-hot vector and z is drawn from an isotropic unit Gaussian, this gives us a mixture of Gaussians in the first-layer pre-activations: W_y y simply selects a class-specific offset. Even though we have, in p_θ(x|y, z), one of the most common statistical models, a Gaussian mixture model, we do not treat it in a special manner to leverage this fact. This motivates Shu (2016) to present a model for unsupervised learning called the real-GM-VAE, which explicitly models z as a Gaussian mixture conditioned on y. It has the same forward model as VaDE Jiang et al. (2017), but uses the variational distributions of M2.
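A tiny NumPy illustration of this ‘mixture in disguise’ (the dimensions and matrices here are hypothetical): with a one-hot y, the product W_y y is exactly one column of W_y, so the first-layer pre-activation is a Gaussian whose mean is shifted by a class-specific offset:

```python
import numpy as np

rng = np.random.default_rng(1)
d_z, n_classes, d_h = 4, 3, 5  # illustrative sizes

W_z = rng.standard_normal((d_h, d_z))  # block acting on z
W_y = rng.standard_normal((d_h, n_classes))  # block acting on y

z = rng.standard_normal(d_z)   # z ~ N(0, I)
y = np.eye(n_classes)[2]       # one-hot encoding of class 2

h = W_z @ z + W_y @ y
# W_y @ y is exactly column 2 of W_y: a class-specific mean offset.
# Marginalising over classes, h is a mixture of Gaussians with means W_y[:, k].
```

So the class variable acts purely as a learned mean shift on an isotropic Gaussian, i.e. a Gaussian mixture that the model never exploits explicitly.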
These arguments carry over to ADGMs as well, as the ADGM has the same forward probabilistic model as M2 up to the addition of an auxiliary variable that does not interfere with the relevant structure.
5 Extending Deep Generative Models to Avoid Posterior Collapse in q(y|x)
5.1 Gaussian mixtures in the continuous latent space in Deep Generative Models
We propose two models, one extending from M2/real-GM-VAE of Shu (2016) and one from the ADGM of Maaløe et al. (2016), that can handle both unsupervised and semi-supervised learning, thus enabling us to capture semi-unsupervised learning. We call these models the Gaussian Mixture Deep Generative Model (GM-DGM) and the Auxiliary Gaussian Mixture Deep Generative Model (AGM-DGM) respectively. The generative model structure of each has a mixture of Gaussians in z conditioned on y. We demonstrate both the GM-DGM and AGM-DGM in unsupervised and semi-unsupervised settings, for MNIST and Fashion-MNIST.
5.2 Gaussian Mixture DGM
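As described in Section 5.1, the GM-DGM places a mixture of Gaussians in z conditioned on y, sharing its forward model with VaDE: y ~ Cat(π), z ~ N(μ_y, σ²_y), x ~ p(x|z). A minimal ancestral-sampling sketch (illustrative NumPy only: the dimensions are hypothetical and the decoder is a fixed random projection standing in for a trained network):

```python
import numpy as np

rng = np.random.default_rng(2)
n_classes, d_z, d_x = 10, 8, 784  # illustrative sizes

# Class-conditional Gaussian parameters of p(z|y); in the real model these
# come from a (very small) network taking the one-hot y as input.
mu_y = rng.standard_normal((n_classes, d_z))
log_var_y = np.zeros((n_classes, d_z))

def decode(z):
    # Stand-in for the decoder network p(x|z): a fixed random projection
    # squashed through a sigmoid to give Bernoulli means.
    W = rng.standard_normal((d_x, d_z)) * 0.1
    return 1.0 / (1.0 + np.exp(-(W @ z)))

y = rng.integers(n_classes)                                          # y ~ Cat(1/K)
z = mu_y[y] + np.exp(0.5 * log_var_y[y]) * rng.standard_normal(d_z)  # z ~ N(mu_y, var_y)
x = (rng.random(d_x) < decode(z)).astype(float)                      # x ~ Bernoulli(decode(z))
```

The only structural change from M2 is that p(z|y) replaces the shared isotropic prior p(z), making the latent mixture explicit rather than ‘in disguise’.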
5.3 Auxiliary Gaussian Mixture DGM
Here we move to a mixture of Gaussians in z for the ADGM, to obtain the Auxiliary Gaussian Mixture deep generative model, or AGM-DGM. The generative model for the data is:
p_θ(a, x, y, z) = p_θ(a|x, y, z) p_θ(x|z) p_θ(z|y) p(y).
See Fig. 2 for a graphical representation of our model.
5.4 Classifier for models with Auxiliary Variables
For the ADGM/AGM-DGM we approximate the classifier by Monte Carlo, marginalising over the auxiliary variable:
q_φ(y|x) ≈ (1/S) Σ_{s=1}^{S} q_φ(y|a^(s), x), with a^(s) ∼ q_φ(a|x).
Also in our objectives we have expectations over y. Using the above approximation to q_φ(y|x) for the auxiliary models, we then perform those calculations exactly.
6 Experiments
In our experiments we perform:
1) semi-unsupervised learning on Fashion-MNIST for both our models and the models M2 and ADGM as a baseline
2) clustering on Fashion-MNIST and MNIST for both our models, comparing against various published baselines.
For these experiments we must specify the dimensionality of z and the prior p(y). For both unsupervised and semi-unsupervised runs we augment y, adding extra classes to the ground-truth classes we know exist. In the unsupervised case we put a uniform prior on y.
6.1 Masking and Sparsity for Semi-unsupervised Experiments
For our semi-unsupervised runs on Fashion-MNIST, we mask out the labels of certain classes entirely and give sparse labels for the remaining classes. We mask out classes 0–4, and the remaining classes 5–9 are sparsely-labelled. We kept 20% of labels for classes 5–9 in the training data. For Fashion-MNIST with 60,000 training points this is 1200 labelled examples for each such class.
We divide our prior p(y) evenly between the ‘vacated’ classes and the classes we are adding.
6.2 Dimensionality of z
In our models we choose a relatively large dimensionality for z. This is partly because of the well-known challenge of posterior collapse in z, as mentioned in Sec 4, which means that only a proportion of units in the networks representing q_φ(z|·) give outputs significantly different to the prior. But it is also because in these mixture models z encodes both ‘style’ and class information, through the mean and variance of the forward model. In Appendix C we discuss the evidence for this.
6.3 Evaluating performance on unlabelled classes
As is common, to evaluate clustering performance on the labelled test set, we follow a test-time cluster-and-label approach: we attribute each learnt, unsupervised class to the most common ground-truth class within it at test time. From this we can calculate accuracy. As discussed above, the GM-DGM's forward model has previously been proposed for clustering.
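The cluster-and-label procedure described above can be sketched as follows (illustrative NumPy, not the paper's code; the toy labels are stand-in data): each predicted cluster is mapped to its majority ground-truth class, and accuracy is computed under that mapping:

```python
import numpy as np

def cluster_and_label_accuracy(y_true, y_cluster, n_clusters):
    """Map each cluster to its most common ground-truth class, then score."""
    mapping = {}
    for k in range(n_clusters):
        members = y_true[y_cluster == k]
        # Empty clusters get an arbitrary label; they contribute no correct hits.
        mapping[k] = np.bincount(members).argmax() if len(members) else 0
    y_pred = np.array([mapping[k] for k in y_cluster])
    return (y_pred == y_true).mean()

# Toy example: cluster 0 is purely class 1; cluster 1 is mostly class 0.
y_true = np.array([1, 1, 1, 0, 0, 0, 1])
y_cluster = np.array([0, 0, 0, 1, 1, 1, 1])
acc = cluster_and_label_accuracy(y_true, y_cluster, n_clusters=2)  # 6/7 correct
```

Note that this mapping uses test labels only for evaluation, never during training.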
To show that we achieve good separation in between classes in these models when trained unsupervised, we do a two-dimensional t-SNE embedding Van Der Maaten & Hinton (2008) of the mean of over the test set, coloured by ground truth class, Figure 3. This is from a GM-DGM with , the same run as used to make Figures C.5, C.6 in the appendix.
7.2 Semi-Unsupervised Runs
7.2.1 Fashion-MNIST Results
We show the results after semi-unsupervised learning for both our models and for models M2 and ADGM in Table 1. Recall that we are masking out all labels for classes 0–4, and keeping 20% of labels for classes 5–9. We can see that M2 and ADGM do not perform well on the unsupervised sub-problem. Our models do learn, and have similar accuracy for semi-supervised and unsupervised classes.
Figure 4 compares the confusion matrices for models M2 and AGM-DGM. We see here M2’s posterior collapse in the subspace of unsupervised classes in q_φ(y|x), and that our AGM-DGM model avoids this collapse. Instead it performs accurate classification of the test set with a learnt q_φ(y|x) that is confident for classes where the training data was from either unsupervised or semi-supervised classes. M2 and ADGM give similar confusion matrices to each other, as do GM-DGM and AGM-DGM. We note that our explicit mixture models have superior performance overall.
7.3 Unsupervised Runs
7.3.1 Fashion-MNIST Results
See Table 2 for the clustering performance of our models. As Fashion-MNIST is a relatively new dataset, fewer published baselines are available. This does not matter greatly for us, as the primary purpose of these results is to compare them to the performance obtained on the unsupervised sub-problem within semi-unsupervised learning.
For our models we see that the accuracy obtained on the unsupervised classes here is lower than the accuracy on the unsupervised classes within semi-unsupervised learning in Table 1
for the same model hyperparameters.
| Model | Test accuracy (%) |
| DEC Guo (2018) | 61.8 |
7.3.2 MNIST Results
Table 3 presents the clustering performance of our models compared against published baselines. We have divided the baselines into blocks by type of model: the first is VAE-like DGMs, the second is GAN-like DGMs, and the third is other deep clustering algorithms, with the best in each block in bold. For the AGM-DGM, training was unstable for some settings, sometimes showing posterior collapse in q_φ(y|x). The best run, from an AGM-DGM, reached an accuracy of 97.9, better than any run from the GM-DGM.
Our results show that we can reach test-set accuracy that is state-of-the-art for clustering MNIST with probabilistic deep generative models, though we do not outperform IMSAT Hu et al. (2017) and ACOL-GAR Kilinc & Uysal (2018) at this task.
| Model | Test accuracy (%) |
| VaDE Jiang et al. (2017) | |
| GMVAE Dilokthanakul et al. (2017) | 83.2 ± 3.8 |
| GMVAE Dilokthanakul et al. (2017) | 87.8 ± 5.3 |
| GMVAE Dilokthanakul et al. (2017) | 92.8 ± 1.6 |
| CatGAN Springenberg (2016) | 90.3 |
| AAE Makhzani et al. (2016) | 90.5 ± 2.2 |
| AAE Makhzani et al. (2016) | ± 1.1 |
| IMSAT Hu et al. (2017) | ± 0.4 |
| DEC Xie et al. (2016) | 84.3 |
| JULE Yang et al. (2016) | 96.1 |
| ACOL-GAR Kilinc & Uysal (2018) | 98.3 ± 0.1 |
8 Conclusion
We introduced semi-unsupervised learning, a regime that requires a model to be able to cluster while also learning from semi-supervised data.
We presented two models that can do this, each made by changing previous DGMs so as to enable them to perform both clustering and semi-supervised learning jointly.
We do this by making DGMs that explicitly have a mixture model in their latent space. We have demonstrated that our models can learn in the semi-unsupervised regime.
Both the unsupervised and semi-supervised classes show increases in accuracy. Of particular interest is that the accuracy obtained on the unsupervised classes within the semi-unsupervised problem is higher than when training the model entirely unsupervised. This tells us that there has been internal transfer learning between the representations of the semi-supervised and unsupervised classes, improving the results of the latter. The models are effectively leveraging the limited label data to improve the representations for all classes.
Despite the superior published semi-supervised performance of ADGM against M2, we do not see a significant difference between AGM-DGM and GM-DGM.
Further work could add to these models the various improvements available, as listed in Appendix A. Other DGMs that can learn in both required regimes could also be explored under semi-unsupervised learning, such as GAN-based methods like CatGANs and AAEs. We hope that this new learning regime is further studied, as algorithms that can work within it could be of great use.
We would like to thank Raza Habib, Aiden Doherty and Rui Shu for their useful discussion. We also thank Miguel Morin for help with running experiments.
References
- Agakov & Barber (2004) Agakov, F. V. and Barber, D. An Auxiliary Variational Method. In NeurIPS, 2004.
- Alemi et al. (2018) Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. Fixing a Broken ELBO. ICML, 2018. URL http://proceedings.mlr.press/v80/alemi18a/alemi18a.pdf.
- Burda et al. (2016) Burda, Y., Grosse, R., and Salakhutdinov, R. Importance Weighted Autoencoders. In ICLR, 2016. URL https://arxiv.org/pdf/1509.00519.pdf.
- Chen et al. (2016) Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational Lossy Autoencoder. In ICLR, 2016. URL https://arxiv.org/pdf/1611.02731.pdf.
- Cook et al. (2013) Cook, D., Feuz, K. D., and Krishnan, N. C. Transfer learning for activity recognition: A survey. Knowledge and Information Systems, 36(3):537–556, 2013. ISSN 02191377. doi: 10.1007/s10115-013-0665-3. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3768027/pdf/nihms490006.pdf.
- Dilokthanakul et al. (2017) Dilokthanakul, N., Mediano, P. A. M., Garnelo, M., Lee, M. C. H., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep Unsupervised Clustering with Gaussian Mixture VAE. CoRR, 2017. URL https://arxiv.org/pdf/1611.02648v2.pdf.
- Germain et al. (2015) Germain, M., Gregor, K., Murray, I., and Larochelle, H. MADE: Masked Autoencoder for Distribution Estimation Google DeepMind. In ICML, 2015. URL https://arxiv.org/pdf/1502.03509.pdf.
- Gordon & Hernández-Lobato (2017) Gordon, J. and Hernández-Lobato, J. M. Bayesian Semisupervised Learning with Deep Generative Models. In ICML Workshop on Principled Approaches to Deep Learning, 2017. URL https://arxiv.org/pdf/1706.09751.pdf.
- Guo (2018) Guo, X. DEC, 2018. URL https://github.com/XifengGuo/DEC-keras.
- Hu et al. (2017) Hu, W., Miyato, T., Tokui, S., Matsumoto, E., and Sugiyama, M. Learning Discrete Representations via Information Maximizing Self-Augmented Training. In ICML, 2017. URL https://arxiv.org/pdf/1702.08720.pdf.
- Jiang et al. (2017) Jiang, Z., Zheng, Y., Tan, H., Tang, B., and Zhou, H. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In IJCAI, 2017. URL https://arxiv.org/pdf/1611.05148.pdf.
- Kilinc & Uysal (2018) Kilinc, O. and Uysal, I. Learning Latent Representations in Neural Networks for Clustering Through Pseudo Supervision and Graph-based activity Regularization. In ICLR, 2018.
- Kingma & Lei Ba (2015) Kingma, D. P. and Lei Ba, J. Adam: A Method for Stochastic Optimisation. In ICLR, 2015. URL https://arxiv.org/pdf/1412.6980.pdf.
- Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In NeurIPS, 2013. ISBN 1312.6114v10. doi: 10.1051/0004-6361/201527329. URL http://arxiv.org/abs/1312.6114.
- Kingma et al. (2014) Kingma, D. P., Rezende, D. J., Mohamed, S., and Welling, M. Semi-Supervised Learning with Deep Generative Models. In Advances in Neural Information Processing Systems (NIPS), 2014. URL https://arxiv.org/pdf/1406.5298.pdf.
- Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improving Variational Inference with Inverse Autoregressive Flow. In NeurIPS, 2016. ISBN 9781611970685. URL https://arxiv.org/pdf/1606.04934.pdf.
- Maaløe et al. (2016) Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. Auxiliary Deep Generative Models. In ICML, 2016. ISBN 9781424444205. doi: 10.1109/ICCV.2009.5459469. URL https://arxiv.org/pdf/1602.05473.pdf.
- Maaløe et al. (2017) Maaløe, L., Fraccaro, M., and Winther, O. Semi-Supervised Generation with Cluster-aware Generative Models. CoRR, 2017. URL https://arxiv.org/pdf/1704.00637.pdf.
- Makhzani et al. (2016) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversial Autoencoders. In ICLR, 2016. ISBN 0928-4931. doi: 10.1016/j.msec.2012.07.027. URL https://arxiv.org/pdf/1511.05644.pdf.
- Phuong et al. (2018) Phuong, M., Welling, M., Kushman, N., Tomioka, R., and Nowozin, S. The Mutual Autoencoder: Controlling Information in Latent Code Representations. Technical report, 2018.
- Pu et al. (2016) Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., and Carin, L. Variational Autoencoder for Deep Learning of Images, Labels and Captions. In NeurIPS, 2016. URL https://arxiv.org/pdf/1609.08976.pdf.
- Rainforth et al. (2018) Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. Tighter Variational Bounds are Not Necessarily Better, 2018. ISSN 1938-7228. URL https://arxiv.org/pdf/1802.04537.pdfhttp://arxiv.org/abs/1802.04537.
- Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational Inference with Normalizing Flows. In ICML, 2015. URL http://proceedings.mlr.press/v37/rezende15.pdf.
- Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, 2014. URL https://arxiv.org/pdf/1401.4082.pdf.
- Salimans et al. (2015) Salimans, T., Kingma, D., and Welling, M. Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. In ICML, 2015. URL http://proceedings.mlr.press/v37/salimans15.pdf.
- Shu (2016) Shu, R. Gaussian Mixture VAE: Lessons in Variational Inference, Generative Models, and Deep Nets, 2016. URL http://ruishu.io/2016/12/25/gmvae/.
- Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., Winther, O., and Dk, O. Ladder Variational Autoencoders. In NeurIPS, 2016. URL https://arxiv.org/pdf/1602.02282.pdf.
- Springenberg (2016) Springenberg, J. T. Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks. In ICLR, 2016. ISBN 0022-1007 (Print)r0022-1007 (Linking). URL https://arxiv.org/pdf/1511.06390.pdf.
- van den Oord et al. (2016a) van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel Recurrent Neural Networks. In ICML, 2016a. URL https://arxiv.org/pdf/1601.06759.pdf.
- van den Oord et al. (2016b) van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. Conditional Image Generation with PixelCNN Decoders. In NeurIPS, 2016b. URL http://arxiv.org/abs/1606.05328.
- van den Oord et al. (2017) van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural Discrete Representation Learning. NeurIPS, 2017. URL https://arxiv.org/pdf/1711.00937.pdf.
- Van Der Maaten & Hinton (2008) Van Der Maaten, L. and Hinton, G. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf.
- Weiss et al. (2016) Weiss, K., Khoshgoftaar, T. M., and Wang, D. D. A survey of transfer learning. Journal of Big Data, 3(1), 2016. ISSN 21961115. doi: 10.1186/s40537-016-0043-6. URL https://core.ac.uk/download/pdf/81905331.pdf.
- Xian et al. (2018) Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly, 2018. ISSN 01628828. URL https://arxiv.org/pdf/1707.00600.pdf.
- Xie et al. (2016) Xie, J., Girshick, R., and Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. ICML, 2016. URL https://arxiv.org/pdf/1511.06335.pdf.
- Yang et al. (2016) Yang, J., Parikh, D., Batra, D., and Tech, V. Joint Unsupervised Learning of Deep Representations and Image Clusters. CVPR, 2016. URL https://arxiv.org/pdf/1604.03628.pdf.
Appendix A Improvements to VAE-derived models
Secondly, there are numerous methods for improving the expressiveness of the posterior latent distribution using a normalising flow Rezende & Mohamed (2015), such as an inverse autoregressive flow Kingma et al. (2016).
Thirdly, one can have instead a discrete latent variable, such as that in the VQ-VAE van den Oord et al. (2017) which is acquired from looking-up a differentiable embedding.
And finally, it is possible to improve the performance of deep models in general by using more sophisticated networks. Restricting ourselves to VAE-based models: CNNs, as used in Dilokthanakul et al. (2017); Kilinc & Uysal (2018); Salimans et al. (2015); ResNets Kingma et al. (2016); and recurrent neural networks (RNNs) Pu et al. (2016) can all be used. Variational Lossy Autoencoders Chen et al. (2016) give us a principled way to combine autoregressive decoders, such as an RNN, PixelRNN/CNN van den Oord et al. (2016a, b), or MADE Germain et al. (2015), with VAEs, without the decoder modelling all latent structure internally. Relatedly, there have been recent advances in understanding the collapse of the variational posterior when using such powerful networks, through considering the mutual information between x and z Phuong et al. (2018); Alemi et al. (2018).
Appendix B Model Implementation and Data Preprocessing
Each distribution is parameterised by a neural network. Networks are small MLPs with two hidden layers, except for p_θ(z|y), which has no hidden layers: it simply maps from a one-hot encoding of y to the component mean and variance. For networks representing Gaussian distributions, z for M2/GM-DGM and z and a for ADGM/AGM-DGM, the networks for the mean and the variance are shared up to the second hidden layer, each having its own output layer.
Between models, identical model architectures were used for networks with the same inputs and outputs. Kernel initialisation was from a Gaussian distribution with standard deviation of. Biases were initialised with zeros. Weights were regularised via a Gaussian prior as in Kingma et al. (2014). Our code is based on the template code associated with Gordon & Hernández-Lobato (2017).
We perform stochastic gradient descent to maximise the objectives in each case. We used Adam Kingma & Lei Ba (2015), with default moment parameters and a constant learning rate. For the objectives of these models we must approximate the various expectations taken with respect to the continuous latent variables, which we do using the reparameterisation trick, taking MC samples in each case. The batch size was 100 for both labelled and unlabelled data. We trained for up to 2000 epochs.
For both MNIST and Fashion-MNIST we kept only dimensions of the data with a standard deviation greater than 0.1, and treated the resulting masked greyscale images as representing independent Bernoulli distributions. We then binarised this data at train time, taking a draw to represent each image in a batch.
Appendix C Activity of units
In Figure C.5 below we show the average over the test set of the per-unit divergence
KL(q_φ(z_i|x) ‖ p_θ(z_i|y)),
where i indexes over nodes in z. We do this for a GM-DGM trained unsupervised, broken out by learnt, unsupervised class, with line colouring indicating the ground-truth class.
Even though z has many units, only the same small number have non-zero KL. Further, in most dimensions the distributions of the cluster components in the generative model are very similar. We can test this via the KL divergences between each ordered pair of p_θ(z_i|y) distributions, plotted in Figure C.6. Other than the same few units which we can see are active in Figure C.5, the divergences are small. From this we can see that, for the remaining units, p_θ(z_i|y) is essentially the same over all classes. This would make sense if the active, class-varying, units are encoding class information, and the remaining units, with the same distribution regardless of class, are encoding ‘style’ information.
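This per-unit comparison can be sketched with the closed-form KL divergence between two univariate Gaussians (illustrative NumPy; the component means and variances below are hypothetical stand-ins for the learnt p(z_i|y) parameters, not values from the paper):

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    # KL( N(mu1, var1) || N(mu2, var2) ), computed elementwise over units.
    return 0.5 * np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / (2 * var2) - 0.5

# Two hypothetical class components that differ only in the first unit.
mu_a = np.array([2.0, 0.0, 0.0])
mu_b = np.array([-2.0, 0.0, 0.0])
var_a = np.ones(3)
var_b = np.ones(3)

kl = kl_gauss(mu_a, var_a, mu_b, var_b)
# Only the class-varying unit has a large divergence; the rest are zero,
# consistent with those units encoding class-independent 'style'.
```

Units whose pairwise divergences are near zero across all class pairs carry no class information in the forward model.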
We find that with both GM-DGM and AGM-DGM we get better performance by increasing the number of extra dimensions in y, and having the dimensionality of z be 200 or 300.