1 Introduction
Traditionally, classification tasks can be divided into supervised learning, if we have fully labelled data; semi-supervised learning, if that labelling is sparse; or clustering via unsupervised learning, if there is no labelled data at all. Semi-supervised learning is common, as in many problem domains we have some labelled data and a much larger amount of unlabelled data.
In reality, however, our sparsely-labelled dataset may contain classes of data that are entirely unlabelled: there are no labelled exemplars of them, only unlabelled instances. This can be due to selection bias, where the labelled data is drawn from a biased sample of the overall data distribution, and rare class categories may be entirely unobserved in the labelled dataset, appearing only in the unlabelled data.
A hypothetical example of a dataset of this type is as follows. Consider a set of medical images, such as scans of tumours, for which we obtain ground-truth labels giving the variety of tumour for some small proportion of all the scans. Imagine that this smaller labelled dataset does not happen to capture all distinct types of tumour. An unlabelled image could then be from one of the varieties captured in the labelled dataset, or it could be of another variety entirely. We are not in the semi-supervised regime, but nor do we want to treat the problem as unsupervised and discard our limited label data.
Naïvely applying semi-supervised learning algorithms to this data will attribute all data in the test set merely to the classes represented in the labelled dataset. If we attempt to solve this problem by expanding the dimensionality of the discrete label space, we find that some deep probabilistic generative semi-supervised learning algorithms will not make good use of these additional components: classes of data found only in the unlabelled dataset are not separated out. This is because these models cannot perform clustering, even when clustering is a subproblem. For those classes of data not found in the labelled training set, we must perform unsupervised learning.
We are interested in this case, where an unlabelled instance of data could be from one of the sparsely-labelled classes or from an entirely-unlabelled class. We call this semi-unsupervised learning: we jointly perform semi-supervised learning on sparsely-labelled classes and unsupervised learning on completely unlabelled classes. This requires a model that can learn successfully in both unsupervised and semi-supervised regimes. We build two new models out of two deep generative models previously proposed for semi-supervised learning; the new models can learn in the unsupervised case as well as in the semi-supervised case.
Semi-unsupervised learning has similarities to some varieties of zero-shot learning (ZSL), where deep generative models have been of interest Weiss et al. (2016), but in zero-shot learning one has access to auxiliary side information (commonly an 'attribute vector') for data at training time, which we do not. Our regime is thus related to transductive generalised ZSL, but with no side information Xian et al. (2018). It also has links to transfer learning Cook et al. (2013). We are solving two tasks simultaneously: the first is learning our model semi-supervised over the sparsely-labelled classes; the second is performing clustering over the entirely unlabelled classes. The representations in the learnt continuous latent space are shared between these two tasks. However, we do not have the usual separation between 'source' and 'target' domains in our problem specification. Because an unlabelled data point could be from any class, be it one represented in the ultra-sparse labelled dataset or one not found there, we learn to perform these tasks jointly from all available data in one campaign.

We extend two semi-supervised deep generative models to enable them to learn in the unsupervised case. We do this by enforcing a mixture model in their continuous latent space. Our models are called GM-DGM and AGM-DGM.
We then demonstrate them in both semi-unsupervised and unsupervised regimes. Our models show that we can learn in the semi-unsupervised case, with accuracy higher than if we treated the problem as either semi-supervised or unsupervised.
Further, one of our models achieves state-of-the-art clustering on MNIST for probabilistic deep generative models.
2 Related Work
DGMs have been widely used for both unsupervised and semi-supervised learning. Numerous models build on Variational Autoencoders (VAEs) Kingma & Welling (2013); Rezende et al. (2014), a variety of deep generative model. For semi-supervised learning, there are the aforementioned models M2 Kingma et al. (2014) and ADGM Maaløe et al. (2016). Further, Maaløe et al. (2016) also propose the Skip Deep Generative Model (SDGM), which shows superior performance to an ADGM on some datasets.

For clustering, both VaDE Jiang et al. (2017) and GMVAE Dilokthanakul et al. (2017) extend VAEs with some form of mixture model in their learnt, continuous latent space. VaDE has the same forward model as the first model we will propose, but it uses Bayes' rule to define its classifier/variational posterior over labels, rather than having a separate network parameterising it. The GMVAE has a mixture of Gaussians in one of its stochastic layers, where this mixture is conditioned on another stochastic variable. The Cluster-aware Generative Model (CaGeM) Maaløe et al. (2017) can, like our models, learn in both unsupervised and semi-supervised regimes; however, its performance at clustering data into components corresponding to ground-truth classes is not reported.
Generative Adversarial Networks (GANs) have also been used for semi-supervised learning and clustering. Categorical Generative Adversarial Networks (CatGANs) Springenberg (2016) can learn in both regimes, as can the adversarial autoencoder (AAE) Makhzani et al. (2016), which combines a GAN with a probabilistic model.
Other deep clustering algorithms include IMSAT Hu et al. (2017), DEC Xie et al. (2016), JULE Yang et al. (2016) and ACOL-GAR Kilinc & Uysal (2018).
2.1 Improvements to VAE-derived models
There are various interlinking avenues that can be taken to improve the performance of VAE-like models. We list them in Appendix A. We do not avail ourselves of that work here, so as to clearly isolate the effect of changing the probabilistic structure of the generative model of our DGMs.
3 Deep Generative Models
There are different varieties of deep generative models: fully probabilistic deep generative models, which define a valid joint distribution over both observed data and latent variables, and other types that do not. We pursue the first type for the tasks at hand. In these models, we have a probabilistic graphical model in which the parameters of the distributions are themselves parameterised by neural networks. Due to the coherency of probabilistic modelling, these models can handle missing observations in a principled way. Further, within this framework we can perform partial conditioning to obtain distributions of importance to us.
3.1 Variational Autoencoder
The simplest deep generative model of this type is a variational autoencoder Rezende et al. (2014); Kingma & Welling (2013), the deep version of factor analysis. Here there is a continuous unobserved latent variable z and observed data x. The joint probability is

p_θ(x, z) = p_θ(x|z) p(z),

with p(z) = N(z; 0, I) and p_θ(x|z) = N(x; μ_θ(z), diag(σ²_θ(z))), where μ_θ and σ²_θ are each parameterised by neural networks with parameters θ. As exact inference for p_θ(z|x) is intractable, it is standard to perform stochastic amortised variational inference to obtain an approximation to the true posterior Rezende et al. (2014). To obtain a VAE, introduce a recognition network q_φ(z|x) = N(z; μ_φ(x), diag(σ²_φ(x))) (where μ_φ, σ²_φ are neural networks with parameters φ). Through joint optimisation over {θ, φ} using stochastic gradient descent we aim to find the point estimates of the parameters that maximise the evidence lower bound

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p(z)).

For the expectation over z in L we take Monte Carlo (MC) samples. To take derivatives through these samples w.r.t. φ we use the 'reparameterisation trick', rewriting a sample from q_φ(z|x) as a deterministic function of a sample from N(0, I):

z = μ_φ(x) + σ_φ(x) ⊙ ε,  ε ∼ N(0, I),   (1)

thus we can differentiate a sample w.r.t. φ, so we can differentiate our MC approximation w.r.t. {θ, φ}.
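The reparameterisation trick can be sketched in a few lines of NumPy. This is a minimal illustration of the sampling step only (names are illustrative, not the paper's code); in practice μ and log σ² would be outputs of the recognition network and the gradient would flow through them via autodiff.

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so that the sample
    is a deterministic, differentiable function of (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps

rng = np.random.default_rng(0)
z = reparameterise(np.zeros(3), np.zeros(3), rng)  # sample from N(0, I)
```

As the variance is driven to zero the sample collapses onto the mean, which is a quick sanity check on the implementation.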
3.2 Semi-supervised Learning with Deep Generative Models
To perform semi-supervised classification with a deep generative model, it is necessary to introduce a sometimes-observed discrete class variable y into the generative model and into the recognition networks. For a semi-supervised dataset the evidence lower bound for the model consists of two terms: one for our unlabelled data, where y is a latent variable to be inferred,

L_u(x) = E_{q_φ(y,z|x)}[log p_θ(x|y, z) + log p(y) + log p(z) − log q_φ(y, z|x)] ≤ log p_θ(x),   (2)

and the other for our labelled data, where y is observed,

L_l(x, y) = E_{q_φ(z|x,y)}[log p_θ(x|y, z) + log p(y) + log p(z) − log q_φ(z|x, y)] ≤ log p_θ(x, y).   (3)
First let us consider the model M2 from Kingma et al. (2014). Here:

p_θ(x, y, z) = p_θ(x|y, z) p(y) p(z),   (4)

q_φ(y, z|x) = q_φ(y|x) q_φ(z|x, y),   (5)

where:

p(z) = N(z; 0, I),  q_φ(z|x, y) = N(z; μ_φ(x, y), diag(σ²_φ(x, y))),   (6)

q_φ(y|x) = Cat(y; π_φ(x)),   (7)

and p(y) is the discrete prior on y. Via simple manipulation one can show that L_u(x) = Σ_y q_φ(y|x) L_l(x, y) + H(q_φ(y|x)).
Note that q_φ(y|x), which is to be our trained classifier at the end, only appears in L_u, so it would only be trained on unlabelled data. To remedy this for M2, Kingma et al. (2014), motivated by considering a Dirichlet hyperprior on π, add to the loss the cross entropy between the one-hot ground-truth label and q_φ(y|x), weighted by a factor α. So the overall objective with unlabelled data D_u and labelled data D_l is the sum of the evidence lower bounds for all data plus this classification loss:

J = Σ_{x∈D_u} L_u(x) + Σ_{(x,y)∈D_l} L_l(x, y) + α Σ_{(x,y)∈D_l} log q_φ(y|x).   (8)

Eq 8 is of particular importance to us as it is also the evidence lower bound for semi-unsupervised learning. There, however, the draws x ∈ D_u can have a corresponding class y that is never observed in D_l.
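Assembling the Eq 8 objective from per-datum terms can be sketched as follows. This is a hedged sketch, not the paper's implementation: `elbo_unlabelled` and `elbo_labelled` stand in for precomputed per-datum evidence lower bounds, and the α-weighted term is the labelled-data cross entropy that trains q(y|x).

```python
import numpy as np

def semi_unsupervised_objective(elbo_unlabelled, elbo_labelled,
                                q_y_given_x, y_onehot, alpha):
    """Sum of per-datum ELBOs plus an alpha-weighted log q(y|x) term
    on labelled data (i.e. minus a cross entropy). All arguments are
    illustrative stand-ins: ELBOs are arrays of per-datum bounds,
    q_y_given_x rows are predicted class distributions."""
    cross_entropy = -np.sum(y_onehot * np.log(q_y_given_x), axis=1)
    return (np.sum(elbo_unlabelled) + np.sum(elbo_labelled)
            - alpha * np.sum(cross_entropy))
```

The objective is maximised, so a confident, correct classifier on labelled data contributes a less negative cross-entropy term.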
3.3 Auxiliary Deep Generative Models
Agakov & Barber (2004) introduce a method to obtain a richer variational distribution. They add an auxiliary latent variable a, which enters the joint distribution as a conditional distribution given all other variables of the model:

p_θ(a, x, z) = p_θ(a|x, z) p_θ(x, z).   (9)

By construction the original model is obtained when a is marginalised out: ∫ p_θ(a, x, z) da = p_θ(x, z). This auxiliary latent variable enables us to define a more expressive set of variational distributions.
3.3.1 Motivation for Auxiliary Variables
By adding the auxiliary variable a we can now obtain a richer family of mappings between x, y and z in our variational posterior, due to their connection through a. Consider the approximate marginal distributions for the latent variables in this model, both for when y is latent and when y is observed:

q_φ(y, z|x) = ∫ q_φ(z|a, x, y) q_φ(y|a, x) q_φ(a|x) da,   (10)

q_φ(z|x, y) = ∫ q_φ(z|a, x, y) q_φ(a|x) da.   (11)
3.3.2 Combining Auxiliary Variables with DGMs
Inserting a into model M2 gives us the semi-supervised Auxiliary Deep Generative Model (ADGM) of Maaløe et al. (2016):

p_θ(a, x, y, z) = p_θ(a|x, y, z) p_θ(x|y, z) p(y) p(z),   (12)

where:

p_θ(a|x, y, z) = N(a; μ_θ(x, y, z), diag(σ²_θ(x, y, z))),   (13)

p(z) = N(z; 0, I),   (14)

p(y) = Cat(y; π),   (15)

p_θ(x|y, z) = f(x; y, z, θ).   (16)

Here f is Gaussian or Bernoulli depending on whether x is continuous or discrete. The inference model is:

q_φ(a, y, z|x) = q_φ(z|a, x, y) q_φ(y|a, x) q_φ(a|x),   (17)

where:

q_φ(a|x) = N(a; μ_φ(x), diag(σ²_φ(x))),   (18)

q_φ(y|a, x) = Cat(y; π_φ(a, x)),   (19)

q_φ(z|a, x, y) = N(z; μ_φ(a, x, y), diag(σ²_φ(a, x, y))).   (20)
As in Section 3.2 we have two forms for the evidence lower bound. When y is latent:

L_u(x) = E_{q_φ(a,y,z|x)}[log p_θ(a, x, y, z) − log q_φ(a, y, z|x)],   (21)

and when y is observed:

L_l(x, y) = E_{q_φ(a,z|x,y)}[log p_θ(a, x, y, z) − log q_φ(a, z|x, y)].   (22)
4 Posterior Collapse in Unsupervised and Semi-unsupervised Learning
For both models, M2 and ADGM, when there is no label data at all, when we are just optimising Σ_x L_u(x), the model can fail to learn an informative distribution for q_φ(y|x) (this effect is also discussed in Dilokthanakul et al. (2017)). We have found that this collapse also takes place over the subspace of unlabelled classes when carrying out semi-unsupervised learning: q_φ(y|x) often collapses to the prior p(y).
The equivalent posterior collapse observed in z is well studied, see Burda et al. (2016); Kingma et al. (2016); Maaløe et al. (2017); Sønderby et al. (2016); Chen et al. (2016): the variational posterior for z closely matches the prior on z. This is most acute when the encoder and decoder networks are highly expressive, for instance a deep autoregressive model such as a PixelRNN/CNN van den Oord et al. (2016a, b).

To understand posterior collapse in q_φ(y|x), let us write out Eq 2 for M2 in an expanded form:

L_u(x) = E_{q_φ(y,z|x)}[log p_θ(x|y, z)] − KL(q_φ(y|x) || p(y)) − E_{q_φ(y|x)}[KL(q_φ(z|x, y) || p(z))].   (24)
We can see that in maximising L_u we are minimising the divergence between q_φ(y|x) and p(y). Commonly p(y) is taken to be uniform, and we find that q_φ(y|x) becomes uniform too during training. However, it seems plausible that the local minima associated with having q_φ(y|x) = p(y) could be escaped by achieving better reconstruction (maximising the expected log likelihood) through an informative, non-degenerate q_φ(y|x).
Shu (2016) sheds further light on the posterior collapse of q_φ(y|x) for model M2. Reiterating the arguments, consider the variational posterior for z in M2:

q_φ(z|x) = Σ_y q_φ(y|x) q_φ(z|x, y).   (25)

If q_φ(y|x) is a confident classifier, that is its entropy is low, then q_φ(z|x) is dominated by one Gaussian component; in the limit of a highly confident classifier it reduces to a single Gaussian. However, if q_φ(y|x) outputs a high-entropy distribution, then q_φ(z|x) is a mixture of Gaussians with as many components as the number of classes. The claim is that this richer variational distribution, a mixture obtained by having an uninformative classifier, enables M2 to maximise its evidence lower bound better than having separate Gaussians, with a particular one chosen for a given x by an informative classifier.
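The effect of the classifier's confidence on the Eq 25 mixture can be made concrete with a one-dimensional sketch. The component means and variances below are illustrative numbers, not outputs of any trained network; the point is only that a confident q(y|x) leaves effectively one Gaussian, while a uniform one yields a genuine mixture.

```python
import numpy as np

def gaussian_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def q_z_given_x(z, class_probs, mus, sigmas):
    """q(z|x) = sum_y q(y|x) q(z|x,y): one Gaussian component per class
    (1-D sketch with illustrative parameters)."""
    return sum(p * gaussian_pdf(z, m, s)
               for p, m, s in zip(class_probs, mus, sigmas))

# A confident classifier leaves essentially one component at the origin...
confident = q_z_given_x(0.0, [0.99, 0.01], [0.0, 5.0], [1.0, 1.0])
# ...while an uninformative classifier spreads mass over both components.
uniform = q_z_given_x(0.0, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```

Evaluated at the first component's mean, the confident posterior places roughly twice the density there that the uniform one does.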
Shu (2016) also notes that the generative model for M2 is a mixture model 'in disguise'. When implementing p_θ(x|y, z) in a neural network library, practitioners commonly choose for the network to have its input layer take the concatenation of y and z. Consider h = W[y; z], the product of the weight matrix of the first layer of the network with its concatenated input. Writing the weight matrix as two blocks:

h = [W_y  W_z][y; z]   (26)

= W_y y + W_z z.   (27)

Recalling that y is a one-hot vector and z is drawn from an isotropic unit Gaussian, this gives us a mixture of Gaussians in h. Even though we have in h one of the most common statistical models, a Gaussian mixture model, we do not treat it in a special manner to leverage this fact. This motivates Shu (2016) to present a model for unsupervised learning, called the real-GMVAE, that explicitly models z as a Gaussian mixture conditioned on y. It has the same forward model as VaDE Jiang et al. (2017), but uses the variational distributions of M2.

These arguments carry over to ADGMs as well, as the ADGM has the same forward probabilistic model as M2 up to the addition of an auxiliary variable that does not interfere with the relevant structure.
5 Extending Deep Generative Models to Avoid Posterior Collapse in q(y|x)
5.1 Gaussian mixtures in the continuous latent space of Deep Generative Models
We propose two models, one extending the M2/real-GMVAE of Shu (2016) and one the ADGM of Maaløe et al. (2016), that can handle both unsupervised and semi-supervised learning, thus enabling us to capture semi-unsupervised learning. We call these models the Gaussian Mixture Deep Generative Model (GM-DGM) and the Auxiliary Gaussian Mixture Deep Generative Model (AGM-DGM) respectively. The generative model of each has a mixture of Gaussians in z conditioned on y. We demonstrate both the GM-DGM and AGM-DGM in unsupervised and semi-unsupervised settings, for MNIST and Fashion-MNIST.
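Ancestral sampling from this kind of forward model, y → z → x, with a per-class Gaussian over z, can be sketched as follows. The decoder here is a toy stand-in for the generative network p(x|z), and all parameters are illustrative; the sketch shows only the probabilistic structure, not the paper's trained model.

```python
import numpy as np

def sample_gm_dgm(n, pi, mu, log_var, decoder, rng):
    """Ancestral sampling from a mixture-in-latent-space forward model:
    y ~ Cat(pi), z|y ~ N(mu_y, diag(exp(log_var_y))),
    x|z ~ Bernoulli(decoder(z)). `decoder` stands in for p(x|z)."""
    y = rng.choice(len(pi), size=n, p=pi)
    eps = rng.standard_normal((n, mu.shape[1]))
    z = mu[y] + np.exp(0.5 * log_var[y]) * eps
    px = decoder(z)                           # Bernoulli means per pixel
    x = (rng.random(px.shape) < px).astype(float)
    return y, z, x

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])                     # uniform prior over 2 classes
mu = np.array([[0.0, 0.0], [3.0, 3.0]])       # one latent mean per class
log_var = np.zeros((2, 2))
W = rng.standard_normal((2, 4))
decoder = lambda z: 1.0 / (1.0 + np.exp(-z @ W))  # toy stand-in for p(x|z)
y, z, x = sample_gm_dgm(5, pi, mu, log_var, decoder, rng)
```

Unlike M2, the class variable shapes z directly through the per-class mean and variance, so the latent space is explicitly a Gaussian mixture.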
5.2 Gaussian Mixture DGM

For the GM-DGM we replace the prior p(z) of M2 with a mixture of Gaussians conditioned on y. The generative model is:

p_θ(x, y, z) = p_θ(x|z) p_θ(z|y) p(y),   (28)

where:

p(y) = Cat(y; π),   (29)

p_θ(z|y) = N(z; μ_θ(y), diag(σ²_θ(y))),   (30)

p_θ(x|z) = f(x; z, θ),   (31)

with f a Gaussian or Bernoulli likelihood as before. The inference model is that of M2, Eq 5.
5.3 Auxiliary Gaussian Mixture DGM
Here we move to a mixture of Gaussians in z for the ADGM, to obtain the Auxiliary Gaussian Mixture deep generative model, or AGM-DGM. The generative model for the data is:

p_θ(a, x, y, z) = p_θ(a|x, y, z) p_θ(x|y, z) p_θ(z|y) p(y),   (32)

with generative networks as in Eqs 16, 29–31 and inference networks as in Eqs 17–20.
See Fig. 2 for a graphical representation of our model.
5.4 Classifier for models with Auxiliary Variables
For the ADGM/AGM-DGM we approximate the classifier as:

q_φ(y|x) = E_{q_φ(a|x)}[q_φ(y|a, x)] ≈ (1/S) Σ_s q_φ(y|a⁽ˢ⁾, x),  a⁽ˢ⁾ ∼ q_φ(a|x).   (33)

In our objectives we also have expectations over y. Using the above approximation to q_φ(y|x) for the auxiliary models, we then perform those calculations exactly.
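The Monte Carlo marginalisation of the classifier over the auxiliary variable can be sketched as below. Both conditionals are illustrative stand-ins for the trained networks q(a|x) and q(y|a, x); the function names are not from the paper's code.

```python
import numpy as np

def classify(x, sample_a, q_y_given_ax, n_samples, rng):
    """Monte Carlo estimate q(y|x) ~= (1/S) sum_s q(y | a_s, x),
    with a_s ~ q(a|x). `sample_a` draws a_s; `q_y_given_ax` returns a
    class distribution given (x, a)."""
    draws = [q_y_given_ax(x, sample_a(x, rng)) for _ in range(n_samples)]
    return np.mean(draws, axis=0)
```

Because each draw is a valid probability vector, their average is one too, so the estimate is a proper class distribution for any number of samples.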
6 Experiments
In our experiments we perform:
1) semi-unsupervised learning on Fashion-MNIST for both our models, with the models M2 and ADGM as baselines;
2) clustering on Fashion-MNIST and MNIST for both our models, comparing against various published baselines.
We also restrict ourselves to small dense neural networks. We describe our model implementation in detail and our data preprocessing method in Appendix B. Code is available on GitHub.
For these experiments we must specify the dimensionality of z and a, and the prior p(y). For both unsupervised and semi-unsupervised runs we augment y, adding extra classes beyond the ground-truth classes we know exist. In the unsupervised case we put a uniform prior on y.
Model  dim(z)  dim(a)  n_y  Overall  Semi-Sup  Un-Sup

Baselines
M2  300  –  15
ADGM  300  100  15
Our Models
GM-DGM  200  –  15
GM-DGM  200  –  30
GM-DGM  300  –  15
AGM-DGM  200  100  15
AGM-DGM  300  100  15
6.1 Masking and Sparsity for Semi-unsupervised Experiments
For our semi-unsupervised runs on Fashion-MNIST, we mask out the labels of certain classes entirely and give sparse labels for the remaining classes. We mask out classes 0–4, and the remaining classes 5–9 are sparsely-labelled: we keep only a small fraction of their labels in the training data. For Fashion-MNIST with 60,000 training points this is 1,200 labelled examples for each such class.
We divide up our prior p(y) evenly between the 'vacated' classes and the classes we are adding. This means we have:

p(y = i) = 1/10 for the sparsely-labelled classes i ∈ {5, …, 9},   (34)

p(y = i) = (1/2) / (n_y − 5) for the vacated and added classes.   (35)
6.2 Dimensionality of z
In our models we choose a relatively large dimensionality for z. This is partly because of the well-known challenge of posterior collapse in z, as discussed in Sec 4, which means that only a proportion of the units representing z give outputs significantly different to the prior. But it is also because in these mixture models z encodes both 'style' and class information, through the mean and variance of the forward model. In Appendix C we discuss the evidence for this.

6.3 Evaluating performance on unlabelled classes
As is common, to evaluate clustering performance on the labelled test set we follow a test-time cluster-and-label approach: we attribute each learnt, unsupervised class to the most common ground-truth class within it at test time. From this we can calculate accuracy. As discussed above, the GM-DGM forward model has previously been proposed for clustering.
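The cluster-and-label evaluation can be sketched directly. This is one plausible implementation of the majority-vote mapping described above (function name is ours, not from the paper's code):

```python
import numpy as np

def cluster_and_label_accuracy(pred_clusters, true_labels):
    """Map each learnt cluster to the most common ground-truth class
    among its members, then score the induced labelling as ordinary
    classification accuracy."""
    pred_clusters = np.asarray(pred_clusters)
    true_labels = np.asarray(true_labels)
    mapped = np.empty_like(true_labels)
    for c in np.unique(pred_clusters):
        members = true_labels[pred_clusters == c]
        mapped[pred_clusters == c] = np.argmax(np.bincount(members))
    return np.mean(mapped == true_labels)
```

For example, two clusters whose members are mostly classes 2 and 3 respectively get mapped to those classes, and any minority members count as errors.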
7 Results
7.1 Visualising z
To show that we achieve good separation in z between classes in these models when trained unsupervised, we compute a two-dimensional t-SNE embedding Van Der Maaten & Hinton (2008) of the mean of q_φ(z|x) over the test set, coloured by ground-truth class (Figure 3). This is from a GM-DGM, the same run as used to make Figures C.5 and C.6 in the appendix.
7.2 Semi-Unsupervised Runs
7.2.1 Fashion-MNIST Results
We show the results after semi-unsupervised learning for both our models and for models M2 and ADGM in Table 1. Recall that we are masking out all labels for classes 0–4, and keeping only a small fraction of labels for classes 5–9. We can see that M2 and ADGM do not perform well within the unsupervised subproblem. Our models do learn, and have similar accuracy for semi-supervised and unsupervised classes.
Figure 4 compares the confusion matrices for models M2 and AGM-DGM. We see here M2's posterior collapse in the subspace of unsupervised classes in q_φ(y|x), and that our AGM-DGM avoids this collapse; instead it performs accurate classification of the test set, with a learnt q_φ(y|x) that is confident for classes where the training data was from either unsupervised or semi-supervised classes. M2 and ADGM give similar confusion matrices to each other, as do GM-DGM and AGM-DGM. We note that our explicit mixture models have superior performance overall.
7.3 Unsupervised Runs
7.3.1 Fashion-MNIST Results
See Table 2 for the clustering performance of our models. As Fashion-MNIST is a relatively new dataset, the same baselines are not available as for MNIST. This does not matter for us, as the primary purpose of these results is to compare them to the performance obtained on the unsupervised subproblem within semi-unsupervised learning.
For our models we see that the accuracy obtained on the unsupervised classes here is lower than the accuracy on the unsupervised classes within semi-unsupervised learning in Table 1, for the same model hyperparameters.
Model  dim(z)  dim(a)  n_y  Accuracy

Baseline Model
DEC Guo (2018)  –  –  –  61.8
Our Models
GM-DGM  200  –  15
GM-DGM  200  –  30
GM-DGM  300  –  15
GM-DGM  300  –  30
AGM-DGM  200  100  15
AGM-DGM  200  100  30
AGM-DGM  300  100  15
7.3.2 MNIST Results
Table 3 presents the clustering performance of our models compared against published baselines. We have divided the baselines into blocks by type of model: the first is VAE-like DGMs, the second GAN-like DGMs, and the third other deep clustering algorithms, with the best in each block in bold. For the AGM-DGM, training was unstable for some settings, sometimes showing posterior collapse in q_φ(y|x). The best run, from an AGM-DGM with accuracy of 97.9, was better than any from the GM-DGM.
Our results show we can reach a test-set accuracy that is state-of-the-art for clustering MNIST with probabilistic deep generative models, though we do not outperform IMSAT Hu et al. (2017) and ACOL-GAR Kilinc & Uysal (2018) at this task.
Model  dim(z)  dim(a)  n_y  Accuracy

Baseline Models
VaDE Jiang et al. (2017)
GMVAE Dilokthanakul et al. (2017)  83.2 ± 3.8
GMVAE Dilokthanakul et al. (2017)  87.8 ± 5.3
GMVAE Dilokthanakul et al. (2017)  92.8 ± 1.6
CatGAN Springenberg (2016)  90.3
AAE Makhzani et al. (2016)  90.5 ± 2.2
AAE Makhzani et al. (2016)  ± 1.1
IMSAT Hu et al. (2017)  ± 0.4
DEC Xie et al. (2016)  84.3
JULE Yang et al. (2016)  96.1
ACOL-GAR Kilinc & Uysal (2018)  98.3 ± 0.1
Our Models
GM-DGM  200  –  15  91.1 ± 3.5
GM-DGM  200  –  30  ± 0.9
GM-DGM  300  –  15  94.5 ± 1.2
GM-DGM  300  –  30  95.7 ± 0.9
AGM-DGM  200  100  15  90.0 ± 3.9
AGM-DGM  200  100  30  95.7
8 Conclusion
We introduced semi-unsupervised learning, a regime that requires a model to be able to cluster while also learning from semi-supervised data.
We presented two models that can do this, each made by changing previous DGMs so as to enable them to perform both clustering and semisupervised learning jointly.
We do this by making DGMs that explicitly have a mixture model in their latent space.
We have demonstrated that our models can learn in the semi-unsupervised regime.
Both the unsupervised and semi-supervised classes show increases in accuracy. Of particular interest is that the accuracy obtained on the unsupervised classes within the semi-unsupervised problem is higher than when training the model entirely unsupervised. This tells us that there has been internal transfer learning between the representations of the semi-supervised and unsupervised classes, improving the results of the latter. The models are effectively leveraging the limited label data to improve the representations for all classes.
Despite the superior published semi-supervised performance of the ADGM relative to M2, we do not see a significant difference between AGM-DGM and GM-DGM.
Further work could add to these models the various improvements available, as listed in Appendix A. Other DGMs that can learn in both required regimes could also be explored under semi-unsupervised learning, such as GAN-based methods like CatGANs and AAEs. We hope that this new learning regime is further studied, as algorithms that can work within it could be of great use.
Acknowledgements
We would like to thank Raza Habib, Aiden Doherty and Rui Shu for their useful discussion. We also thank Miguel Morin for help with running experiments.
References
 Agakov & Barber (2004) Agakov, F. V. and Barber, D. An Auxiliary Variational Method. In NeurIPS, 2004.
 Alemi et al. (2018) Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. Fixing a Broken ELBO. ICML, 2018. URL http://proceedings.mlr.press/v80/alemi18a/alemi18a.pdf.
 Burda et al. (2016) Burda, Y., Grosse, R., and Salakhutdinov, R. Importance Weighted Autoencoders. In ICLR, 2016. URL https://arxiv.org/pdf/1509.00519.pdf.
 Chen et al. (2016) Chen, X., Kingma, D. P., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational Lossy Autoencoder. In ICLR, 2016. URL https://arxiv.org/pdf/1611.02731.pdf.
 Cook et al. (2013) Cook, D., Feuz, K. D., and Krishnan, N. C. Transfer learning for activity recognition: A survey. Knowledge and Information Systems, 36(3):537–556, 2013. ISSN 02191377. doi: 10.1007/s1011501306653. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3768027/pdf/nihms490006.pdf.
 Dilokthanakul et al. (2017) Dilokthanakul, N., Mediano, P. A. M., Garnelo, M., Lee, M. C. H., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep Unsupervised Clustering with Gaussian Mixture VAE. CoRR, 2017. URL https://arxiv.org/pdf/1611.02648v2.pdf.
 Germain et al. (2015) Germain, M., Gregor, K., Murray, I., and Larochelle, H. MADE: Masked Autoencoder for Distribution Estimation. In ICML, 2015. URL https://arxiv.org/pdf/1502.03509.pdf.
 Gordon & HernándezLobato (2017) Gordon, J. and HernándezLobato, J. M. Bayesian Semisupervised Learning with Deep Generative Models. In ICML Workshop on Principled Approaches to Deep Learning, 2017. URL https://arxiv.org/pdf/1706.09751.pdf.
 Guo (2018) Guo, X. DEC, 2018. URL https://github.com/XifengGuo/DECkeras.
 Hu et al. (2017) Hu, W., Miyato, T., Tokui, S., Matsumoto, E., and Sugiyama, M. Learning Discrete Representations via Information Maximizing SelfAugmented Training. In ICML, 2017. URL https://arxiv.org/pdf/1702.08720.pdf.
 Jiang et al. (2017) Jiang, Z., Zheng, Y., Tan, H., Tang, B., and Zhou, H. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. In IJCAI, 2017. URL https://arxiv.org/pdf/1611.05148.pdf.
 Kilinc & Uysal (2018) Kilinc, O. and Uysal, I. Learning Latent Representations in Neural Networks for Clustering Through Pseudo Supervision and Graphbased activity Regularization. In ICLR, 2018.
 Kingma & Lei Ba (2015) Kingma, D. P. and Lei Ba, J. Adam: A Method for Stochastic Optimisation. In ICLR, 2015. URL https://arxiv.org/pdf/1412.6980.pdf.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. AutoEncoding Variational Bayes. In NeurIPS, 2013. URL http://arxiv.org/abs/1312.6114.
 Kingma et al. (2014) Kingma, D. P., Rezende, D. J., Mohamed, S., and Welling, M. SemiSupervised Learning with Deep Generative Models. In Advances in Neural Information Processing Systems (NIPS), 2014. URL https://arxiv.org/pdf/1406.5298.pdf.
 Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improving Variational Inference with Inverse Autoregressive Flow. In NeurIPS, 2016. ISBN 9781611970685. URL https://arxiv.org/pdf/1606.04934.pdf.
 Maaløe et al. (2016) Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. Auxiliary Deep Generative Models. In ICML, 2016. URL https://arxiv.org/pdf/1602.05473.pdf.
 Maaløe et al. (2017) Maaløe, L., Fraccaro, M., and Winther, O. SemiSupervised Generation with Clusteraware Generative Models. CoRR, 2017. URL https://arxiv.org/pdf/1704.00637.pdf.
 Makhzani et al. (2016) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial Autoencoders. In ICLR, 2016. URL https://arxiv.org/pdf/1511.05644.pdf.
 Phuong et al. (2018) Phuong, M., Welling, M., Kushman, N., Tomioka, R., and Nowozin, S. The Mutual Autoencoder: Controlling Information in Latent Code Representations. Technical report, 2018.
 Pu et al. (2016) Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., and Carin, L. Variational Autoencoder for Deep Learning of Images, Labels and Captions. In NeurIPS, 2016. URL https://arxiv.org/pdf/1609.08976.pdf.
 Rainforth et al. (2018) Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. Tighter Variational Bounds are Not Necessarily Better, 2018. URL https://arxiv.org/pdf/1802.04537.pdf.
 Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational Inference with Normalizing Flows. In ICML, 2015. URL http://proceedings.mlr.press/v37/rezende15.pdf.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, 2014. URL https://arxiv.org/pdf/1401.4082.pdf.
 Salimans et al. (2015) Salimans, T., Kingma, D., and Welling, M. Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. In ICML, 2015. URL http://proceedings.mlr.press/v37/salimans15.pdf.
 Shu (2016) Shu, R. Gaussian Mixture VAE: Lessons in Variational Inference, Generative Models, and Deep Nets, 2016. URL http://ruishu.io/2016/12/25/gmvae/.
 Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder Variational Autoencoders. In NeurIPS, 2016. URL https://arxiv.org/pdf/1602.02282.pdf.
 Springenberg (2016) Springenberg, J. T. Unsupervised and Semisupervised Learning with Categorical Generative Adversarial Networks. In ICLR, 2016. URL https://arxiv.org/pdf/1511.06390.pdf.
 van den Oord et al. (2016a) van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel Recurrent Neural Networks. In ICML, 2016a. URL https://arxiv.org/pdf/1601.06759.pdf.
 van den Oord et al. (2016b) van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. Conditional Image Generation with PixelCNN Decoders. In NeurIPS, 2016b. URL http://arxiv.org/abs/1606.05328.
 van den Oord et al. (2017) van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural Discrete Representation Learning. NeurIPS, 2017. URL https://arxiv.org/pdf/1711.00937.pdf.
 Van Der Maaten & Hinton (2008) Van Der Maaten, L. and Hinton, G. Visualizing Data using tSNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf.
 Weiss et al. (2016) Weiss, K., Khoshgoftaar, T. M., and Wang, D. D. A survey of transfer learning. Journal of Big Data, 3(1), 2016. ISSN 21961115. doi: 10.1186/s4053701600436. URL https://core.ac.uk/download/pdf/81905331.pdf.
 Xian et al. (2018) Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. ZeroShot Learning  A Comprehensive Evaluation of the Good, the Bad and the Ugly, 2018. ISSN 01628828. URL https://arxiv.org/pdf/1707.00600.pdf.
 Xie et al. (2016) Xie, J., Girshick, R., and Farhadi, A. Unsupervised Deep Embedding for Clustering Analysis. ICML, 2016. URL https://arxiv.org/pdf/1511.06335.pdf.
 Yang et al. (2016) Yang, J., Parikh, D., Batra, D., and Tech, V. Joint Unsupervised Learning of Deep Representations and Image Clusters. CVPR, 2016. URL https://arxiv.org/pdf/1604.03628.pdf.
Appendix A Improvements to VAE-derived models
Firstly, one can make improvements in the tightness of the variational bound, such as Importance Weighted Autoencoder Burda et al. (2016), though tighter is not always better Rainforth et al. (2018).
Secondly, there are numerous methods for improving the expressiveness of the posterior latent distribution using a normalising flow Rezende & Mohamed (2015), such as an inverse autoregressive flow Kingma et al. (2016).
Thirdly, one can have instead a discrete latent variable, such as that in the VQVAE van den Oord et al. (2017) which is acquired from lookingup a differentiable embedding.
Fourthly, we can add more stochastic layers, such as in a Ladder VAE Sønderby et al. (2016) or stochastic ResNets Kingma et al. (2016).
And finally, it is possible to improve the performance of deep models in general by using more sophisticated networks. Restricting ourselves to VAE-based models: CNNs, as used in Dilokthanakul et al. (2017); Kilinc & Uysal (2018); Salimans et al. (2015); ResNets Kingma et al. (2016); and recurrent neural networks (RNNs) Pu et al. (2016) can all be used. Variational Lossy Autoencoders Chen et al. (2016) give us a principled way to combine autoregressive decoders, such as an RNN, PixelRNN/CNN van den Oord et al. (2016a, b), or MADE Germain et al. (2015), with VAEs without the decoder modelling all latent structure internally. Relatedly, there have been recent advances in understanding the collapse of the variational posterior when using such powerful networks, through considering the mutual information between x and z Phuong et al. (2018); Alemi et al. (2018).

Appendix B Model Implementation and Data Preprocessing
Each distribution is parameterised by a neural network. Networks are small MLPs with two hidden layers, except for the network parameterising the mixture components p_θ(z|y), which has no hidden layers: it simply maps from a one-hot encoding of y to μ_θ(y) and to σ²_θ(y). For networks representing Gaussian distributions, the networks for the mean and variance are shared up to the second hidden layer, each having its own output layer. Between models, identical architectures were used for networks with the same inputs and outputs. Kernels were initialised from a Gaussian distribution; biases were initialised to zero. Weights were regularised via a Gaussian prior as in Kingma et al. (2014). Our code is based on the template code associated with Gordon & Hernández-Lobato (2017).

We perform stochastic gradient ascent on the objectives in each case, using Adam Kingma & Lei Ba (2015) with default moment parameters. For the objectives of these models we must approximate the various expectations taken with respect to the continuous latent variables, which we do using the reparameterisation trick. The batch size was 100 for both labelled and unlabelled data. We trained for up to 2000 epochs.
For both MNIST and Fashion-MNIST we kept only dimensions of the data with a standard deviation greater than 0.1, and treated the resulting masked greyscale images as representing independent Bernoulli distributions. We then binarised this data at train time, taking a fresh draw to represent each image in a batch.
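This preprocessing can be sketched as follows. This is an assumed implementation of the two steps described above (dimension masking by standard deviation, then per-batch Bernoulli binarisation); the function name and signature are ours.

```python
import numpy as np

def preprocess(train_images, batch, rng, std_threshold=0.1):
    """Keep only pixel dimensions whose standard deviation over the
    training set exceeds the threshold, then draw a fresh Bernoulli
    binarisation of the batch, treating greyscale values in [0, 1]
    as independent Bernoulli means."""
    keep = train_images.std(axis=0) > std_threshold
    probs = batch[:, keep]
    return (rng.random(probs.shape) < probs).astype(float), keep

train = np.array([[0.0, 0.3], [1.0, 0.3]])  # second pixel is constant
batch = np.array([[1.0, 0.3]])
rng = np.random.default_rng(0)
x_bin, keep = preprocess(train, batch, rng)
```

Redrawing the binarisation each epoch acts as a mild form of data augmentation and keeps the Bernoulli likelihood well specified.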
Appendix C Activity of units
In Figure C.5 below we show the average over the test set of the per-unit divergence:

KL(q_φ(z_j|x) || p_θ(z_j|y)),   (36)

where j indexes over the nodes in z. We do this for a GM-DGM trained unsupervised, broken out by learnt, unsupervised class, with line colouring indicating the ground-truth class.
Even though z has many units, the same small number of them have non-zero KL. Further, in most dimensions the distributions of the cluster components in the generative model are very similar. We can test this via the KL divergences between each ordered pair of component distributions p_θ(z_j|y), plotted in Figure C.6. Other than the same units which we can see are active in Figure C.5, the divergences are small. From this we can see that, for the remaining units, p_θ(z_j|y) is essentially the same over all classes. This would make sense if the active, class-varying, units are encoding class information, and the remaining units, with the same distribution regardless of class, are encoding 'style' information.

We find that with both GM-DGM and AGM-DGM we get better performance by increasing the number of extra dimensions in y, and having the dimensionality of z be 200 or 300.