Semi-Unsupervised Learning with Deep Generative Models: Clustering and Classifying using Ultra-Sparse Labels

01/24/2019, by Matthew Willetts et al., University of Oxford

We introduce semi-unsupervised learning, an extreme case of semi-supervised learning with ultra-sparse categorisation where some classes have no labels in the training set. That is, in the training data some classes are sparsely labelled and the remaining classes appear only as unlabelled data. Many real-world datasets are plausibly of this type. We demonstrate that effective learning in this regime is only possible when a model is capable of both semi-supervised and unsupervised learning. We develop two deep generative models for classification in this regime that extend previous deep generative models designed for semi-supervised learning. By changing their probabilistic structure to contain a mixture of Gaussians in their continuous latent space, these new models can learn in both the unsupervised and the semi-unsupervised paradigms. We demonstrate their performance in both semi-unsupervised and unsupervised learning on various standard datasets. We show that our models can learn in a semi-unsupervised manner on Fashion-MNIST: we artificially mask out all labels for half of the classes of data and keep 2% of labels for the remaining classes, and our model learns effectively, obtaining a trained classifier with (77.2±1.3)% test set accuracy. We can also train on Fashion-MNIST fully unsupervised, obtaining (75.2±1.5)% test set accuracy. Additionally, training unsupervised on MNIST we obtain (96.3±0.9)% test set accuracy, which is state-of-the-art for fully probabilistic deep generative models.


1 Introduction

Traditionally, classification tasks can be divided into supervised learning, where we have fully labelled data, and semi-supervised learning, where the labelling is sparse. Alternatively, we can perform clustering via unsupervised learning when there is no labelled data at all. Semi-supervised learning is common: in many problem domains we have some labelled data, but the amount of unlabelled data is much larger.

However, in reality our sparsely-labelled dataset may contain classes of data that are entirely unlabelled: there are no labelled exemplars of them, only unlabelled instances. This can be due to selection bias, where the labelled data is drawn from a biased sample of the overall data distribution. Rare classes in particular might be entirely unobserved in the labelled dataset, appearing only in the unlabelled data.

A hypothetical example of a dataset of this type is as follows. Consider a set of medical images, such as scans of tumours, where we obtain ground-truth labels giving the variety of tumour for only a small proportion of all the scans we have. Imagine that we do not happen to capture all distinct types of tumour in this smaller labelled dataset. An unlabelled image could then be of one of the varieties captured in the labelled dataset, or it could be of another variety. We are not in the standard semi-supervised regime, but nor do we want to treat the problem as unsupervised and discard our limited labelled data.

Naïvely applying semi-supervised learning algorithms to this data will result in attributing all data in the test set merely to the classes represented in the labelled dataset. If we attempt to solve this problem by expanding the dimensionality of the discrete label space, we find that for some deep probabilistic generative semi-supervised learning algorithms the model does not make good use of these additional components: classes of data found only in the unlabelled dataset are not separated out. This is because these models cannot perform clustering, even when clustering is a sub-problem. For those classes of data not found in the labelled training set, we must perform unsupervised learning.

We are interested in this case, where an unlabelled instance of data could be from one of the sparsely-labelled classes or from an entirely-unlabelled class. We call this semi-unsupervised learning. Here we are jointly performing semi-supervised learning on sparsely-labelled classes, and unsupervised learning on completely unlabelled classes. This requires a model that can learn successfully in both unsupervised and semi-supervised regimes. We build two new models out of two previous deep generative models proposed for semi-supervised learning. The new models can learn in the unsupervised case as well as in the semi-supervised case.

Semi-unsupervised learning has similarities to some varieties of zero-shot learning (ZSL), where deep generative models have been of interest Weiss et al. (2016), but in zero-shot learning one has access to auxiliary side information (commonly an 'attribute vector') for data at training time, which we do not. Our regime is thus related to transductive generalised ZSL, but with no side information Xian et al. (2018). It also has links to transfer learning Cook et al. (2013). We are solving two tasks simultaneously: learning our model semi-supervised over the sparsely-labelled classes, and performing clustering over the entirely unlabelled classes. The representations in the learnt continuous latent space are shared between these two tasks. However, we do not have the usual separation between 'source' and 'target' domains in our problem specification. Because an unlabelled data point could be from any class, whether represented in the ultra-sparsely labelled dataset or not, we learn to perform both tasks jointly from all available data in one campaign.

We here extend two semi-supervised deep generative models to enable them to learn in the unsupervised case. We do this by enforcing a mixture model in their continuous latent space. Our models are called GM-DGM and AGM-DGM.

We then demonstrate them in both semi-unsupervised and unsupervised regimes. We show that our models can learn in the semi-unsupervised case, with higher accuracy than if we treated the problem as purely semi-supervised or purely unsupervised.

Further, one of our models achieves state-of-the-art clustering performance on MNIST among probabilistic deep generative models.

2 Related Work

Deep generative models (DGMs) have been widely used for both unsupervised and semi-supervised learning. Numerous models build on Variational Autoencoders (VAEs) Kingma & Welling (2013); Rezende et al. (2014), a variety of deep generative model. For semi-supervised learning there are the models M2 Kingma et al. (2014) and the Auxiliary Deep Generative Model (ADGM) Maaløe et al. (2016), both discussed further in Section 3. Further, Maaløe et al. (2016) also propose the Skip Deep Generative Model (SDGM), which shows superior performance to the ADGM on some datasets.

For clustering, both VaDE Jiang et al. (2017) and GM-VAE Dilokthanakul et al. (2017) extend VAEs with some form of mixture model in their learnt, continuous latent space. VaDE has the same forward model as the first model we will propose, but it uses Bayes’ rule to define its classifier/variational posterior over labels, rather than having a separate network parameterising it. The GM-VAE has a mixture of Gaussians in one of its stochastic layers, where this mixture is conditioned on another stochastic variable. The Cluster-aware Generative Model (CaGeM) Maaløe et al. (2017) can, like our models, learn in both unsupervised and semi-supervised regimes. However, the model’s performance at clustering data into components corresponding to ground-truth classes is not given.

Generative Adversarial Networks (GANs) have also been used to approach semi-supervised learning and clustering. Categorical Generative Adversarial Networks (CatGANs) Springenberg (2016) can learn in both regimes. The adversarial autoencoder (AAE) Makhzani et al. (2016), combining a GAN with a probabilistic model, can also learn in both regimes.

Other deep clustering algorithms include IMSAT Hu et al. (2017), DEC Xie et al. (2016), JULE Yang et al. (2016) and ACOL-GAR Kilinc & Uysal (2018).

2.1 Improvements to VAE-derived models

There are various interlinking avenues that can be taken to improve the performance of VAE-like models. We list them in Appendix A. We do not avail ourselves of that work for this paper, so as to clearly isolate the effect of changing the probabilistic structure of the generative model of our DGMs.

3 Deep Generative Models

There are different varieties of deep generative models: fully probabilistic deep generative models, which define a valid joint distribution over both observed data and latent variables, and other types that do not. We pursue the first type for the tasks at hand. In these models we have a probabilistic graphical model in which the parameters of the distributions are themselves parameterised by neural networks. Due to the coherency of probabilistic modelling, these models can handle missing observations in a principled way. Further, within this framework we can perform partial conditioning to obtain distributions of importance to us.

3.1 Variational Auto-Encoder

The simplest deep generative model of this type is the variational autoencoder (VAE) Rezende et al. (2014); Kingma & Welling (2013), the deep version of factor analysis. Here there is a continuous unobserved latent variable z and observed data x. The joint probability is

p_θ(x, z) = p_θ(x|z) p(z),

with p(z) = N(z; 0, I) and p_θ(x|z) a Gaussian or Bernoulli distribution whose parameters are each given by neural networks with parameters θ. As exact inference for p_θ(z|x) is intractable, it is standard to perform stochastic amortised variational inference to obtain an approximation to the true posterior Rezende et al. (2014).

To obtain a VAE, we introduce a recognition network q_φ(z|x) = N(z; μ_φ(x), diag(σ²_φ(x))), where μ_φ and σ_φ are neural networks with parameters φ. Through joint optimisation over {θ, φ} using stochastic gradient descent we aim to find the point estimates of the parameters that maximise the evidence lower bound

L(x) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)).

For the expectation over z in L we take Monte Carlo (MC) samples. To take derivatives through these samples w.r.t. φ we use the 'reparameterisation trick', rewriting a sample from a Gaussian as a deterministic function of a sample from N(0, I):

z = μ_φ(x) + σ_φ(x) ⊙ ε,   ε ∼ N(0, I),    (1)

thus we can differentiate a sample w.r.t. φ, and so differentiate our MC approximation of L w.r.t. {θ, φ}.
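As a concrete illustration, the following is a minimal sketch, in PyTorch, of a single-sample ELBO estimate using the reparameterisation of Eq 1. It is illustrative only, not the authors' implementation; the two-layer MLPs, the layer sizes and the Bernoulli likelihood are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, z_dim, h_dim = 784, 50, 256
enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 2 * z_dim))
dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

def elbo(x):
    # q_phi(z|x) = N(mu, diag(sigma^2)), both outputs from the encoder network
    mu, log_var = enc(x).chunk(2, dim=-1)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps        # Eq 1: reparameterised sample
    # Bernoulli decoder log-likelihood log p_theta(x|z)
    log_px = -F.binary_cross_entropy_with_logits(dec(z), x, reduction="none").sum(-1)
    # analytic KL( q_phi(z|x) || N(0, I) )
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1)
    return (log_px - kl).mean()

x = torch.rand(32, x_dim).round()   # toy binarised batch
loss = -elbo(x)                     # maximise the ELBO by minimising its negative
loss.backward()
```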

3.2 Semi-supervised Learning with Deep Generative Models

To perform semi-supervised classification with a deep generative model, it is necessary to introduce a sometimes-observed discrete class variable y into the generative model and into the recognition networks. For semi-supervised data the evidence lower bound for the model consists of two terms, one for our unlabelled data, where y is a latent variable to be inferred:

L_u(x) = E_{q_φ(y,z|x)}[log p_θ(x, y, z) − log q_φ(y, z|x)],    (2)

and the other for our labelled data, where y is observed:

L_l(x, y) = E_{q_φ(z|x,y)}[log p_θ(x, y, z) − log q_φ(z|x, y)].    (3)

First let us consider the model M2 from Kingma et al. (2014). Here:

p_θ(x, y, z) = p_θ(x|y, z) p(y) p(z),    (4)
q_φ(y, z|x) = q_φ(z|x, y) q_φ(y|x),    (5)

where:

q_φ(y|x) = Cat(y; π_φ(x)),    (6)
q_φ(z|x, y) = N(z; μ_φ(x, y), diag(σ²_φ(x, y))),    (7)

with p(z) = N(z; 0, I), p_θ(x|y, z) a Gaussian or Bernoulli distribution parameterised by a neural network, and p(y) = Cat(y; π) the discrete prior on y. Via simple manipulation one can show that L_u(x) = Σ_y q_φ(y|x) L_l(x, y) + H[q_φ(y|x)].

Note that q_φ(y|x), which is to be our trained classifier at the end, only appears in L_u, so it would only be trained on unlabelled data. To remedy this, in Kingma et al. (2014), motivated by considering a Dirichlet hyperprior on π, a term is added to the loss: the cross entropy between the one-hot ground-truth label and q_φ(y|x), weighted by a factor α. So the overall objective with unlabelled data D_u and labelled data D_l is the sum of the evidence lower bounds for all data plus this classification loss:

L = Σ_{x ∈ D_u} L_u(x) + Σ_{(x,y) ∈ D_l} [ L_l(x, y) + α log q_φ(y|x) ].    (8)

Eq 8 is of particular importance to us as it is also the evidence lower bound for semi-unsupervised learning. There, however, the draws from D_u can have a corresponding class that is never observed in D_l.
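Below is a minimal, self-contained sketch of how the objective in Eq 8 can be assembled for an M2-style model. It is not the authors' code: the MLP architectures, the Bernoulli likelihood, the uniform prior on y and the exact marginalisation over y in the unlabelled term are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, y_dim, z_dim, h_dim, alpha = 784, 10, 50, 256, 1.0

# q_phi(y|x), q_phi(z|x,y) and p_theta(x|y,z), each a small MLP (an assumption)
q_y_x  = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, y_dim))
q_z_xy = nn.Sequential(nn.Linear(x_dim + y_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 2 * z_dim))
p_x_yz = nn.Sequential(nn.Linear(y_dim + z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

def elbo_labelled(x, y_onehot):
    """L_l(x, y) of Eq 3 for M2: y observed, N(0, I) prior on z, uniform prior on y."""
    mu, log_var = q_z_xy(torch.cat([x, y_onehot], -1)).chunk(2, -1)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)       # reparameterised sample
    logits = p_x_yz(torch.cat([y_onehot, z], -1))
    log_px = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
    kl_z = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1)  # KL(q(z|x,y) || N(0,I))
    log_py = torch.log(torch.tensor(1.0 / y_dim))                   # uniform p(y)
    return log_px - kl_z + log_py

def elbo_unlabelled(x):
    """L_u(x) of Eq 2: marginalise y under q_phi(y|x) and add its entropy."""
    pi = F.softmax(q_y_x(x), -1)                                    # q_phi(y|x)
    per_class = []
    for k in range(y_dim):
        y_k = F.one_hot(torch.full((x.shape[0],), k, dtype=torch.long), y_dim).float()
        per_class.append(elbo_labelled(x, y_k))
    terms = torch.stack(per_class, -1)                              # (batch, y_dim)
    entropy = -(pi * torch.log(pi + 1e-10)).sum(-1)
    return (pi * terms).sum(-1) + entropy

def objective(x_u, x_l, y_l):
    """Eq 8: unlabelled ELBOs + labelled ELBOs + alpha * log q_phi(y|x) on labelled data."""
    y_onehot = F.one_hot(y_l, y_dim).float()
    class_term = (y_onehot * F.log_softmax(q_y_x(x_l), -1)).sum(-1)
    return (elbo_unlabelled(x_u).sum()
            + (elbo_labelled(x_l, y_onehot) + alpha * class_term).sum())

x_u = torch.rand(16, x_dim).round()                                 # toy binarised batches
x_l, y_l = torch.rand(8, x_dim).round(), torch.randint(0, y_dim, (8,))
loss = -objective(x_u, x_l, y_l)
loss.backward()
```

For semi-unsupervised learning the same objective applies unchanged; the labelled batch simply never contains examples of the masked classes.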

3.3 Auxiliary Deep Generative Models

Agakov & Barber (2004) introduce a method to obtain a richer variational distribution. They add an additional latent variable a, which enters the joint distribution as a conditional distribution given all other variables of the model:

p_θ(a, x, y, z) = p_θ(a|x, y, z) p_θ(x, y, z).    (9)

By construction the original model is obtained when a is marginalised out: ∫ p_θ(a|x, y, z) p_θ(x, y, z) da = p_θ(x, y, z). This auxiliary latent variable enables us to define a more expressive set of variational distributions.

3.3.1 Motivation for Auxiliary Variables

By adding the auxiliary variable a we can obtain a richer family of mappings between x, y and z in our variational posterior, due to their connection through a. Consider the approximate marginal distributions for the latent variable z in this model, both for when y is latent and when y is observed:

q_φ(z|x) = Σ_y ∫ q_φ(z|a, y, x) q_φ(y|a, x) q_φ(a|x) da,    (10)
q_φ(z|x, y) = ∫ q_φ(z|a, y, x) q_φ(a|x) da.    (11)

We can see that Eq 10 is in general a non-Gaussian distribution, recalling that its non-auxiliary counterpart, Eq 7, is Gaussian. Thus both Eqs 10 and 11 are richer than Eqs 5 and 7.
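As a toy numeric illustration of why such marginals are non-Gaussian: pushing an auxiliary variable through a nonlinear conditional already produces a clearly non-Gaussian marginal. The particular tanh conditional below is hypothetical and chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100_000)                       # a ~ N(0, 1)
z = np.tanh(3 * a) + 0.1 * rng.normal(size=a.shape)  # z|a ~ N(tanh(3a), 0.1^2)

# A Gaussian has excess kurtosis 0; this marginal is bimodal and far from Gaussian.
kurtosis = ((z - z.mean()) ** 4).mean() / z.var() ** 2 - 3
print(round(kurtosis, 2))                          # strongly negative, i.e. non-Gaussian
```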

3.3.2 Combining Auxiliary Variables with DGMs

Inserting a into model M2 gives us the semi-supervised Auxiliary Deep Generative Model (ADGM) of Maaløe et al. (2016):

p_θ(a, x, y, z) = p_θ(a|x, y, z) p_θ(x|y, z) p(y) p(z),    (12)

where:

p(z) = N(z; 0, I),    (13)
p(y) = Cat(y; π),    (14)
p_θ(x|y, z) = Bernoulli(x; π_θ(y, z)) or N(x; μ_θ(y, z), diag(σ²_θ(y, z))),    (15)
p_θ(a|x, y, z) = N(a; μ_θ(x, y, z), diag(σ²_θ(x, y, z))).    (16)

Here p_θ(x|y, z) is Gaussian or Bernoulli depending on whether x is continuous or discrete. The inference model is:

q_φ(a, y, z|x) = q_φ(z|a, y, x) q_φ(y|a, x) q_φ(a|x),    (17)

where:

q_φ(a|x) = N(a; μ_φ(x), diag(σ²_φ(x))),    (18)
q_φ(y|a, x) = Cat(y; π_φ(a, x)),    (19)
q_φ(z|a, y, x) = N(z; μ_φ(a, y, x), diag(σ²_φ(a, y, x))).    (20)

As in Sec 3.2 we have two forms for the evidence lower bound. When y is latent:

L_u(x) = E_{q_φ(a,y,z|x)}[log p_θ(a, x, y, z) − log q_φ(a, y, z|x)],    (21)

and when y is observed:

L_l(x, y) = E_{q_φ(a,z|x,y)}[log p_θ(a, x, y, z) − log q_φ(a, z|x, y)],    (22)

noting q_φ(a, z|x, y) = q_φ(z|a, y, x) q_φ(a|x). Our classifier is now:

q_φ(y|x) = ∫ q_φ(y|a, x) q_φ(a|x) da,    (23)

which can be approximated by taking MC samples from q_φ(a|x). The overall loss for the ADGM is of the same form as Eq 8, cf. Eqs 21, 22.
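A minimal sketch of approximating Eq 23 with MC samples from q_φ(a|x) follows; the network names and sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, a_dim, y_dim, h_dim, S = 784, 100, 10, 256, 10
q_a_x  = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 2 * a_dim))
q_y_ax = nn.Sequential(nn.Linear(a_dim + x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, y_dim))

def classify(x):
    """q(y|x) ~= (1/S) sum_s q(y|a_s, x), with a_s ~ q(a|x) via reparameterisation."""
    mu, log_var = q_a_x(x).chunk(2, -1)
    probs = 0.0
    for _ in range(S):
        a = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        probs = probs + F.softmax(q_y_ax(torch.cat([a, x], -1)), -1)
    return probs / S

print(classify(torch.rand(4, x_dim)).sum(-1))   # each row sums to 1
```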

4 Posterior Collapse in Unsupervised and Semi-unsupervised Learning

For both models, M2 and ADGM, when there is no labelled data at all, so that we are just optimising L_u, the model can fail to learn an informative distribution for y (this effect is also discussed in Dilokthanakul et al. (2017)). We have found that this collapse also takes place over the subspace of unlabelled classes when carrying out semi-unsupervised learning: q_φ(y|x) often collapses to the prior p(y).

The equivalent posterior collapse observed in z is well studied, see Burda et al. (2016); Kingma et al. (2016); Maaløe et al. (2017); Sønderby et al. (2016); Chen et al. (2016): the variational posterior for z closely matches the prior p(z). This is most acute when the encoder and decoder networks are highly expressive, for instance a deep autoregressive model such as a PixelRNN/CNN van den Oord et al. (2016a, b).

To understand posterior collapse in y, let us write out Eq 2 for M2 in an expanded form:

L_u(x) = E_{q_φ(y|x)}[ E_{q_φ(z|x,y)}[log p_θ(x|y, z)] − KL(q_φ(z|x, y) ‖ p(z)) ] − KL(q_φ(y|x) ‖ p(y)).    (24)

We can see that in maximising L_u we are minimising the divergence between q_φ(y|x) and p(y). Commonly p(y) is taken to be uniform, and we find that q_φ(y|x) becomes uniform too during training. However, it seems plausible that the local minima associated with having q_φ(y|x) ≈ p(y) could be escaped by achieving better reconstruction (maximising E[log p_θ(x|y, z)]) through an informative, non-degenerate q_φ(y|x).

Shu (2016) sheds further light on the posterior collapse of y for model M2. Reiterating the arguments, consider the variational posterior for z in M2 when y is unobserved:

q_φ(z|x) = Σ_y q_φ(y|x) q_φ(z|x, y).    (25)

If q_φ(y|x) is a confident classifier, that is its entropy is low, then q_φ(z|x) is dominated by one Gaussian component; in the limit of a highly confident classifier it reduces to a single Gaussian. However, if q_φ(y|x) outputs a high-entropy distribution, then q_φ(z|x) is a mixture of Gaussians with as many components as the number of classes. The claim is that this richer variational distribution, a mixture obtained by having an uninformative classifier, enables M2 to maximise its evidence lower bound more easily than having separate Gaussians, with a particular one chosen for a given x by an informative classifier.

Shu (2016) also notes that the generative model for M2 is a mixture model 'in disguise'. When we implement p_θ(x|y, z) in a neural network library, practitioners commonly choose for the neural network to have its input layer take the concatenation of y and z. Consider h = W[z; y], the product of the weight matrix of the first layer of the network with its concatenated input. Writing the weight matrix as two blocks:

W = [W_z, W_y],    (26)
h = W_z z + W_y y.    (27)

Recalling that y is a one-hot vector and z is drawn from an isotropic unit Gaussian, this gives us a mixture of Gaussians in h: one component per class, with mean given by the corresponding column of W_y and covariance W_z W_zᵀ. Even though we have in h one of the most common statistical models, a Gaussian mixture model, we do not treat it in a special manner to leverage this fact. This motivates Shu (2016) to present a model for unsupervised learning called the real-GM-VAE, which explicitly models z as a Gaussian mixture conditioned on y. It has the same forward model as VaDE Jiang et al. (2017), but uses the variational distributions of M2.
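A small numeric check of this 'mixture in disguise' argument (Eqs 26-27): with one-hot y and isotropic Gaussian z, the per-class means of the first-layer pre-activations are exactly the columns of W_y. The weights and dimensions below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim, y_dim, h_dim, n = 5, 3, 4, 200_000
W_z = rng.normal(size=(h_dim, z_dim))
W_y = rng.normal(size=(h_dim, y_dim))

k = rng.integers(0, y_dim, size=n)        # class of each sample
y = np.eye(y_dim)[k]                      # one-hot
z = rng.normal(size=(n, z_dim))           # isotropic unit Gaussian
h = z @ W_z.T + y @ W_y.T                 # Eq 27

# Per-class empirical means of h match the columns of W_y (the component means)
for c in range(y_dim):
    print(np.allclose(h[k == c].mean(0), W_y[:, c], atol=0.1))
```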

These arguments carry over to the ADGM as well, as it has the same forward probabilistic model as M2 up to the addition of an auxiliary variable that does not interfere with the relevant structure.

5 Extending Deep Generative Models to Avoid Posterior Collapse in q_φ(y|x)

5.1 Gaussian mixtures in the continuous latent space of Deep Generative Models

We propose two models, one extending M2/the real-GM-VAE of Shu (2016) and one extending the ADGM of Maaløe et al. (2016), that can handle both unsupervised and semi-supervised learning, thus enabling us to capture semi-unsupervised learning. We call these models the Gaussian Mixture Deep Generative Model (GM-DGM) and the Auxiliary Gaussian Mixture Deep Generative Model (AGM-DGM) respectively. The generative model of each has a mixture of Gaussians in z conditioned on y. We demonstrate both the GM-DGM and AGM-DGM in unsupervised and semi-unsupervised settings, for MNIST and Fashion-MNIST.

5.2 Gaussian Mixture DGM

Figure 1: Generative and inference models for the GM-DGM; the plate sizes are the number of unlabelled points and the number of labelled points respectively.

The generative model for the data is:

p_θ(x, y, z) = p_θ(x|z) p_θ(z|y) p(y),    (28)
p(y) = Cat(y; π),    (29)
p_θ(z|y) = N(z; μ_θ(y), diag(σ²_θ(y))),    (30)
p_θ(x|z) = Bernoulli(x; π_θ(z)) or N(x; μ_θ(z), diag(σ²_θ(z))).    (31)

We then perform amortised stochastic variational inference, with variational distributions as before for M2, Eqs 5-7, and the overall objective of Eq 8. See Fig 1 for a graphical representation of our model.
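A minimal sketch of the piece that changes relative to M2: the class-conditional prior p_θ(z|y) of Eq 30 and the analytic KL term against it. Following the implementation description in Appendix B, p_θ(z|y) is taken here to be a linear map from the one-hot y; everything else (names, sizes) is an illustrative assumption.

```python
import torch
import torch.nn as nn

y_dim, z_dim = 15, 200
# p_theta(z|y): linear maps from one-hot y to the component mean and log-variance
prior_mu      = nn.Linear(y_dim, z_dim, bias=False)
prior_log_var = nn.Linear(y_dim, z_dim, bias=False)

def kl_q_p(mu_q, log_var_q, y_onehot):
    """KL( N(mu_q, diag(exp(log_var_q))) || N(mu_p(y), diag(exp(log_var_p(y)))) )."""
    mu_p, log_var_p = prior_mu(y_onehot), prior_log_var(y_onehot)
    return 0.5 * (log_var_p - log_var_q
                  + (log_var_q.exp() + (mu_q - mu_p).pow(2)) / log_var_p.exp()
                  - 1).sum(-1)

mu_q, log_var_q = torch.zeros(4, z_dim), torch.zeros(4, z_dim)
y = torch.eye(y_dim)[torch.randint(0, y_dim, (4,))]
print(kl_q_p(mu_q, log_var_q, y).shape)   # per-example KL, shape (4,)
```

In the objective of Eq 8 this term simply replaces M2's KL against N(0, I).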

5.3 Auxiliary Gaussian Mixture DGM

Figure 2: Generative and inference models for the AGM-DGM; the plate sizes are the number of unlabelled points and the number of labelled points respectively.

Here we move to a mixture of Gaussians in z for the ADGM, to obtain the Auxiliary Gaussian Mixture deep generative model, or AGM-DGM. The generative model for the data is:

p_θ(a, x, y, z) = p_θ(a|x, y, z) p_θ(x|z) p_θ(z|y) p(y),    (32)

with generative networks as in Eqs 16 and 29-31, and inference networks as in Eqs 17-20. See Fig 2 for a graphical representation of our model.

5.4 Classifier for models with Auxiliary Variables

For the ADGM/AGM-DGM we approximate the classifier as:

q_φ(y|x) ≈ (1/S) Σ_{s=1}^{S} q_φ(y|a^(s), x),   a^(s) ∼ q_φ(a|x).    (33)

Our objectives also contain expectations over y. Using the above approximation to q_φ(y|x) for the auxiliary models, we then perform those sums over classes exactly.

6 Experiments

In our experiments we perform:
1) semi-unsupervised learning on Fashion-MNIST, for both our models and for the models M2 and ADGM as baselines;
2) clustering on Fashion-MNIST and MNIST for both our models, comparing against various published baselines.

We restrict ourselves to small dense neural networks. We describe our model implementation and our data preprocessing method in detail in Appendix B. Code is available on GitHub.

For these experiments we must specify the dimensionality of z (and of a where used) and the prior p(y). For both unsupervised and semi-unsupervised runs we augment y, adding extra classes to the ground-truth classes we know exist, so that dim(y) is greater than the number of ground-truth classes. In the unsupervised case we put a uniform prior on y.

Model      dim(z)  dim(a)  dim(y)    Overall    Semi-Sup    Un-Sup
Baselines
M2         300       —       15
ADGM       300      100      15
Our Models
GM-DGM     200       —       15
GM-DGM     200       —       30
GM-DGM     300       —       15
AGM-DGM    200      100      15
AGM-DGM    300      100      15
Table 1: Test set classification accuracy of GM-DGM and AGM-DGM trained semi-unsupervised on Fashion-MNIST for different hyper-parameters, over four runs. Results are also subdivided between semi-supervised (Semi-Sup) classes, 5-9, and unsupervised (Un-Sup) classes 0-4.

6.1 Masking and Sparsity for Semi-unsupervised Experiments

For our semi-unsupervised runs on Fashion-MNIST, we mask out the labels of certain classes entirely and give sparse labels for the remaining classes. We mask out classes 0-4, and the remaining classes, 5-9, are sparsely labelled: we keep 2% of the labels for these classes in the training data. For Fashion-MNIST with 60,000 training points this is 1200 labelled examples for each such class.
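A sketch of the corresponding label-masking step, with hypothetical variable names and random stand-in labels; only the masking logic (classes 0-4 fully unlabelled, 2% of labels kept for classes 5-9) reflects the setup described above.

```python
import numpy as np

rng = np.random.default_rng(0)
y_train = rng.integers(0, 10, size=60_000)   # stand-in for Fashion-MNIST labels

masked_classes = {0, 1, 2, 3, 4}             # entirely unlabelled classes
label_fraction = 0.02                        # fraction of labels kept per sparse class
is_labelled = np.zeros(len(y_train), dtype=bool)
for c in range(10):
    idx = np.flatnonzero(y_train == c)
    if c not in masked_classes:
        keep = rng.choice(idx, size=int(label_fraction * len(idx)), replace=False)
        is_labelled[keep] = True

print(is_labelled.sum(), "labelled examples kept")
```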

We divide the prior mass evenly between the 'vacated' (masked) classes and the extra classes we are adding, keeping the prior on the sparsely-labelled classes at its original value. This means we have:

p(y = k) = 1/10,   for the sparsely-labelled classes k ∈ {5, ..., 9},    (34)
p(y = k) = 1 / (2 (5 + n_extra)),   for the 5 masked classes and the n_extra added classes.    (35)


6.2 Dimensionality of z

In our models we choose a relatively large dimensionality for z. Partly this is because of the well-known challenge of posterior collapse in z, as mentioned in Sec 4, which means that only a proportion of the units in the networks representing q_φ(z|·) give outputs significantly different from the prior. But it is also because in these mixture models z encodes both 'style' and class information, through the mean and variance of the forward model. In Appendix C we discuss the evidence for this.

6.3 Evaluating performance on unlabelled classes

As is common, to evaluate clustering performance on the labelled test set we follow a test-time cluster-and-label approach: we attribute each learnt, unsupervised class to the most common ground-truth class within it at test time. From this we can calculate accuracy. As discussed above, the GM-DGM's forward model has previously been proposed for clustering.
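A minimal sketch of this cluster-and-label evaluation; the function and variable names are ours, not the authors'.

```python
import numpy as np

def cluster_and_label_accuracy(y_pred, y_true):
    """y_pred: learnt cluster indices (argmax of q(y|x)); y_true: ground-truth labels."""
    mapping = {}
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]
        mapping[c] = np.bincount(members).argmax()   # majority ground-truth class
    y_mapped = np.array([mapping[c] for c in y_pred])
    return (y_mapped == y_true).mean()

# toy example: clusters 0/1 mostly correspond to classes 1/0 respectively
y_pred = np.array([0, 0, 0, 1, 1, 1])
y_true = np.array([1, 1, 0, 0, 0, 0])
print(cluster_and_label_accuracy(y_pred, y_true))    # 5/6 ~= 0.83
```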

7 Results

7.1 Visualising z

To show that we achieve good separation in z between classes in these models when trained unsupervised, we compute a two-dimensional t-SNE embedding Van Der Maaten & Hinton (2008) of the mean of q_φ(z|x) over the test set, coloured by ground-truth class (Figure 3). This is from the same GM-DGM run as used to make Figures C.5 and C.6 in the appendix.

Figure 3: A 2D t-SNE embedding of the means of the approximate posterior q_φ(z|x) on the test set, coloured by ground-truth class. Best viewed in colour.
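A sketch of how such a visualisation can be produced with scikit-learn's t-SNE; the arrays below are random stand-ins for the posterior means and ground-truth labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

z_mu   = np.random.randn(2000, 200)         # stand-in for posterior means of q(z|x)
labels = np.random.randint(0, 10, 2000)     # stand-in for ground-truth classes

z_2d = TSNE(n_components=2).fit_transform(z_mu)
plt.scatter(z_2d[:, 0], z_2d[:, 1], c=labels, cmap="tab10", s=3)
plt.savefig("tsne_z.png")
```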

7.2 Semi-Unsupervised Runs

7.2.1 Fashion-MNIST Results

We show the results after semi-unsupervised learning for both our models and for the models M2 and ADGM in Table 1. Recall that we are masking out all labels for classes 0-4, and keeping 2% of labels for classes 5-9. We can see that M2 and ADGM do not perform well on the unsupervised sub-problem. Our models do learn, and have similar accuracy for the semi-supervised and the unsupervised classes.

Figure 4 compares the confusion matrices for models M2 and AGM-DGM. We see here M2's posterior collapse in q_φ(y|x) over the subspace of unsupervised classes, and that our AGM-DGM avoids this collapse. Instead it performs accurate classification of the test set, with a learnt q_φ(y|x) that is confident for classes from both the unsupervised and the semi-supervised groups. M2 and ADGM give similar confusion matrices to each other, as do GM-DGM and AGM-DGM. We note that our explicit mixture models have superior performance overall.

(a) M2
(b) AGM-DGM
Figure 4: Confusion matrices of the Fashion-MNIST test set for models M2 and AGM-DGM, from runs in Table 1.

7.3 Unsupervised Runs

7.3.1 Fashion-MNIST Results

See Table 2 for the clustering performance of our models. As Fashion-MNIST is a relatively new dataset, the same baselines are not available as for MNIST. This does not matter greatly for us, as the primary purpose of these results is to compare against the performance obtained on the unsupervised sub-problem within semi-unsupervised learning.

For our models we see that the accuracy obtained on the unsupervised classes here is lower than the accuracy on the unsupervised classes within semi-unsupervised learning in Table 1, for the same model hyperparameters.

Model            dim(z)  dim(a)  dim(y)    Accuracy
Baseline
DEC Guo (2018)      —       —       —        61.8
Our Models
GM-DGM             200      —       15
GM-DGM             200      —       30
GM-DGM             300      —       15
GM-DGM             300      —       30
AGM-DGM            200     100      15
AGM-DGM            200     100      30
AGM-DGM            300     100      15

Table 2: Test set accuracy of GM-DGM and AGM-DGM trained unsupervised on Fashion-MNIST for different hyper-parameters, over four runs, and compared to one baseline.

7.3.2 MNIST Results

Table 3 presents the clustering performance of our models compared against published baselines. We have divided the baselines into blocks by type of model: the first is VAE-like DGMs, the second GAN-like DGMs, and the third other deep clustering algorithms, with the best in each block in bold. For the AGM-DGM, training was unstable for some settings, sometimes showing posterior collapse in y; the best single AGM-DGM run reached an accuracy of 97.9%, better than any run from the GM-DGM.

Our results show we can reach (96.3±0.9)% test set accuracy, which is state-of-the-art clustering of MNIST for probabilistic deep generative models, but we do not outperform IMSAT Hu et al. (2017) or ACOL-GAR Kilinc & Uysal (2018) at this task.

Baseline Models
Model                                    Accuracy (%)
VaDE Jiang et al. (2017)
GMVAE Dilokthanakul et al. (2017)        83.2 ± 3.8
GMVAE Dilokthanakul et al. (2017)        87.8 ± 5.3
GMVAE Dilokthanakul et al. (2017)        92.8 ± 1.6
CatGAN Springenberg (2016)               90.3
AAE Makhzani et al. (2016)               90.5 ± 2.2
AAE Makhzani et al. (2016)               ± 1.1
IMSAT Hu et al. (2017)                   ± 0.4
DEC Xie et al. (2016)                    84.3
JULE Yang et al. (2016)                  96.1
ACOL-GAR Kilinc & Uysal (2018)           98.3 ± 0.1
Our Models
Model       dim(z)  dim(a)  dim(y)       Accuracy (%)
GM-DGM       200      —       15         91.1 ± 3.5
GM-DGM       200      —       30         ± 0.9
GM-DGM       300      —       15         94.5 ± 1.2
GM-DGM       300      —       30         95.7 ± 0.9
AGM-DGM      200     100      15         90.0 ± 3.9
AGM-DGM      200     100      30         95.7

Table 3: Test set accuracy of GM-DGM and AGM-DGM trained unsupervised on MNIST for different hyper-parameters, over four runs, compared to various published baselines. We have divided the baselines into blocks by type of model.

8 Conclusion

We introduced semi-unsupervised learning, a regime that requires a model to be able to cluster while also learning from semi-supervised data. We presented two models that can do this, each made by changing previous DGMs so as to enable them to perform both clustering and semi-supervised learning jointly.

We do this by making DGMs that explicitly have a mixture model in their latent space. We have demonstrated that our models can learn in the semi-unsupervised regime.

Both the unsupervised and semi-supervised classes show increases in accuracy. Of particular interest is that the accuracy obtained on the unsupervised classes within the semi-unsupervised problem is higher than when training the model entirely unsupervised. This tells us that there has been internal transfer learning taking place between the representations of the semi-supervised and unsupervised classes, improving the results of the latter. The models are effectively leveraging the limited labelled data to improve the representations for all classes.

Despite the superior published semi-supervised performance of ADGM against M2, we do not see a significant difference between AGM-DGM and GM-DGM.

Further work could be to add to these models the various improvements available, as listed in Appendix A. Other DGMs that can learn in both required regimes could also be explored under semi-unsupervised learning, such as GAN-based methods like CatGANs and AAEs. We hope that this new learning regime is studied further, as algorithms that can work within it could be of great use.

Acknowledgements

We would like to thank Raza Habib, Aiden Doherty and Rui Shu for their useful discussion. We also thank Miguel Morin for help with running experiments.

References

Appendix A Improvements to VAE-derived models

Firstly, one can make improvements in the tightness of the variational bound, such as Importance Weighted Autoencoder Burda et al. (2016), though tighter is not always better Rainforth et al. (2018).

Secondly, there are numerous methods for improving the expressiveness of the posterior latent distribution using a normalising flow Rezende & Mohamed (2015), such as an inverse autoregressive flow Kingma et al. (2016).

Thirdly, one can instead have a discrete latent variable, such as that in the VQ-VAE van den Oord et al. (2017), which is obtained by look-up in a learnt, differentiable embedding.

Fourthly, we can add more stochastic layers, such as in a Ladder VAE Sønderby et al. (2016) or stochastic ResNets Kingma et al. (2016).

And finally, it is possible to improve the performance of deep models in general by using more sophisticated networks. Restricting ourselves to VAE-based models: CNNs, as used in Dilokthanakul et al. (2017); Kilinc & Uysal (2018); Salimans et al. (2015), ResNets Kingma et al. (2016), and recurrent neural networks (RNNs) Pu et al. (2016) can all be used. Variational Lossy Autoencoders Chen et al. (2016) give us a principled way to combine autoregressive decoders, such as an RNN, PixelRNN/CNN van den Oord et al. (2016a, b), or MADE Germain et al. (2015), with VAEs without the decoder modelling all latent structure internally. Relatedly, there have been recent advances in understanding the collapse of the variational posterior when using such powerful networks, through considering the mutual information between x and z Phuong et al. (2018); Alemi et al. (2018).

Appendix B Model Implementation and Data Preprocessing

Each distribution is parameterised by a neural network. The networks are small MLPs with two hidden layers, except for p_θ(z|y), which has no hidden layers: it simply maps from a one-hot encoding of y to μ_θ(y) and to log σ²_θ(y). For networks representing Gaussian distributions, in both M2/GM-DGM and ADGM/AGM-DGM, the networks for μ and log σ² are shared up to the second hidden layer, each having its own output layer.

Between models, identical architectures were used for networks with the same inputs and outputs. Kernel initialisation was from a zero-mean Gaussian distribution; biases were initialised with zeros. Weights were regularised via a Gaussian prior, as in Kingma et al. (2014). Our code is based on the template code associated with Gordon & Hernández-Lobato (2017).

We perform stochastic gradient optimisation to maximise the objectives in each case, using Adam Kingma & Lei Ba (2015) with default moment parameters. For the objectives of these models we must approximate the various expectations taken with respect to the Gaussian variational distributions, which we do using the reparameterisation trick, taking MC samples in each case. The batch size was 100 for both labelled and unlabelled data. We trained for up to 2000 epochs.

For both MNIST and Fashion-MNIST we kept only the dimensions of the data with a standard deviation greater than 0.1, and treated the resulting masked greyscale images as representing independent Bernoulli distributions. We then binarised this data at train time, taking a fresh draw to represent each image in a batch.
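A sketch of this preprocessing, with random stand-in data; the 0.1 threshold and the Bernoulli binarisation follow the description above, and the rest is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.random((60_000, 784))          # stand-in for greyscale images in [0, 1]

keep = x_train.std(axis=0) > 0.1             # mask of retained pixel dimensions
x_masked = x_train[:, keep]

def binarise_batch(x_batch):
    """Treat each pixel as an independent Bernoulli and take one draw per image."""
    return (rng.random(x_batch.shape) < x_batch).astype(np.float32)

x_bin = binarise_batch(x_masked[:100])
print(x_masked.shape, x_bin.min(), x_bin.max())
```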

Appendix C Activity of units

In Figure C.5 below we show the average over the test set of:

KL[ q_φ(z_i | x, ŷ) ‖ p_θ(z_i | ŷ) ],    (36)

where ŷ = argmax_y q_φ(y|x) and i indexes over the units of z. We do this for a GM-DGM trained unsupervised on MNIST, broken out by learnt, unsupervised class, with line colouring indicating the ground-truth class.
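A minimal sketch of the per-unit KL computation in Eq 36, assuming the Gaussians' means and variances have already been extracted from the model; the toy numbers just illustrate an active versus an inactive unit.

```python
import numpy as np

def unitwise_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), elementwise over latent units."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# toy numbers: unit 0 carries information (posterior differs from prior), unit 1 does not
mu_q, var_q = np.array([2.0, 0.0]), np.array([0.3, 1.0])
mu_p, var_p = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(unitwise_kl(mu_q, var_q, mu_p, var_p))   # large for unit 0, ~0 for unit 1
```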

Even though z has a few hundred units, only the same small number of around 20 have non-zero KL. Further, in most dimensions the distributions of the cluster components in the generative model are very similar. We can test this via the KL divergences between each ordered pair of p_θ(z_i|y) distributions, plotted in Figure C.6 below. Other than the same 20 or so units which we can see are active in Figure C.5, the divergences are small. From this we can see that, for the remaining units, p_θ(z_i|y) is essentially the same over all classes. This would make sense if the active, class-varying units are encoding class information, and the remaining units, with the same distribution regardless of class, are encoding 'style' information.

We find that with both the GM-DGM and the AGM-DGM we get better performance by adding extra classes to y and having the dimensionality of z be 200 or 300.

Figure C.5: Average activation over the test set, calculated per unit in z, for a GM-DGM trained unsupervised on MNIST. Calculated per unit as KL[q_φ(z_i|x, ŷ) ‖ p_θ(z_i|ŷ)], where ŷ = argmax_y q_φ(y|x) and i indexes units in z. We use this definition as the internal representation of the model is not the same as the class labels in the ground truth: we have permutations, and multiple components in the model can correspond to one ground-truth class. We stratify the results by ŷ and colour by ground-truth class. We order the units by overall weighted KL. We only show the first 20 units, the remainder having a KL divergence close to zero across the test set. Best viewed in colour.
Figure C.6: Unit-wise KL divergences between the conditional distributions p_θ(z_i|y) for each ordered pair of values of y, from the GM-DGM trained unsupervised on MNIST, the same trained instance as in Figure C.5. The ordering of units is the same for all sub-graphs as in that figure. These graphs are best viewed digitally. We can see that outside the first 20 or so units, the unit-wise divergence drops to close to zero for all ordered pairs of values of y. Note that different graphs have different y-axis scales.