deepbelief
Code to define and train a GPDBN model (Gaussian Process Deep Belief Network).
The shape of an object is an important characteristic for many vision problems such as segmentation, detection and tracking. Being independent of appearance, it is possible to generalize to a large range of objects from only small amounts of data. However, shapes represented as silhouette images are challenging to model due to complicated likelihood functions leading to intractable posteriors. In this paper we present a generative model of shapes which provides a low dimensional latent encoding which importantly resides on a smooth manifold with respect to the silhouette images. The proposed model propagates uncertainty in a principled manner allowing it to learn from small amounts of data and providing predictions with associated uncertainty. We provide experiments that show how our proposed model provides favorable quantitative results compared with the state-of-the-art while simultaneously providing a representation that resides on a low-dimensional interpretable manifold.
The space of silhouette images is challenging to work with as it is not smooth in terms of a representation as pixels. A transformation that we would consider semantically smooth might correspond to a drastic change in pixel values. Our goal is to learn a smooth low dimensional representation of silhouette images such that images can be generated in a natural manner. Further, as data is at a premium, we want to learn a fully probabilistic model that allows us to propagate uncertainty throughout the generative process. This will allow us to learn from smaller amounts of data and also associate a quantified uncertainty to its predictions. This uncertainty allows the model to be used as a building block in larger models.
The results of our model challenge the current trend in unsupervised learning towards maximum likelihood training of increasingly large parametric models with increasingly large datasets. We demonstrate that by propagating uncertainty throughout the model, our approach outperforms two standard generative deep learning models, a Variational Auto-Encoder (VAE [15]) and a Generative Adversarial Network (InfoGAN [5]) with comparable architectures and can achieve similar performance with far smaller training datasets.
In our work we revisit a few classic machine learning models with complementary properties. On the one hand, parametric models such as Restricted Boltzmann Machines (RBMs) [25] are particularly interesting as they are stochastic, generative and can be stacked easily into deeper models such as deep belief networks (DBNs); these can be trained in a greedy fashion, layer by layer [13]. RBMs can approximate a probability distribution on visible units. DBNs, in addition, learn deep representations by composing features learned by the lower layers, yielding progressively more abstract and flexible representations at higher layers and often leading to more expressive and efficient models compared to shallow ones [2].
However, DBNs suffer from a number of limitations. Firstly, they do not guarantee a smooth representation in the learned latent space. Secondly, the classic contrastive divergence algorithm used for greedy training is slow and can place limitations on architectures. Finally, a DBN does not provide any explicit generative process from a manifold, as the standard way to sample from a DBN is to start from a training example and perform iterations of Gibbs sampling.
The Gaussian Process Latent Variable Model (GPLVM) [17] combines a Gaussian process (GP) prior with a likelihood function in order to learn a representation. By specifying a prior that encourages smooth functions a smooth latent representation can be recovered. However, to make inference tractable the likelihood is also chosen to be Gaussian which does not reflect the statistics of natural images. Further, even though the mapping from the latent space is non-linear the posterior is linear in the observed space. This makes the GPLVM unsuitable for modelling images. To circumvent this one can compose hierarchies of GPs [6], however, these models are inherently difficult to train.
The characteristics of the DBN and GPLVM can be considered complementary: where the DBN excels the GPLVM fails, and vice versa. Unfortunately, combining the two models into a single one by simply stacking a GPLVM on top of a DBN would not preserve uncertainty propagation. Furthermore, this would pose a challenge to training: while the GPLVM is a non-parametric model trained by optimizing an objective function, a DBN is a parametric model, with non-differentiable Bernoulli units, and is trained with contrastive divergence. Another important challenge is learning from very little data. The ability to learn from a small dataset expands the applicability of a model to domains where there is a lack of available data or where collection of data is costly or time-consuming.
In this paper we address these challenges and present the following contributions:
A model (which we call GPDBN) that combines the properties of a smooth, interpretable manifold for synthesis with a data specific likelihood function (a deep structure) capable of decomposing images into an efficient representation while propagating uncertainty throughout the model in a principled manner.
We train the model end to end using back propagation with the same complexity as a standard feed-forward neural network by minimising a single objective function.
We also show that the model is able to learn from very little data, outperforming current generative deep learning models, as well as scaling linearly to larger datasets by the use of mini-batching.
Modelling of shape is important for many computer vision tasks. A complete review of the topic is beyond the scope of this paper; we refer the reader to the comprehensive work of Taylor et al. [7]. In our work we focus on recent unsupervised statistical models that operate directly on the pixel domain. Interest in these models was revived by the Shape Boltzmann Machine (SBM) work of Eslami et al. [10] and they have been shown to be useful for a variety of vision applications [9, 16, 29]. These deep models can also be readily extended into the 3D domain, e.g., by recent work on 3D ShapeNets [32]. Detailed analysis of the DBN, GPLVM and SBM is provided in § 3.
Table 1 highlights the desirable properties of the most closely related previous works. We have identified four advantageous properties: (i) It is well known that pixel silhouettes are not well modelled by a Gaussian likelihood. (ii) The utility of an unsupervised shape model is well described by the properties of its latent representation. Ensuring a smooth manifold opens up a number of applications to data in the pixel domain that previously required custom representations, e.g., interactive drawing [30]. (iii) A fully generative model ensures that there is a well defined space that can be sampled as well as interpreted; e.g., dynamics models can be defined in such a space to perform tracking [20, 22]. (iv) Correctly propagating uncertainty is vital to perform data efficient learning, for example when data is scarce or expensive to obtain.
The VAE model by Kingma and Welling [15] performs a variational approximation of a generative model with a non-Gaussian likelihood through a feed-forward or Multi-Layer Perceptron (MLP) network. In addition, it uses MLP networks to encode the variational parameters (in a similar manner to [18]). While this model provides a generative mapping, the feed-forward (decoder) network fails to propagate uncertainty from the latent space. Furthermore, the independent prior on the latent space does not promote a smooth manifold; any smoothness arises as a by-product of the MLP encoding network. This characteristic depends on the MLP architecture and is not directly parametrised. The key limitation of the VAE for our purposes is the lack of uncertainty propagation, which leads to poor results with limited training data.
The guided, non-parametric autoencoder model of Snoek et al. [26] appears similar; however, there are a number of important differences. They use label information (supervision) to guide a latent space learning process for an autoencoder; this is not a pure unsupervised learning task and we do not have label information available to us. Furthermore, as with the VAE, uncertainty is not propagated from the latent manifold to the output space due to the use of the feed-forward network to the output.
Another prominent generative model in unsupervised learning is the Generative Adversarial Network (GAN) [11]. The model learns an implicit generator distribution using a minimax game between a deep generator network, which transforms a noise variable to a sample, and a deep discriminator network, which is used to classify between samples from the generator distribution and the true data distribution. One issue common with GAN models is that, like the VAE, they provide neither a smooth latent manifold for synthesis nor uncertainty in their estimates. From the many GAN variants available in the literature we have chosen to include in our comparisons the InfoGAN model [5], since it also considers the goal of interpretable latent representations (by maximising the mutual information between a subset of the GAN's noise variables and observations).
The recent ShapeOdds work of Elhabian and Whitaker [8] confers state-of-the-art performance and captures many of the desired properties including a generative probabilistic model that propagates uncertainty. The approach taken is quite different to ours as they specify a detailed probabilistic model including a Gaussian Markov Random Field (MRF) with individual Bernoulli random variables for the pixel lattice. In contrast, our model is more flexible: we allow the network to learn the structure from the data directly but ensure that we still maintain uncertainty quantification throughout. We would also argue that the specific form of the low dimensional manifold we generate is desirable with its guaranteed smoothness that makes the latent space readily interpretable. This provides the tradeoff between the two models. We expect the ShapeOdds model to perform very well at generalisation due to the inclusion of the MRF prior. In contrast, our model will be more data dependent in this respect (weaker prior assumptions on the nature of images); however, it provides a generative space that is highly interpretable and easy to work with. We identify that a topic for further work would be to combine our smooth priors with the likelihood model of ShapeOdds.
A possible workaround to the problem of non-Gaussian likelihoods is to perform a deterministic transformation to a domain where the data is approximately Gaussian. This has been successful for domains where, for example, the shape can be represented in a new geometric representation away from pixels, e.g., parametric curves [3, 23]. However, this is application dependent and not suitable for arbitrary pixel based silhouettes considered here. A common approach that retains the pixel grid is to transform it into a level-set problem via the distance transform, e.g., [22]. This can improve results in some settings, however, the uncertainty is not correctly preserved and therefore not correctly captured in predictions. We denote this model GPLVMDT in our comparisons.
The restricted Boltzmann machine (RBM), or Harmonium [25], is a generative stochastic neural network that learns a probability distribution over a vector of random variables. When stacked, the RBM is the basic component of a deep belief network. The graphical model of the RBM is an undirected bipartite graph, consisting of a set of visible random variables (or units) $\mathbf{v}$ and a set of hidden units $\mathbf{h}$ (Fig. 1(a)). Typically, all variables are binary (Bernoulli), taking on values from $\{0, 1\}$.
The RBM model specifies a probability distribution over both the visible and hidden variables jointly as
$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\left(-E(\mathbf{v}, \mathbf{h})\right), \tag{1}$$
which defines a Gibbs distribution with energy function
$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{v}^\top \mathbf{W} \mathbf{h} - \mathbf{b}^\top \mathbf{v} - \mathbf{c}^\top \mathbf{h}, \tag{2}$$
where $\mathbf{W}$, $\mathbf{b}$, $\mathbf{c}$ are the parameters of the model: $\mathbf{W}$ is a linear weight matrix, and $\mathbf{b}$ and $\mathbf{c}$ are bias vectors for the visible and hidden units respectively. The normalising constant $Z = \sum_{\mathbf{v}, \mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h})\right)$ is the computationally intractable sum over all possible random vectors $\mathbf{v}$ and $\mathbf{h}$.
The bipartite structure of the model (i.e., the graph has no visible-visible or hidden-hidden connections, as shown in Fig. 1(a)) affords efficient Gibbs sampling from the visible units given the hidden variables (or vice versa). The conditional distributions of the hidden units given the visible ones, and vice versa, factorize because each set of variables is conditionally independent given the other:
$$p(\mathbf{h} \mid \mathbf{v}) = \prod_j p(h_j \mid \mathbf{v}), \qquad p(\mathbf{v} \mid \mathbf{h}) = \prod_i p(v_i \mid \mathbf{h}). \tag{3}$$
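The energy function (2) and the factorized conditionals (3) can be illustrated with a small NumPy sketch. It assumes the standard binary RBM, for which each conditional is a logistic sigmoid of a linear function of the other layer; the names `W`, `b`, `c`, the layer sizes and the random initialisation are illustrative choices, not taken from the authors' code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def energy(v, h, W, b, c):
    """Energy of a joint binary configuration (v, h) of a standard RBM."""
    return -(v @ W @ h) - b @ v - c @ h

def sample_hidden(v, W, c, rng):
    """Sample all hidden units in one step; they are independent given v."""
    p = sigmoid(c + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(h, W, b, rng):
    """Sample all visible units in one step; they are independent given h."""
    p = sigmoid(b + W @ h)
    return (rng.random(p.shape) < p).astype(float), p

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3                      # hypothetical sizes
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

# One block Gibbs step: visible -> hidden -> visible.
v = rng.integers(0, 2, size=n_visible).astype(float)
h, p_h = sample_hidden(v, W, c, rng)
v_new, p_v = sample_visible(h, W, b, rng)
```

Because neither conditional involves the normalising constant, a full layer can be resampled in one vectorised step, which is what makes block Gibbs sampling cheap for this model.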
Replacing binary units with Gaussian units can be performed by modifying the energy function [12]. Unfortunately, parameter learning is difficult since direct calculation of the gradients of the log likelihood w.r.t. the parameters requires the intractable computation of the normalising constant $Z$. In current practice, the approximate maximum-likelihood contrastive divergence algorithm is used [4].
When multiple layers of RBMs are stacked on top of each other they form a deep belief network (Fig. 1(b)). Hinton et al. [13] demonstrated that a DBN can be trained in a greedy fashion, layer by layer. Essentially, the samples (activations) from the hidden units of a trained layer are used as the data to train the next layer in the stack.
Sampling from an RBM proceeds by conditioning on some input data and performing a Gibbs sample for the hidden units. Subsequently, a Gibbs sample can be drawn for the visible units by conditioning the hidden units on this sample. This process is then repeated for a number of cycles. Since a DBN is a stack of RBMs, this process has to be repeated for all layers; the output of one layer becomes the input to condition on for the next layer. In this way, an input data point can be propagated up and down the network.
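The up-and-down propagation described above can be sketched as follows. This is a simplified illustration that assumes binary units throughout and draws a single sample per layer in each direction, rather than running several Gibbs cycles within each RBM; the layer widths and parameter values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def up_down_pass(v, layers, rng):
    """Propagate a visible vector up through every layer, then back down."""
    x = v
    for W, b, c in layers:              # upward: sample each hidden layer
        x = bernoulli(sigmoid(c + x @ W), rng)
    for W, b, c in reversed(layers):    # downward: sample the layer below
        x = bernoulli(sigmoid(b + W @ x), rng)
    return x

rng = np.random.default_rng(1)
sizes = [8, 5, 3]                       # hypothetical layer widths
layers = [
    (0.1 * rng.standard_normal((m, n)), np.zeros(m), np.zeros(n))
    for m, n in zip(sizes[:-1], sizes[1:])
]

# Start from an input vector and repeat the pass for a few cycles.
v = rng.integers(0, 2, size=sizes[0]).astype(float)
for _ in range(5):
    v = up_down_pass(v, layers, rng)
```

Note that the chain must be started from some input vector; there is no latent location from which a sample can be drawn directly.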
Although a DBN is good at learning low-dimensional stochastic representations of high-dimensional data, it has three key drawbacks that we will address by combining the strengths of the DBN with a flexible non-parametric model in § 4:
It lacks a directed generative sampling process from a well defined latent representation. In order to generate a sample one must condition on some input data and propagate it through the network back and forth until a sample from the lowest layer is obtained.
There is no explicit representation of the uncertainty; instead, this only arises implicitly through the propagation of point estimates (samples) at each layer.
A side effect of the conditional independence assumption of (3) is that the correlations between the hidden units of the top layer of a DBN are not captured because each latent dimension is independent. Most importantly, a DBN does not, therefore, give any guarantee about learning a smooth latent space.
The Gaussian Process Latent Variable Model (GPLVM) [17] learns a generative representation by placing a Gaussian process (GP) prior over the mapping from the latent to the observed data. This approach has the benefit that it is very easy to ensure a smooth mapping from the latent representations to the observed data. Further, due to the principled uncertainty propagation of the GP, all predictions will have an associated uncertainty.
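Specifically, each observed datapoint $\mathbf{y}_n$, $n = 1, \dots, N$, is assumed to be generated by a latent location $\mathbf{x}_n$ through a mapping $f$. Due to the marginalising property of a Gaussian, the predictive posterior over function values at a test location $\mathbf{x}_*$ can be reached in closed form as
$$p(f_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{Y}) = \mathcal{N}\left(\mu_*, \sigma_*^2\right), \tag{4}$$
$$\mu_* = k(\mathbf{x}_*, \mathbf{X})\, \mathbf{K}^{-1} \mathbf{Y}, \tag{5}$$
$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - k(\mathbf{x}_*, \mathbf{X})\, \mathbf{K}^{-1} k(\mathbf{X}, \mathbf{x}_*), \tag{6}$$
where $k(\cdot, \cdot)$ is the covariance function specifying the Gaussian process and $\mathbf{K} = k(\mathbf{X}, \mathbf{X})$. We used the common squared exponential kernel
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{\lVert \mathbf{x} - \mathbf{x}' \rVert^2}{2 \ell^2}\right), \tag{7}$$
with hyperparameters $\sigma_f^2$ (signal variance) and $\ell$ (lengthscale), to ensure a smooth manifold. Importantly, even though the function $f$ can be non-linear, the relationship between the predicted mean (5) and the training data $\mathbf{Y}$ is linear. Due to this linearity, a GPLVM is inherently not suitable for modeling image data.
The Shape Boltzmann Machine (SBM) [10] is a specific architecture of the Boltzmann machine. It consists of three layers: a rectangular layer of visible units $\mathbf{v}$, and two layers of latent variables, $\mathbf{h}^1$ and $\mathbf{h}^2$. Each hidden unit in $\mathbf{h}^1$ is connected only to one of the four subsets of visible units of $\mathbf{v}$ (Fig. 1(c)). Each subset forms a rectangular patch and the weights of each patch (except the biases) are shared so that a patch effectively behaves as a local receptive field. To avoid boundary inconsistencies, the patches are slightly overlapped (the overlap is shown in Fig. 1(c)). Layer $\mathbf{h}^2$ is fully connected to $\mathbf{h}^1$.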
While the SBM offers improved generalization over a DBN with the same number of parameters, the SBM has a fixed structure which is not easily extended to more layers or patches. In contrast, a DBN, as a stack of simple RBMs, has a more generic and flexible structure which can be adapted easily and combined with other models. Furthermore, like the DBN, the SBM lacks a proper generative process.
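Returning to the GP predictive posterior used by the GPLVM (Eqs. (4)-(6) with the squared exponential kernel (7)), the following is a minimal NumPy sketch of the standard textbook computation. The function names, the toy inputs and the small `jitter` added to the diagonal for numerical stability are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def se_kernel(A, B, signal_var=1.0, lengthscale=1.0):
    """Squared exponential covariance between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_predict(X, Y, X_star, jitter=1e-6, **kern):
    """Predictive mean and variance of a GP at the test locations X_star."""
    K = se_kernel(X, X, **kern) + jitter * np.eye(len(X))
    K_star = se_kernel(X_star, X, **kern)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))   # K^{-1} Y
    mean = K_star @ alpha
    V = np.linalg.solve(L, K_star.T)
    var = np.diag(se_kernel(X_star, X_star, **kern)) - np.sum(V**2, axis=0)
    return mean, var

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-2.0, 1.0]])
Y = rng.standard_normal((5, 3))                 # toy "observed" data

mean_train, var_train = gp_predict(X, Y, X)     # at the training locations
mean_far, var_far = gp_predict(X, Y, np.array([[50.0, 50.0]]))

# The mean is linear in Y: doubling the data doubles the prediction.
X_test = np.array([[0.5, 0.5]])
mean_a, _ = gp_predict(X, Y, X_test)
mean_b, _ = gp_predict(X, 2.0 * Y, X_test)
```

The last three lines illustrate the linearity discussed above: the mapping from latent location to prediction is non-linear, but the predicted mean is a linear combination of the training outputs. The variance, in turn, is near zero at the training locations and returns to the signal variance far from them.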
In our model, we connect a DBN and GPLVM so that the data space of the GPLVM corresponds to the latent space of the DBN (Fig. 1(d)) to obtain a model that can be optimized by minimizing a single objective function.
The uppermost hidden layer of the DBN has Gaussian units to interface with the Gaussian likelihood of the GPLVM. In the lower layers, we replace the standard binary units with a Concrete distribution [21]. This is a continuous relaxation of discrete random variables, in our case, of the Bernoulli distribution. This allows us to draw low bias samples, in an analogous manner to the reparameterization trick [15], using a function that is differentiable with respect to the model parameters,
(8) |
where is the parameter of a Bernoulli distribution, is a scaling factor, which we fix to , and is a uniform sample from .
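As an illustration of such a relaxation, the sketch below draws binary Concrete samples using the standard logistic-noise form of Maddison et al. [21]. This is an assumption about the general technique and may differ in parametrisation from Eq. (8); in particular the `temperature` argument and its value of 0.1 are hypothetical and not the scaling factor used in the paper.

```python
import numpy as np

def sample_binary_concrete(p, temperature, rng):
    """Relaxed Bernoulli(p) sample in [0, 1], a smooth function of p."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(p))
    logits = np.log(p) - np.log1p(-p)
    noise = np.log(u) - np.log1p(-u)            # logistic noise from uniform u
    return 1.0 / (1.0 + np.exp(-(logits + noise) / temperature))

rng = np.random.default_rng(0)
p = np.full(100_000, 0.8)
soft = sample_binary_concrete(p, temperature=0.1, rng=rng)

# Thresholding the relaxed samples recovers Bernoulli(p) statistics.
frac = np.mean(soft > 0.5)
```

Since the randomness enters only through `u`, the sample is a deterministic, differentiable function of `p`, which is what lets gradients pass through the stochastic units. Lower temperatures push the samples towards 0 or 1.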
Given a dataset , we train the model end-to-end by minimizing the following objective function jointly with respect to all the parameters and the matrix of latent points (omitted from the notation to avoid clutter):
(9) |
Here, is a training datapoint, is a sample from the model, is the covariance matrix of the latent points and is the number of Gaussian units in the uppermost DBN layer (equal to the dimension of the GPLVM output space). We use a standard Gaussian as the prior on . The variance of the noise parameter is and is an identity matrix.
To join the two models, the matrix of activations , from the Gaussian units, is defined as:
(10) |
where is a matrix in which each row is the mean output of the Gaussian units corresponding to each input training datapoint. This is combined with , the vector of predictive standard deviations from the GPLVM , and , the vector of standard deviation parameters of the Gaussian units. Note that is an outer product, and is an element-wise product.
The matrix represents the observed data for the GPLVM and is updated at each training iteration by sampling a different matrix of independent Gaussian noise, . This is a second application of the reparameterization trick. At each iteration, is always normalized, to match our zero mean GP assumption, by subtracting its column-wise mean and dividing by .
The objective (9) can be evaluated on a uniformly drawn subset of data yielding an estimator for the full objective,
(11) |
where and corresponds to and evaluated on the subset of . Using this estimator the model can be optimised using mini-batching to scale linearly to larger datasets. We note that the matrix inversion does introduce bias into the estimator; empirical results suggest this is small and removing it is a topic for future work.
When defining the likelihood directly over the pixels, the fully-connected conditional independence of the RBM layers limits scalability in terms of image size. This can be circumvented by adding convolution and deconvolution steps to replace the dense matrix product in (2) in the lower layers.
A sample from the model is drawn by first generating a hidden sample from latent point :
(12) |
using and as the predictive mean and standard deviation of the GPLVM given latent point . This is combined with a sample , a vector of spherical Gaussian noise. The term is the mean vector that is subtracted from in the normalization step. The sample is then propagated down through the DBN, sampling layer-by-layer, to give an output sample .
Since we have a simple sampling process, we can propagate uncertainty for our predictions by taking the empirical mean of a set of samples from the model as for the latent location . Since we can efficiently take gradients through the sampling process, we can project new observations into the latent space by minimizing the reprojection error w.r.t. the latent locations for predictions from a set of random starting locations in the manifold.
We note that the objective (9) consists of terms in contrast with each other. The first encodes a data term that ensures the observed data is well represented by the model. The third provides a complexity term that encourages a simple (low complexity) latent space through the covariance matrix to prevent overfitting.
The second term “glues” the two models together by ensuring that the covariance matrix is a good model of the covariance of the Gaussian units at the top of the DBN. This in turn, ensures that the DBN learns an appropriate network to give sensible Gaussian activations rather than the unconstrained binary activations from a normal DBN. The last term encodes a prior which encourages the latent points to stay close to the origin.
The applications of the reparameterization trick ensure that efficient, low variance samples can be taken during training with gradients propagated throughout all parts of the network. The use of sampling and stochastic networks allows uncertainty to be propagated down through the entire model, so that uncertainty is well quantified both at training and test time.
In keeping with previous work, we evaluated our models in terms of four experiments: (i) Synthesis, that is, generating samples that are plausible. (ii) Representation and Generalisation, demonstrating the ability to capture the variability of the silhouettes away from the training data. (iii) Smoothness, evaluating the quality of the learned latent space through interpolation; smooth trajectories in the latent space should produce smooth variations in the silhouette space. (iv) Scaling, evaluating how the model performs with respect to the size of the training dataset.
In the comparisons, our main model (which we will refer to as GPDBN) consists of a three-layer DBN plus a GPLVM layer connected as described in § 4. From the bottom (observed) to the top (hidden) layer the architecture consists of (Concrete units), (Concrete) and (Gaussian). The connected GPLVM layer has only latent dimensions for easy visualisation. The model is optimized jointly as described in § 4. Our second model, GPSBM, is similar to the GPDBN, except that the three-layer DBN has been replaced with the SBM architecture of [10] with hidden Concrete units in the bottom layer and hidden Gaussian units at the top. We implemented all our models in the TensorFlow [1] framework and trained using the Adam optimizer [14].
We compared our models to the following baselines: (i) A vanilla GPLVM with latent dimensions. (ii) GPLVMDT, a GPLVM operating on a signed distance function representation in a similar manner to [22]; samples are obtained by thresholding through the hyperbolic tangent function. (iii) The state-of-the-art ShapeOdds model [8]. (iv) A DBN with binary units and the same architecture as our GPDBN. (v) The SBM [10] model with binary units (trained layer by layer with contrastive divergence like the DBN) with the same architecture as our GPSBM. (vi) The VAE [15] model with the same architecture as our GPDBN (mirrored for the decoder) and latent dimensions. (vii) An InfoGAN [5] with the same architecture as the VAE and GPDBN (mirrored for the discriminator) and latent dimensions of structured noise.
In keeping with previous work, we trained the models on the Weizmann horse dataset [24], which consists of binary silhouettes of horses facing left. The limited number of training samples and the high variability in the position of heads, tails, and legs make this dataset difficult. We also trained the models on binary images from the Caltech101 dataset of motorbikes facing right [19]. All images in both datasets have been cropped and normalized to pixels. The test datasets consisted of the challenging held-out data from [10]; an additional horses and motorbikes not contained in the training datasets.
Fig. 2(a), shows the manifold learned by the GPDBN on the Weizmann horse dataset. Each blue point on the manifold represents the latent location corresponding to a training datapoint. The heat map is given by the log predictive variance (6) that encodes uncertainty in the latent space. The model is more likely to generate valid shapes from any location in the bright regions (i.e., low variance regions).
Unlike GP based models, a standard DBN (or the SBM) does not learn such a generative manifold. This implies, first of all, that a DBN does not allow us to sample “from the top” in a direct manner. Instead we must provide a test image to the visible units and condition on it before propagating it up and down the network for a few iterations to obtain a sample. Secondly, like the VAE and InfoGAN, a DBN does not provide information about how plausible a generated sample is.
Qualitative comparison of silhouettes generated from low variance manifold areas by each of the models (images manually ordered by visual similarity).
A smooth generative manifold, such as the one learned by our model in Fig. 2(a), is informative as it gives us an indication about where to sample from to get plausible silhouettes. Fig. 2(b) compares silhouettes generated by the models that allow sampling from the manifold (when we show generated silhouettes from any model, we actually show grayscale images denoting pixel-wise probabilities of turning white rather than binary samples). We note that the GPLVM and GPLVMDT produce blurry images since the shapes present interpolation artifacts from the Gaussian likelihood. In contrast, the results from both the GPDBN and GPSBM are sharper.
In the recent literature on shape modelling, quantitative results are reported in terms of the distance between the test data not seen by the model and the most likely prediction under the model. For the models that can be sampled from, this amounts to finding the location on the manifold that most closely represents the test input (discussed for our model in § 4). For the models that learn an explicit manifold we find the closest silhouette to a test silhouette by minimising the following objective with respect to a latent location on the manifold:
(13) |
where we use samples to evaluate the cross entropy to the test silhouette. The second term is the log predictive variance of the latent location (as defined in Eq. (6)); this encourages the model to generate plausible silhouettes from the manifold. The scaling factor ensures that the two terms have approximately the same scale.
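The structure of this objective, as described in the text, can be sketched in NumPy as a cross entropy term plus a scaled log predictive variance. The exact form of Eq. (13) is not reproduced here; averaging the samples before taking the cross entropy, the clipping constant and the value of `scale` are assumptions made for illustration, and in practice the samples and the variance would both be produced by the model at the candidate latent location.

```python
import numpy as np

def projection_objective(samples, target, log_pred_var, scale):
    """Cross entropy of the mean sample to the target plus scaled log variance."""
    eps = 1e-7
    mean_pred = np.clip(np.mean(samples, axis=0), eps, 1.0 - eps)
    cross_entropy = -np.mean(
        target * np.log(mean_pred) + (1.0 - target) * np.log(1.0 - mean_pred)
    )
    return cross_entropy + scale * log_pred_var

rng = np.random.default_rng(0)
target = np.zeros((4, 4))
target[1:3, 1:3] = 1.0                          # toy binary silhouette

# Hypothetical model samples: one set close to the target, one inverted.
good = np.clip(target + 0.05 * rng.standard_normal((10, 4, 4)), 0.0, 1.0)
bad = 1.0 - good

obj_good = projection_objective(good, target, log_pred_var=-2.0, scale=0.1)
obj_bad = projection_objective(bad, target, log_pred_var=-2.0, scale=0.1)
obj_uncertain = projection_objective(good, target, log_pred_var=1.0, scale=0.1)
```

A latent location scores well only if its samples match the test silhouette and it lies in a low-variance region of the manifold, so minimising over latent locations trades off the two terms.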
Samples for a DBN (or SBM) are usually generated by conditioning on an observed sample and propagating it through the network for several cycles, as described in § 3.1, with Gibbs samples taken after a burn in period. In our experiments, we fixed the conditioning on the test datapoint and averaged the results of a number of propagated samples through the model to prevent the sample chain from drifting away from the test data.
To provide a challenging evaluation, we take unseen test data, corrupt it with noise and ask each model to find its most likely silhouette. Simply asking to reconstruct the test data would not be a sufficient evaluation since an identity mapping would be able to perform this task. Instead, we need the model to demonstrate that it can reject data that should not be in the trained model (the noise). In Fig. 3(b), we report the results for our proposed model and the baseline methods. We use the Structural Similarity (SSIM) [31] metric (range [0,1] with high values better) with a small window size of to perform quantitative evaluations since it is known to outperform both cross-entropy and MSE as a perceptual metric. A random sample of corresponding silhouettes for the horse dataset is provided in Fig. 3(a). We also test our model in a more challenging environment, Fig. 4, where test data has been corrupted by significant noise. The quantitative comparisons show that our GPDBN and GPSBM models have captured a high quality probabilistic estimate of the data manifold while still preserving interpretability.
We trained GPDBN, VAE and InfoGAN models on an image dataset (which we call the stars dataset) generated from a known 1-dimensional manifold using a simple script. The full dataset is displayed in the top row of Fig. 5. The deterministically generated dataset allows us to determine quantitatively whether interpolations in the latent space are representative of the true data distribution. The middle rows of Fig. 5 show the model outputs for the interpolation between two latent points corresponding to a four-pointed star (leftmost sample) and a square (rightmost sample). The uncertainty information of the GPDBN allows us to go from one point to the other passing through low-variance regions by following a geodesic [28]. We can see that the GPDBN produces smoothly varying shapes of high quality that reflect the true manifold. In contrast, the VAE and InfoGAN results do not smoothly follow the true manifold and contain some erroneous interpolants that are not part of the true distribution; this is supported by the quantitative results that measure the quality of the samples against the true data using SSIM. The ability to exploit variance information in the GPDBN is clearly an advantage over the VAE and InfoGAN, where the absence of direct access to the latent predictive posterior distribution prevents easy access to geodesics. Further demonstrations of the smoothness are available in the supplementary material.
[Figure 5: rows show the dataset and the interpolations produced by the GPDBN, VAE and InfoGAN, with an SSIM score for each model.]
In Fig. 6 we compare the performance of the GPDBN, InfoGAN and VAE models as the size of the training dataset increases; here we use the standard MNIST digit dataset. We used a 10-dimensional latent space for all three models to account for the larger quantity of data. Similarly to the experiments in Figs. 3 and 4, we took 30 random images from the MNIST test data, added 20% salt-and-pepper noise, and calculated the SSIM score between the output of the models and the test data without noise. We plotted the score against dataset size (in log scale). We can see that the GPDBN model is able to capture a high quality model of the data manifold even from small datasets; for example, it achieves the same quality as a VAE trained on 10,000 images using only 100. We argue that the propagation of uncertainty throughout the model provides the advantage over both the VAE and InfoGAN, which are both trained with only maximum likelihood approaches.
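The corruption step can be sketched as below. This assumes that "20% salt-and-pepper noise" means that 20% of the pixels, chosen uniformly at random, are overwritten with a random binary value; the paper does not spell out the exact procedure, and the function name and toy image are illustrative.

```python
import numpy as np

def salt_and_pepper(image, fraction, rng):
    """Overwrite a random fraction of the pixels with random 0/1 values."""
    noisy = image.copy()
    flat = noisy.reshape(-1)                    # view onto the copy
    n = int(round(fraction * flat.size))
    idx = rng.choice(flat.size, size=n, replace=False)
    flat[idx] = rng.integers(0, 2, size=n)
    return noisy

rng = np.random.default_rng(0)
image = np.zeros((28, 28))                      # toy stand-in for a test digit
image[8:20, 8:20] = 1.0
noisy = salt_and_pepper(image, fraction=0.2, rng=rng)
n_changed = int(np.count_nonzero(noisy != image))
```

Because an overwritten pixel can receive the value it already had, the number of pixels that actually change is at most 20% of the image.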
In Fig. 7 we provide results that demonstrate that our approach also overcomes scaling issues normally present in GP models and DBNs. Firstly, we show training on the 60,000 MNIST images via our proposed mini-batching approach. In addition, we also show the manifold for higher resolution images from the horse dataset. By using convolutional architectures, we can scale the number of parameters in an identical manner to convolutional feed-forward networks, and our Concrete layers allow us to train from random weight initialisation using back propagation without the need to use slow contrastive divergence. With both these approaches we still maintain our full uncertainty model so the same model can perform well with small and large datasets.
We have presented the GPDBN, a model that combines the properties of a smooth, interpretable low-dimensional latent representation with a data-specific non-Gaussian likelihood function (for silhouette images). The model fully propagates and captures uncertainty in its estimates. It is trained end to end, with the same complexity as a standard feed-forward neural network, by minimising a single objective function, and it is able to learn from very little data as well as to scale linearly to larger datasets by using mini-batching. We have shown both quantitatively and qualitatively that our model performs on par with the best shape models while at the same time introducing a smooth and low-dimensional latent representation with associated uncertainty that facilitates easy synthesis of data.
This work was supported by the EPSRC CAMERA (EP/M023281/1) grant and the Royal Society.
Carreira-Perpiñán, M.Á., Hinton, G.E.: On Contrastive Divergence Learning. In: Cowell, R.G., Ghahramani, Z. (eds.) Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, January 6-8, 2005. Society for Artificial Intelligence and Statistics (2005)
Elhabian, S.Y., Whitaker, R.T.: ShapeOdds: Variational Bayesian Learning of Generative Shape Models. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 2185–2196. IEEE Computer Society (2017)
Lawrence, N.D.: Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models. Journal of Machine Learning Research 6, 1783–1816 (2005)
Please see the interactive manifold demonstration included with the supplemental material. The demonstration is a standalone application that can be viewed in a recent JavaScript- and HTML5-compliant browser by loading the file “interactive_demo.html”. The data and rendering code are contained in the “data” directory and are accessed through the standalone webpage. Figure 8 provides an illustration of what the page should look like, with instructions for use in blue text. Figures 9 and 10 show the options to select different models, datasets and zoom levels.
In Figure 11 we provide an illustration of the learned manifold for our GPDBN model on the Caltech101 dataset of motorbikes facing right [19]. We also trained a GPDBN on two types of data at the same time (horses and motorbikes). Fig. 12 shows what the manifold looks like (two separate clusters are clearly distinguishable).
A fundamental problem in machine learning is that there is no objective quantitative method of assessing unsupervised learning models. An identity function that simply outputs the test input would achieve perfect reconstruction; however, under such a model all silhouettes would be equally likely (even implausible ones), and generalising to implausible shapes is not a property that we want. In contrast, the predictive uncertainty of our GPDBN model tells us how plausible a generated silhouette is (a key feature).
Our proposal for a good quantitative assessment is to measure the reconstruction quality of the output generated from a noisy version of the input. An identity function would return the noisy version and be penalised for producing an implausible shape. In contrast, a good model should return a projection onto a plausible shape. Because the closest plausible shape to a noisy input should be the uncorrupted test image, we can compare the model output against that image to quantify performance.
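The protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the reported results: the helper names are hypothetical, the noise model sets each corrupted pixel to 0 or 1 with equal probability, and `global_ssim` is a simplified single-window SSIM over the whole image rather than the usual sliding-window variant. A "model" here is any callable that maps a noisy image to a reconstruction.

```python
import numpy as np

def salt_and_pepper(image, fraction, rng):
    """Set a random `fraction` of pixels to 0 or 1 with equal probability."""
    noisy = image.copy()
    n_corrupt = int(round(fraction * image.size))
    idx = rng.choice(image.size, size=n_corrupt, replace=False)
    noisy.flat[idx] = rng.integers(0, 2, size=n_corrupt)
    return noisy

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (a simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return num / den

def noisy_reconstruction_score(model, clean_images, fraction, rng):
    """Mean SSIM between model(noisy input) and the clean target."""
    scores = [
        global_ssim(model(salt_and_pepper(img, fraction, rng)), img)
        for img in clean_images
    ]
    return float(np.mean(scores))

# Toy silhouette (hypothetical): a filled square on an empty background.
clean = np.zeros((16, 16))
clean[4:12, 4:12] = 1.0
rng = np.random.default_rng(0)

identity_model = lambda noisy: noisy   # returns the corrupted input unchanged
oracle_model = lambda noisy: clean     # stand-in for a perfect projection

identity_score = noisy_reconstruction_score(identity_model, [clean], 0.2, rng)
oracle_score = noisy_reconstruction_score(oracle_model, [clean], 0.2, rng)
```

The identity function scores below the oracle because it reproduces the noise, which is exactly the behaviour the protocol is designed to penalise. The oracle is only a stand-in that already knows the clean shape; a real model has to recover it from the noisy input.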
In addition to the results in the paper, we report in Tab. 13(b) the SSIM reconstruction score for horses corrupted with 10% salt-and-pepper noise to demonstrate that our models consistently outperform the competition even with very little noise.
Fig. 14, Fig. 15 and Fig. 16 show the results of an experiment where test data has been corrupted by significant noise (20%, 40% and 60% respectively) and we wish to project onto the manifold of valid silhouettes. The quantitative comparisons indicate that our models have managed to capture a good probabilistic estimate of the data manifold while still preserving interpretability.
Method | SSIM |
---|---|
Large net (x3 units) | |
Narrow net (1/3 units) | |
Deep net (+1 layer) | |
Shallow net (-1 layer) |
Given the space constraints in the paper, we have given priority to what we believe is most important: we have provided extensive results and comparisons with recent models (plus an interactive demo). In the following paragraphs we address feature ablation and model robustness.
We note that the two components of our GPDBN model, the GPLVM and the DBN, are both essential. Neither can be ablated, because the former provides the smooth manifold and predictive uncertainty while the latter increases the capability of generating image data. Replacing our Concrete units with dropout units impedes uncertainty propagation and reduces performance (Fig. 17). Moreover, our comparisons show that both the GPLVM and the DBN are weaker as standalone models.
The architectures used were based on previous work to enable fair comparison. In addition, we have trained a GPDBN on the horse dataset, experimenting with four different networks (increasing and decreasing the number of units and layers). By removing one layer (Tab. 2) we obtained slightly better performance than with the more generic architecture proposed in the paper. It might be that a shallow network is better suited to this data given the small number of training examples. We see this positively, as the model can be fine-tuned to achieve even higher performance depending on the specific data. Finding the optimal number of layers and weights (parameters) is an open issue common to many deep learning methods. We think that replacing the GPLVM part with a Bayesian one [27] would solve this problem for the GPDBN, allowing it to use the optimal number of parameters automatically.
Traditionally, DBNs are not trained on large images because of the high number of parameters; the use of convolutions is therefore necessary. In addition, GPDBN mini-batching allows us to deal with a large number of images, and empirically any bias it introduces does not noticeably reduce performance. For example, we trained a GPDBN on MNIST digits (as in Fig. 6 in the paper) with mini-batch size , we obtained SSIM: , which is even higher than the non-mini-batched equivalent (SSIM: ).
[Fig. 18: images not recovered. Columns: GPDBN, GPSBM, GPLVM, GPLVMDT.]
This additional experiment also highlights the benefit of the uncertainty associated with our model and how it manifests itself in the proposed GPDBN in contrast with the other methods. There is an inherent trade-off between a simple topology of the manifold and the smoothness of the mapping. To exemplify this, we generated a dataset of a shape deformed in a cyclic manner (Fig. 18). The resulting latent space is structured as a circle, clearly reflecting the topology of the deformation. Importantly, if the uncertainty in the model reflects that of the data, we should move along ridges of high probability (manifold geodesics) to generate realistic data. Our proposed model is directly applicable to such approaches as described in [28]. Further, the experiment highlights how the uncertainty affects the prediction. When generating shapes corresponding to a region of the manifold where the model is highly uncertain, we would expect images corresponding to the average shape, provided the model has captured the characteristics of the data well. As can be seen, the GPDBN clearly generates the average shape, while the other methods fail to capture this characteristic of the data, making their uncertainty challenging to interpret.
The blue points on the manifolds in Fig. 18 show the results for each of the smooth manifold based models. We note that all the manifolds have correctly identified a smooth trajectory for the training data. In addition, all but the GPSBM have captured the periodic repetition by closing the path; this is possibly due to the symmetry in the dataset not reflecting the shared architecture of the SBM.
The red points represent test locations corresponding to the samples in the third row of Fig. 18. Here we see that all models correctly interpolate the overall pattern; however, the Gaussian likelihood of the GPLVM introduces artefacts in the silhouettes that are not found in the results from the GPDBN and GPSBM. The GPLVMDT improves over the GPLVM but still produces blurred results.
Finally, the real power of the GPDBN model is seen when leaving the manifold. The final row of silhouettes shows samples from the orange points, which lie in regions of high predictive variance (low confidence). Both the GPLVM and the GPLVMDT produce completely unreasonable results, whereas the GPDBN captures the uncertainty in the manifold well: we see the average probability of the entire dataset, with the predictive probability correctly captured. The GPDBN results are the mean of a set of samples from the model, and away from the manifold these results correctly approach the mean of the training data. Interestingly, the results for the GPSBM show the asymmetry in the shared weights; leaving the manifold in two different directions averages different regions of the training data.
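The reversion to the training mean away from the data is a general property of Gaussian process predictions with a stationary kernel, and it can be illustrated with plain GP regression. The sketch below is an analogy only, not the GPDBN: it assumes a 1-D input, a unit-variance RBF kernel, a small fixed noise term, and a constant mean function set to the mean of the training targets. All names and values are chosen for illustration.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Unit-variance RBF kernel matrix between 1-D input arrays a and b."""
    sq = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq / lengthscale ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-4):
    """GP regression with a constant mean equal to the training mean."""
    y_mean = y_train.mean()
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_test, x_train)
    alpha = np.linalg.solve(K, y_train - y_mean)
    mean = y_mean + Ks @ alpha
    # Predictive variance of the latent function: k(x*, x*) - k*^T K^{-1} k*.
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, var

x_train = np.linspace(-2.0, 2.0, 9)
y_train = np.sin(x_train) + 3.0

# One test point on the training data, one far away from it.
x_test = np.array([1.0, 10.0])
mean, var = gp_predict(x_train, y_train, x_test)
```

Near the training data the prediction follows the targets with low variance. Far from the data the kernel correlations vanish, so the predictive mean falls back to the training mean and the variance returns to its prior value, which mirrors the average-shape behaviour described above for off-manifold latent points.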