Deep Compressed Sensing

05/16/2019 ∙ by Yan Wu, et al.

Compressed sensing (CS) provides an elegant framework for recovering sparse signals from compressed measurements. For example, CS can exploit the structure of natural images and recover an image from only a few random measurements. CS is flexible and data efficient, but its application has been restricted by the strong assumption of sparsity and costly reconstruction process. A recent approach that combines CS with neural network generators has removed the constraint of sparsity, but reconstruction remains slow. Here we propose a novel framework that significantly improves both the performance and speed of signal recovery by jointly training a generator and the optimisation process for reconstruction via meta-learning. We explore training the measurements with different objectives, and derive a family of models based on minimising measurement errors. We show that Generative Adversarial Nets (GANs) can be viewed as a special case in this family of models. Borrowing insights from the CS perspective, we develop a novel way of improving GANs using gradient information from the discriminator.

1 Introduction

Encoding and decoding are central problems in communication (MacKay, 2003). Compressed sensing (CS) provides a framework that separates encoding and decoding into independent measurement and reconstruction processes (Candes et al., 2006; Donoho, 2006). Unlike commonly used auto-encoding models (Bourlard & Kamp, 1988; Kingma & Welling, 2013; Rezende et al., 2014), which feature end-to-end trained encoder and decoder pairs, CS reconstructs signals from low-dimensional measurements via online optimisation. This model architecture is highly flexible and sample efficient: high dimensional signals can be reconstructed from a few random measurements with little or no training at all. CS has been successfully applied in scenarios where measurements are noisy and expensive to take, such as in MRI (Lustig et al., 2007). Its sample efficiency enables the development of, for example, the “single pixel camera”, which reconstructs a full resolution image from a single light sensor (Duarte et al., 2008).

However, the wide application of CS, especially in processing large scale data where modern deep learning approaches thrive, is hindered by its assumption of sparse signals and the slow optimisation process for reconstruction. Recently, Bora et al. (2017) combined CS with separately trained neural network generators. Although these pre-trained neural networks were not optimised for CS, they demonstrated reconstruction performance superior to existing methods such as the Lasso (Tibshirani, 1996). Here we propose the deep compressed sensing (DCS) framework in which neural networks can be trained from scratch for both measuring and online reconstruction. We show that this framework leads naturally to a family of models, including GANs (Goodfellow et al., 2014), which can be derived by training the measurement functions with different objectives. In summary, this work contributes the following:

  • We demonstrate how to train deep neural networks within the CS framework.

  • We show that a meta-learned reconstruction process leads to a method that is more accurate and orders of magnitude faster than previous models.

  • We develop a new GAN training algorithm based on latent optimisation, which improves GAN performance. The non-saturated generator loss emerges as a measurement error.

  • We extend our framework to training semi-supervised GANs, and show that latent optimisation results in semantically meaningful latent spaces.

Figure 1: Illustration of Deep Compressed Sensing. $F$ is a measurement process that produces a measurement $\mathbf{m}$ of the signal $\mathbf{x}$, and $G$ is a generator that reconstructs the signal from a latent representation $\hat{\mathbf{z}}$. The latent representation is optimised to minimise a measurement error $E_\theta(\mathbf{m}, \hat{\mathbf{z}})$.

Notations

We use bold letters for vectors and matrices and normal letters for scalars. $\mathbb{E}_{p(x)}[f(x)]$ indicates taking the expectation of $f(x)$ over the distribution $p(x)$. We use subscripts of Greek letters to indicate function parameters. For example, $G_\theta$ is a function parametrised by $\theta$.

2 Background

2.1 Compressed Sensing

Compressed sensing aims to recover a signal $\mathbf{x}$ from a linear measurement $\mathbf{m}$:

$$\mathbf{m} = \mathbf{F}\,\mathbf{x} + \boldsymbol{\eta} \tag{1}$$

where $\mathbf{F}$ is the $C \times D$ measurement matrix, and $\boldsymbol{\eta}$ is the measurement noise, which is usually assumed to be Gaussian distributed. $\mathbf{F}$ is typically a “wide” matrix, such that $C \ll D$. As a result, the measurement $\mathbf{m}$ has much lower dimensionality than the original signal, and solving for $\mathbf{x}$ is generally impossible for such under-determined problems. The elegant CS theory shows that one can nearly perfectly recover $\mathbf{x}$ with high probability given a random matrix $\mathbf{F}$ and sparse $\mathbf{x}$ (Donoho, 2006; Candes et al., 2006). In practice, the requirement that $\mathbf{x}$ be sparse can be replaced by sparsity in a set of basis $\boldsymbol{\Phi}$, such as the Fourier basis or wavelets, so that $\boldsymbol{\Phi}\mathbf{x}$ can be non-sparse signals such as natural images. Here we omit the basis $\boldsymbol{\Phi}$ for brevity; the linear transform from $\boldsymbol{\Phi}$ does not affect our following discussion.

At the centre of CS theory is the Restricted Isometry Property (RIP)¹, which is defined for $\mathbf{F}$ and the difference between signals $\mathbf{x}_1 - \mathbf{x}_2$ as

$$(1 - \delta)\,\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 \;\le\; \|\mathbf{F}(\mathbf{x}_1 - \mathbf{x}_2)\|_2^2 \;\le\; (1 + \delta)\,\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 \tag{2}$$

where $\delta \in (0, 1)$ is a small constant. The RIP states that the projection from $\mathbf{F}$ preserves the distance between two signals bounded by factors of $1 - \delta$ and $1 + \delta$. This property holds with high probability for various random matrices $\mathbf{F}$ and sparse signals $\mathbf{x}$. It guarantees that minimising the measurement error

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|\mathbf{m} - \mathbf{F}\,\mathbf{x}\|_2^2 \tag{3}$$

under the constraint that $\mathbf{x}$ is sparse leads to accurate reconstruction with high probability (Donoho, 2006; Candes et al., 2006). This constrained optimisation problem is computationally intensive, a price paid for the measuring process that only requires sparse random projections of the signals (Baraniuk, 2007).

¹ The theory can also be proved from the closely related and more general Restricted Eigenvalue condition (Bora et al., 2017). We focus on the RIP in this form for its more straightforward connection with the training loss (see section 3.1).
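The distance-preserving behaviour described by eq. 2 can be checked empirically. The following is a minimal sketch, assuming a Gaussian random matrix with entries of variance 1/C and randomly generated sparse signals; the dimensions, sparsity level and function names are illustrative choices, not values from the paper.

```python
import math
import random


def random_projection(rows, cols, rng):
    # Gaussian entries with variance 1/rows, so that E||F v||^2 = ||v||^2.
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def sparse_signal(dim, nonzeros, rng):
    # A signal with only `nonzeros` non-zero coordinates.
    x = [0.0] * dim
    for idx in rng.sample(range(dim), nonzeros):
        x[idx] = rng.gauss(0.0, 1.0)
    return x


def sq_norm(v):
    return sum(a * a for a in v)


def project(F, v):
    return [sum(f * a for f, a in zip(row, v)) for row in F]


def distance_ratios(dim=200, measurements=60, nonzeros=5, pairs=50, seed=0):
    # Ratio ||F(x1 - x2)||^2 / ||x1 - x2||^2 for random pairs of sparse signals.
    rng = random.Random(seed)
    F = random_projection(measurements, dim, rng)
    ratios = []
    for _ in range(pairs):
        x1 = sparse_signal(dim, nonzeros, rng)
        x2 = sparse_signal(dim, nonzeros, rng)
        diff = [a - b for a, b in zip(x1, x2)]
        ratios.append(sq_norm(project(F, diff)) / sq_norm(diff))
    return ratios


ratios = distance_ratios()
# The smallest and largest ratios play the roles of 1 - delta and 1 + delta.
print(min(ratios), max(ratios))
```

The spread of the printed ratios around 1 gives an empirical feel for the constant in eq. 2 on this toy problem; it is not a proof that the RIP holds for the sampled matrix.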

2.2 Compressed Sensing using Generative Models

The requirement of sparsity poses a strong restriction on CS. Sparse bases, such as the Fourier basis or wavelets, only partially relieve this constraint, since they are restricted to domains known to be sparse in these bases and cannot adapt to data distributions. Recently, Bora et al. (2017) proposed compressed sensing using generative models (CSGM) to relax this requirement. This model uses a pre-trained deep neural network $G_\theta$ (from a VAE or GAN) as the structural constraint in the place of sparsity. This generator maps a latent representation $\mathbf{z}$ to the signal space:

$$\mathbf{x} = G_\theta(\mathbf{z}) \tag{4}$$

Instead of requiring sparse signals, $G_\theta$ implicitly constrains the output $\mathbf{x}$ to a low-dimensional manifold via its architecture and the weights adapted from data. This constraint is sufficient to provide a generalised Set-Restricted Eigenvalue Condition (S-REC) with random matrices, under which low reconstruction error can be achieved with high probability. A minimisation process similar to that in CS is used for reconstruction:

$$\hat{\mathbf{z}} = \arg\min_{\mathbf{z}} E_\theta(\mathbf{m}, \mathbf{z}) \tag{5}$$
$$E_\theta(\mathbf{m}, \mathbf{z}) = \|\mathbf{m} - \mathbf{F}\,G_\theta(\mathbf{z})\|_2^2 \tag{6}$$

such that $\hat{\mathbf{x}} = G_\theta(\hat{\mathbf{z}})$ is the reconstructed signal. In contrast to directly optimising the signal $\mathbf{x}$ in CS (eq. 3), here the optimisation is in the space of the latent representation $\mathbf{z}$.

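The $\arg\min$ operator in eq. 5 is intractable since $E_\theta(\mathbf{m}, \mathbf{z})$ is highly non-convex. It is therefore approximated using gradient descent starting from a randomly sampled point $\hat{\mathbf{z}} \sim p_{\mathbf{z}}(\mathbf{z})$:

$$\hat{\mathbf{z}} \leftarrow \hat{\mathbf{z}} - \alpha \left.\frac{\partial E_\theta(\mathbf{m}, \mathbf{z})}{\partial \mathbf{z}}\right|_{\mathbf{z} = \hat{\mathbf{z}}} \tag{7}$$

where $\alpha$ is a learning rate. One can take a specified number of gradient descent steps, $T$. Typically, hundreds or thousands of gradient descent steps and several re-starts from the initial step are needed to obtain a sufficiently good $\hat{\mathbf{z}}$ (Bora et al., 2017; Bojanowski et al., 2018). This process is illustrated in Figure 1.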
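The latent optimisation loop can be sketched on a toy problem. The example below assumes a linear generator G(z) = W z in place of a neural network, so that the measurement error is a simple quadratic in z and its gradient can be written by hand; the matrices, sizes and the step-size rule are illustrative assumptions rather than the setup of Bora et al. (2017).

```python
import random


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def measurement_error(A, m, z):
    # E(m, z) = ||m - F G(z)||^2 with the composite linear map A = F W.
    return sum((mi - pi) ** 2 for mi, pi in zip(m, matvec(A, z)))


def latent_optimisation(A, m, z0, steps):
    # Step size 1 / (2 ||A||_F^2) is no larger than 1/L for this quadratic,
    # so a gradient step cannot increase the error.
    lr = 1.0 / (2.0 * sum(a * a for row in A for a in row))
    At = transpose(A)
    z, errors = list(z0), [measurement_error(A, m, z0)]
    for _ in range(steps):
        residual = [mi - pi for mi, pi in zip(m, matvec(A, z))]
        grad = [-2.0 * g for g in matvec(At, residual)]  # dE/dz
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
        errors.append(measurement_error(A, m, z))
    return z, errors


rng = random.Random(0)
D, C, K = 20, 5, 3  # signal, measurement and latent sizes (toy values)
W = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(D)]  # toy linear generator
F = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(C)]  # random measurement matrix
A = matmul(F, W)
z_true = [rng.gauss(0, 1) for _ in range(K)]
m = matvec(A, z_true)                     # noiseless measurement of x = W z_true
z0 = [rng.gauss(0, 1) for _ in range(K)]  # randomly sampled starting point
z_hat, errors = latent_optimisation(A, m, z0, steps=200)
print(errors[0], errors[-1])
```

With a real neural network generator the error surface is non-convex, which is why many steps and restarts are typically reported as necessary; the linear toy case only illustrates the update rule itself.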

This work established the connection between compressed sensing and deep neural networks, and demonstrated performance superior to the Lasso (Tibshirani, 1996), especially when the number of measurements is small. The theoretical properties of CSGM have been more closely examined by Hand & Voroninski (2017), who also proved stronger convergence guarantees. More recently, Dhar et al. (2018) proposed additional constraints to allow sparse deviation from the generative model’s support set, thus improving generalisation. However, CSGM still suffers from two restrictions:

  1. The optimisation for reconstruction is still slow, as it requires thousands of gradient descent steps.

  2. It relies on random measurement matrices, which are known to be sub-optimal for highly structured signals such as natural images. Learned measurements can perform significantly better (Weiss et al., 2007).

2.3 Model-Agnostic Meta Learning

Meta-learning, or learning to learn, allows a model to adapt to new tasks by self-improving (Schmidhuber, 1987). Model-Agnostic Meta-Learning (MAML) provides a general method to adapt parameters for a number of tasks (Finn et al., 2017). Given a differentiable loss function $\mathcal{L}(\mathcal{T}_i; \phi)$ for a task $\mathcal{T}_i$ sampled from the task distribution $p_{\text{task}}(\mathcal{T})$, the task-specific parameters $\phi_i$ are adapted by gradient descent from the initial parameters $\phi$:

$$\phi_i \leftarrow \phi - \alpha \nabla_\phi \mathcal{L}(\mathcal{T}_i; \phi) \tag{8}$$

The initial parameters $\phi$ are trained to minimise the loss across all tasks:

$$\min_\phi \; \mathbb{E}_{\mathcal{T}_i \sim p_{\text{task}}(\mathcal{T})}\left[\mathcal{L}(\mathcal{T}_i; \phi_i)\right] \tag{9}$$

Multiple steps and more sophisticated optimisation algorithms can be used in place of eq. 8. Although $\mathcal{L}$ is usually a highly non-convex function, back-propagating through the gradient-descent process means that only a few gradient steps are sufficient to adapt to new tasks.
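Back-propagating through the inner gradient step can be illustrated with a scalar toy problem. The sketch below assumes each task has the quadratic loss 0.5 (φ − c)² with a task-specific target c, so the derivative of the adapted parameter with respect to the initial parameter is available in closed form; the targets, learning rates and function names are hypothetical.

```python
def inner_step(phi, target, alpha):
    # One adaptation step (as in eq. 8) for the toy loss 0.5 * (phi - target)^2.
    return phi - alpha * (phi - target)


def outer_loss(phi, targets, alpha):
    # Average post-adaptation loss across tasks (as in eq. 9).
    total = 0.0
    for target in targets:
        phi_i = inner_step(phi, target, alpha)
        total += 0.5 * (phi_i - target) ** 2
    return total / len(targets)


def meta_gradient(phi, targets, alpha):
    # Chain rule through the inner step:
    # dL(phi_i)/dphi = (phi_i - target) * dphi_i/dphi, where dphi_i/dphi = 1 - alpha.
    grads = []
    for target in targets:
        phi_i = inner_step(phi, target, alpha)
        grads.append((phi_i - target) * (1.0 - alpha))
    return sum(grads) / len(grads)


def meta_train(targets, alpha=0.1, meta_lr=0.5, steps=200, phi=0.0):
    # Plain gradient descent on the outer loss with respect to the initial parameter.
    for _ in range(steps):
        phi -= meta_lr * meta_gradient(phi, targets, alpha)
    return phi


targets = [-1.0, 0.5, 2.0, 4.5]  # hypothetical task targets
phi_star = meta_train(targets)
print(phi_star)
```

In this toy case the learned initialisation settles at the mean of the task targets, the point from which a single inner step helps every task equally. In practice the chain rule through the inner step is handled by automatic differentiation rather than by hand.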

2.4 Generative Adversarial Networks

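A Generative Adversarial Network (GAN) trains a parametrised generator $G_\theta$ to fool a discriminator $D_\phi$ that tries to distinguish real data from fake data sampled from the generator (Goodfellow et al., 2014). The generator $G_\theta$ is a deterministic function that transforms samples $\mathbf{z}$ from a source $p_{\mathbf{z}}(\mathbf{z})$ to the same space as the data $\mathbf{x}$, which has the distribution $p_{\text{data}}(\mathbf{x})$. This adversarial game can be summarised by the following min-max problem with the value function $V(G_\theta, D_\phi)$:

$$\min_{G_\theta} \max_{D_\phi} V(G_\theta, D_\phi) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}\left[\ln D_\phi(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\left[\ln\left(1 - D_\phi(G_\theta(\mathbf{z}))\right)\right] \tag{10}$$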

GANs are usually difficult to train due to this adversarial game (Balduzzi et al., 2018). Training may either diverge or converge to bad equilibrium with, for example, collapsed modes, unless extra care is taken in designing and training the model (Radford et al., 2015; Salimans et al., 2016).

A widely adopted trick is using $-\ln D_\phi(G_\theta(\mathbf{z}))$ as the objective for the generator (Goodfellow et al., 2014). Compared with eq. 10, this alternative objective avoids saturating the discriminator in the early stage of training when the generator is too weak. However, this objective voids most theoretical analyses (Hu et al., 2018), since the new adversarial objective is no longer a zero-sum game (eq. 10).
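To see why the alternative objective helps early in training, the sketch below compares the gradient of each generator objective with respect to the discriminator's logit. It assumes the discriminator output is a sigmoid of that logit, which is an illustrative assumption, and the function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(logit):
    # d/d(logit) of ln(1 - sigmoid(logit)), the generator term in eq. 10.
    return -sigmoid(logit)


def grad_non_saturating(logit):
    # d/d(logit) of -ln(sigmoid(logit)), the alternative generator objective.
    return -(1.0 - sigmoid(logit))


# A very negative logit means the discriminator confidently rejects the sample,
# which is the typical situation when the generator is still weak.
for logit in (-6.0, -2.0, 0.0, 2.0):
    print(logit, sigmoid(logit), grad_saturating(logit), grad_non_saturating(logit))
```

When the discriminator confidently rejects a generated sample, the gradient of the original term is close to zero, whereas the alternative objective still provides a gradient of magnitude close to one.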

In most GAN models, discriminators become useless after training. Recently, Tao et al. (2018) and Azadi et al. (2019) proposed methods using the discriminator for importance sampling. Our work provides an alternative: our model moves latent representations to areas more likely to generate realistic images as deemed by the discriminator.

3 Deep Compressed Sensing

We start by showing the benefit of combining meta-learning with the model in Bora et al. (2017). We then generalise measurement matrices to parametrised measurement functions, including deep neural networks. While previous work relies on random projections as measurement functions, our approach learns measurement functions by imposing the RIP as a training objective. We then derive two novel models by imposing properties other than the RIP on the measurements, including a GAN model with discriminator-guided latent optimisation, which leads to more stable training dynamics and better results.

3.1 Compressed Sensing with Meta-Learning

We hypothesise that the run-time efficiency and performance in CSGM (Bora et al. 2017, section 2.2), can be improved by training the latent optimisation procedure using meta-learning, by back-propagating through the gradient descent steps (Finn et al., 2017). The latent optimisation procedure for CS models can take hundreds or thousands of steps. By employing meta-learning to optimise this optimisation procedure we aim to achieve similar results with far fewer updates.

To this end, the model parameters, as well as the latent optimisation procedure, are trained to minimise the expected measurement error:

$$\min_\theta \mathcal{L}_G, \quad \text{for } \mathcal{L}_G = \mathbb{E}_{\mathbf{x}_i \sim p_{\text{data}}(\mathbf{x})}\left[E_\theta(\mathbf{m}_i, \hat{\mathbf{z}}_i)\right] \tag{11}$$

where $\hat{\mathbf{z}}_i$ is obtained from gradient descent (eq. 7). The gradient descent in eq. 7 and the loss function in eq. 11 mirror their counterparts in MAML (eq. 8 and 9), except that:

  1. Instead of the stochastic gradient computed in the outside loop, here each measurement error $E_\theta$ only depends on a single sample $\mathbf{z}$, so eq. 7 computes the exact gradient of $E_\theta$.

  2. The online optimisation is over latent variables rather than parameters. There are usually far fewer latent variables than parameters, so the update is quicker.

As in MAML, we implicitly perform second order optimisation, by back-propagating through the latent optimisation steps which compute $\hat{\mathbf{z}}_i$ when optimising eq. 11. We empirically observed that this dramatically improves the efficiency of latent optimisation, with only 3-5 gradient descent steps being sufficient to improve upon baseline methods.

Unlike Bora et al. (2017), we also train the generator $G_\theta$. Merely minimising eq. 11 would fail: the generator can exploit $\mathbf{F}$ by mapping all $G_\theta(\mathbf{z})$ into the null space of $\mathbf{F}$. This trivial solution always gives zero measurement error, but may contain no useful information. Our solution is to enforce the RIP (eq. 2) via training, by minimising the measurement loss:

$$\mathcal{L}_F = \mathbb{E}_{\mathbf{x}_1, \mathbf{x}_2}\left[\left(\|\mathbf{F}(\mathbf{x}_1 - \mathbf{x}_2)\|_2 - \|\mathbf{x}_1 - \mathbf{x}_2\|_2\right)^2\right] \tag{12}$$

$\mathbf{x}_1$ and $\mathbf{x}_2$ can be sampled in various ways. While the choice is not unique, it is important to sample from both the data distribution $p_{\text{data}}(\mathbf{x})$ and generated samples $G_\theta(\mathbf{z})$, so that the trained RIP holds for both real and generated data. In our experiments, we randomly sampled one image from the data and two generated images at the beginning and end of latent optimisation, then computed the average of the losses over the 3 pairs formed by these 3 points, as a form of “triplet loss”.
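The pairwise measurement loss over three points can be sketched as follows. This assumes the loss for each pair is the squared difference between the norm of the projected difference and the norm of the original difference, averaged over the three pairs; the helper names and the toy points are hypothetical.

```python
import itertools
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def matvec(F, v):
    return [sum(f * a for f, a in zip(row, v)) for row in F]


def measurement_loss(F, points):
    # Average over all pairs of (||F(x1 - x2)|| - ||x1 - x2||)^2.
    losses = []
    for x1, x2 in itertools.combinations(points, 2):
        diff = [a - b for a, b in zip(x1, x2)]
        losses.append((norm(matvec(F, diff)) - norm(diff)) ** 2)
    return sum(losses) / len(losses)


# Three toy points standing in for one data sample and two generated samples
# (before and after latent optimisation).
points = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # preserves all distances, so the loss is zero
doubling = [[2.0, 0.0], [0.0, 2.0]]   # doubles all distances, so the loss is positive
print(measurement_loss(identity, points), measurement_loss(doubling, points))
```

A measurement matrix that preserves distances exactly yields zero loss, while any stretching or shrinking of pairwise distances is penalised.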

Our algorithm is summarised in Algorithm 1. Since Algorithm 1 still uses a random measurement matrix $\mathbf{F}$, it can be used as any other CS algorithm when ground truth reconstructions are available for training the generator.

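  Input: minibatches of data $\{\mathbf{x}_i\}_{i=1}^{N}$, random matrix $\mathbf{F}$, generator $G_\theta$, learning rate $\alpha$, number of latent optimisation steps $T$
  Initialize generator parameters $\theta$
  repeat
     for $i = 1$ to $N$ do
        Measure the signal $\mathbf{m}_i \leftarrow \mathbf{F}\,\mathbf{x}_i$
        Sample $\hat{\mathbf{z}}_i \sim p_{\mathbf{z}}(\mathbf{z})$
        for $t = 1$ to $T$ do
           Optimise $\hat{\mathbf{z}}_i \leftarrow \hat{\mathbf{z}}_i - \alpha \, \partial E_\theta(\mathbf{m}_i, \mathbf{z}) / \partial \mathbf{z} \, |_{\mathbf{z} = \hat{\mathbf{z}}_i}$
        end for
     end for
     $\mathcal{L}_G = \frac{1}{N} \sum_{i=1}^{N} E_\theta(\mathbf{m}_i, \hat{\mathbf{z}}_i)$
     Compute $\mathcal{L}_F$ using eq. 12
     Update $\theta \leftarrow \theta - \frac{\partial}{\partial \theta}\left(\mathcal{L}_G + \mathcal{L}_F\right)$
  until reaches the maximum training steps
Algorithm 1 Compressed Sensing with Meta Learning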

3.2 Deep Compressed Sensing with Learned Measurement Function

In Algorithm 1, we use the RIP property to train the generator. We can use the same approach and enforce the RIP property to learn the measurement function itself, rather than using a random projection.

3.2.1 Learning Measurement Function

We start by generalising the measurement matrix $\mathbf{F}$ (eq. 1), and define a parametrised measurement function $\mathbf{m} \leftarrow F_\phi(\mathbf{x})$. The model introduced in the previous section corresponds to a linear function $F_\phi(\mathbf{x}) = \mathbf{F}\,\mathbf{x}$; now both $F_\phi$ and $G_\theta$ can be deep neural networks. Similar to CS, the central problem in this generalised setting is inverting the measurement function to recover the signal $\mathbf{x}$ via minimising the measurement error similar to eq. 6:

$$E_\theta(\mathbf{m}, \mathbf{z}) = \|\mathbf{m} - F_\phi(G_\theta(\mathbf{z}))\|_2^2 \tag{13}$$

The distance preserving property as a counterpart of the RIP can be enforced by minimising a loss similar to eq. 12:

$$\mathcal{L}_F = \mathbb{E}_{\mathbf{x}_1, \mathbf{x}_2}\left[\left(\|F_\phi(\mathbf{x}_1 - \mathbf{x}_2)\|_2 - \|\mathbf{x}_1 - \mathbf{x}_2\|_2\right)^2\right] \tag{14}$$

Minimising $\mathcal{L}_F$ provides a relaxation of the constraint specified by the RIP (eq. 2). When $\mathcal{L}_F$ is small, the projection from $F_\phi$ better preserves the distance between $\mathbf{x}_1$ and $\mathbf{x}_2$. This relaxation enables us to transform the RIP into a training objective for the measurements, which can then be integrated into training other model components. Empirically, we found this relaxation leads to high quality reconstruction.

The rest of the algorithm is identical to Algorithm 1, except that we also update the measurement function’s parameters $\phi$. Consequently, different schemes can be employed to coordinate updating $\theta$ and $\phi$, which will be discussed more in section 3.3. This extended algorithm is summarised in Algorithm 2. We call it Deep Compressed Sensing (DCS) to emphasise that both the measurement and reconstruction can be deep neural networks. Next, we turn to generalising the measurements to properties other than the RIP.

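  Input: minibatches of data $\{\mathbf{x}_i\}_{i=1}^{N}$, measurement function $F_\phi$, generator $G_\theta$, learning rate $\alpha$, number of latent optimisation steps $T$
  Initialize generator parameters $\theta$
  repeat
     for $i = 1$ to $N$ do
        Measure the signal $\mathbf{m}_i \leftarrow F_\phi(\mathbf{x}_i)$
        Sample $\hat{\mathbf{z}}_i \sim p_{\mathbf{z}}(\mathbf{z})$
        for $t = 1$ to $T$ do
           Optimise $\hat{\mathbf{z}}_i \leftarrow \hat{\mathbf{z}}_i - \alpha \, \partial E_\theta(\mathbf{m}_i, \mathbf{z}) / \partial \mathbf{z} \, |_{\mathbf{z} = \hat{\mathbf{z}}_i}$
        end for
     end for
     $\mathcal{L}_G = \frac{1}{N} \sum_{i=1}^{N} E_\theta(\mathbf{m}_i, \hat{\mathbf{z}}_i)$
     Compute $\mathcal{L}_F$ using eq. 14
     Option 1 : joint update $\theta \leftarrow \theta - \frac{\partial}{\partial \theta}\left(\mathcal{L}_G + \mathcal{L}_F\right)$, $\phi \leftarrow \phi - \frac{\partial}{\partial \phi}\left(\mathcal{L}_G + \mathcal{L}_F\right)$
     Option 2 : alternating update $\theta \leftarrow \theta - \frac{\partial}{\partial \theta}\mathcal{L}_G$
                       $\phi \leftarrow \phi - \frac{\partial}{\partial \phi}\mathcal{L}_F$
  until reaches the maximum training steps
Algorithm 2 Deep Compressed Sensing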

3.2.2 Generalised CS 1: CS-GAN

Here we consider an extreme case: a one-dimensional measurement that only encodes how likely an input is a real data point or fake one sampled from the generator. One way to formulate this is to train the measurement function using the following loss instead of eq. 14:

(15)

Algorithm 2 then becomes the Least Squares Generative Adversarial Nets (LSGAN, Mao et al., 2017) with latent optimisation; the two are exactly equivalent when the latent optimisation is disabled ($T = 0$, zero steps). LSGAN is an alternative to the original GAN (Goodfellow et al., 2014) that can be motivated from the Pearson divergence. To demonstrate a closer connection with the original GANs (Goodfellow et al., 2014), we instead focus on another formulation whose measurement function is a binary classifier (the discriminator).

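This is realised by using a binary classifier $D_\phi$ as the measurement function, where we can interpret $D_\phi(\mathbf{x})$ as the probability that $\mathbf{x}$ comes from the dataset. In this case, the measurement function is equivalent to the discriminator in GANs. Consequently, we change the squared-loss in eq. 13 to the cross-entropy loss as the matching measurement loss function (Bishop, 2006) (ignoring the expectation over $\mathbf{x}$ for brevity):

$$\mathcal{L}_F = -\left[t(\mathbf{x}) \ln D_\phi(\mathbf{x}) + (1 - t(\mathbf{x})) \ln\left(1 - D_\phi(\mathbf{x})\right)\right] \tag{16}$$

where the binary scalar $t(\mathbf{x})$ is an indicator function that identifies whether $\mathbf{x}$ is a real data point:

$$t(\mathbf{x}) = \begin{cases} 1 & \mathbf{x} \sim p_{\text{data}}(\mathbf{x}) \\ 0 & \mathbf{x} \sim G_\theta(\hat{\mathbf{z}}) \end{cases} \tag{17}$$

Similarly, a cross-entropy measurement error is employed to quantify the discrepancy between $D_\phi(G_\theta(\mathbf{z}))$ and the scalar measurement $m = D_\phi(\mathbf{x})$:

$$E_\theta(m, \mathbf{z}) = -\left[m \ln D_\phi(G_\theta(\mathbf{z})) + (1 - m) \ln\left(1 - D_\phi(G_\theta(\mathbf{z}))\right)\right] \tag{18}$$

At the minimum of $\mathcal{L}_F$ (eq. 16), the optimal measurement function is achieved by the perfect classifier:

$$D_\phi(\mathbf{x}) = \begin{cases} 1 & \mathbf{x} \sim p_{\text{data}}(\mathbf{x}) \\ 0 & \mathbf{x} \sim G_\theta(\hat{\mathbf{z}}) \end{cases} \tag{19}$$

We can therefore simplify eq. 18 by replacing $m$ with its target value $1$, as in teacher-forcing (Williams & Zipser, 1989):

$$E_\theta(m, \mathbf{z}) = -\ln D_\phi(G_\theta(\mathbf{z})) \tag{20}$$

This objective recovers the vanilla GAN formulation with the commonly used alternative loss (Goodfellow et al., 2014), which we derived as a measurement error. When latent optimisation is disabled ($T = 0$), Algorithm 2 is identical to a vanilla GAN.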
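The simplification step can be checked numerically. The sketch below assumes the measurement error is the standard binary cross-entropy between a scalar target and the discriminator's output on a generated sample, and confirms that setting the target to 1 leaves only the negative log of the discriminator output; the function name and probabilities are illustrative.

```python
import math


def cross_entropy_error(m, d):
    # Binary cross-entropy between the target m and the predicted probability d,
    # where d stands for the discriminator output on a generated sample.
    return -(m * math.log(d) + (1.0 - m) * math.log(1.0 - d))


d = 0.3  # hypothetical discriminator output for a generated sample
# With the target fixed at 1, only the -ln(d) term remains.
print(cross_entropy_error(1.0, d), -math.log(d))
```

Under this assumption, lowering the measurement error with the target fixed at 1 is the same as raising the discriminator's belief that the generated sample is real.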

In our experiments (section 4.2), we observed that the additional latent optimisation steps introduced from the CS perspective significantly improved GAN training. We hypothesise that this is because latent optimisation moves the representation to areas more likely to generate realistic images as deemed by the discriminator. Since the gradient descent process remains local, the latent representations are still spread broadly in latent space, which avoids mode collapse. Although a sufficiently powerful generator $G_\theta$ can transform the source $p_{\mathbf{z}}(\mathbf{z})$ into an arbitrarily complex distribution, a more informative source, as implicitly manifested from the optimised $\hat{\mathbf{z}}$, may significantly reduce the complexity required for $G_\theta$, thus striking a better trade-off in terms of the overall computation.

3.2.3 Generalised CS 2: Semi-supervised GANs

So far, we have shown two extreme cases of Deep Compressed Sensing: in one case, the distance preserving measurements (section 3.2.1) essentially encode all information for recovering the original signals; in the other, the CS-GAN (section 3.2.2) has one-dimensional measurements that only indicate whether signals are real or fake. We now seek a middle ground, by using measurements that preserve class information for labelled data.

We generalise CS-GAN by replacing the binary classifier (discriminator) $D_\phi$ with a multi-class classifier $C_\phi$. For data with $K$ classes, this classifier outputs $K + 1$ classes, with the $(K+1)$’th class reserved for “fake” data that comes from the generator. This specification is the same as the classifier used in semi-supervised GANs (SGANs; Salimans et al., 2016). Consequently, we extend the binary indicator function in eq. 17 to a multi-class indicator, so that its $k$’th element $t_k(\mathbf{x}) = 1$ when $\mathbf{x}$ is in class $k$. The $k$’th output of the classifier indicates the predicted probability that $\mathbf{x}$ is in the $k$’th class, and the multi-class cross-entropy loss is used for the measurement loss and measurement error:

(21)
(22)

When latent optimisation is disabled ($T = 0$), the model is similar to other semi-supervised GANs (Salimans et al., 2016; Odena et al., 2017). However, when $T > 0$ the online optimisation moves latent representations towards regions representing particular classes. This provides a novel way of training conditional GANs.
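The multi-class target with one extra class for generated samples can be sketched as follows. The example assumes labels are integers from 0 to K − 1 for real data and that generated samples carry no label; the function names and the probability vector are hypothetical.

```python
import math


def one_hot_target(label, num_classes):
    # Real data with label k (0-based) maps to index k; generated data
    # (label None) maps to the extra last index reserved for "fake".
    target = [0.0] * (num_classes + 1)
    index = num_classes if label is None else label
    target[index] = 1.0
    return target


def cross_entropy(target, probs):
    # Multi-class cross-entropy between a one-hot target and predicted probabilities.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


# Toy example with 3 real classes plus 1 "fake" class.
probs = [0.1, 0.2, 0.6, 0.1]  # hypothetical classifier output
print(cross_entropy(one_hot_target(2, 3), probs))     # sample labelled as class 2
print(cross_entropy(one_hot_target(None, 3), probs))  # generated sample
```

Using the target of a chosen real class as the measurement for a generated sample is what would steer latent optimisation towards the region of latent space associated with that class.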

Compared with conditional GANs which concatenate labels to latent variables (Mirza & Osindero, 2014), optimising latent variables is more adaptive and uses information from the entire model. Compared with Batch-Norm based methods (Miyato & Koyama, 2018), the information for conditioning is presented in the target measurements, and does not need to be trained as Batch-Norm statistics (Ioffe & Szegedy, 2015). Since both of these methods use separate sources (label inputs or batch statistics) to provide the condition, their latent variables tend to retain no information about the condition. Our model, on the other hand, distils the condition information into the latent representation, which results in semantically meaningful latent space (Figure 5).


Model      Property             Loss
CS         RIP                  N/A
DCS        trained RIP          eq. 14
CS-GAN     validity preserving  eq. 16
CS-SGAN    class preserving     eq. 21
Table 1: A family of DCS models differentiated by the properties of measurements in comparison with CS. The CS measurement matrix does not need training, so it does not have a training loss.

3.3 Optimising Models

The three models we derived as examples in the DCS framework are summarised in Table 1 alongside CS. The main difference between them lies in the training objective used for the measurement functions, $\mathcal{L}_F$. Once $\mathcal{L}_F$ is specified, the generator objective $\mathcal{L}_G$, in the form of a measurement error, can be derived accordingly. When $\mathcal{L}_F$ and $\mathcal{L}_G$ are adversarial, such as in the CS-GAN, $F_\phi$ and $G_\theta$ need to be optimised separately as in GANs. This is implemented as the alternating update option in Algorithm 2. In optimising the latent variables (eq. 7), we normalise $\hat{\mathbf{z}}$ after each gradient descent step, as in Bojanowski et al. (2018). We treat the step size $\alpha$ in latent optimisation as a parameter and back-propagate through it in optimising the model loss function. An additional technique we found useful in stabilising CS-GAN training is to penalise the distance $\hat{\mathbf{z}}$ moves as an optimisation cost and add it to $\mathcal{L}_G$:

(23)

where is a scalar controlling the strength of this regulariser. This regulariser encourages small moves of in optimisation, and can be interpreted as approximating an optimal transport cost (Villani, 2008). We found a range of from to made little difference in training, and used in our experiments with CS-GAN.
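The normalisation and the movement penalty can be sketched as below. Two assumptions are made for illustration: normalisation projects the latent vector onto the unit sphere, and the penalty is the squared Euclidean distance between the initial and optimised latent vectors scaled by the regulariser strength. Neither the exact form nor the function names are taken from the paper.

```python
import math


def normalise(z):
    # Assumed normalisation: rescale z to unit Euclidean norm.
    length = math.sqrt(sum(v * v for v in z))
    return [v / length for v in z]


def latent_step(z, grad, step_size):
    # One gradient descent step on the latent vector followed by normalisation.
    return normalise([v - step_size * g for v, g in zip(z, grad)])


def move_penalty(z_start, z_end, strength):
    # Assumed regulariser: strength times the squared distance moved in latent space.
    return strength * sum((a - b) ** 2 for a, b in zip(z_start, z_end))


z0 = normalise([3.0, 4.0])
z1 = latent_step(z0, grad=[0.5, -0.5], step_size=0.1)  # hypothetical gradient
print(z1, move_penalty(z0, z1, strength=1.0))
```

In a full model the penalty would be added to the generator loss, so that gradients flowing back through the latent steps also discourage large moves.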

4 Experiments

4.1 Deep Compressed Sensing for Reconstruction

We first evaluate the DCS model using the MNIST (Yann et al., 1998) and CelebA (Liu et al., 2015) datasets. To compare with the approach in Bora et al. (2017), we used the same generators as in their model. For the measurement functions, we considered both linear projections and neural networks. We considered both random projections and trained measurement functions, while the generator was always trained jointly with the latent optimisation process. Unless otherwise specified, we use 3 gradient descent steps for latent optimisation. More details, including hyperparameter values, are reported in the Appendix. Our code will be available at https://github.com/deepmind/deep-compressed-sensing.

Tables 2 and 3 summarise the results from our models as well as the baseline model from Bora et al. (2017). The reconstruction loss for the baseline model is estimated from Figure 1 in Bora et al. (2017). DCS performs significantly better than the baseline. In addition, while the baseline model used hundreds or thousands of gradient-descent steps with several re-starts, we only used 3 steps without any re-starting, achieving orders of magnitude higher efficiency. Interestingly, for fixed $F_\phi$, random linear projections outperformed neural networks as the measurement functions in both datasets across different neural network structures (rows 2 and 3 of Tables 2 and 3). This empirical result is consistent with the optimality of random projections described in the compressed sensing literature and the more general Johnson-Lindenstrauss lemma (Donoho, 2006; Candes et al., 2006; Johnson & Lindenstrauss, 1984).

The advantage of neural networks manifested when $F_\phi$ was optimised; this variant reached the best performance in all scenarios. As argued by Weiss et al. (2007), we observed that random projections are sub-optimal for highly structured signals such as images, as seen in the improved performance when optimising the measurement matrices (row 4 of Tables 2 and 3). The reconstruction performance was further improved when the linear measurement projections were replaced by neural networks (row 5 of Tables 2 and 3). Examples of reconstructed MNIST images from different models are shown in Figure 2.

Unlike autoencoder-based methods, our models were not trained with any pixel reconstruction loss, which we only use for testing. Despite this, our results are comparable with the recently proposed “Uncertainty Autoencoders” (Grover & Ermon, 2018). We have worse MNIST reconstructions: with 10 and 25 measurements, our best model achieved per-image reconstruction errors of 5.3 and 3.4, compared with their 3.8 and 2.5 (estimated from Figure 4). However, we achieved better CelebA results: with 20 and 50 measurements, we have errors of 23.4 and 18.5 compared with their 27 and 22 (estimated from Figure 6).


Model 10 25 measurements steps
Baseline
Linear 3
NN 3
Linear(L) 3
NN(L) 3
Table 2: Reconstruction loss on MNIST test data using different measurement functions. All rows except the first are from our models. “±” shows the standard deviation across test samples. (L) indicates learned measurement functions. Lower is better.


Model 20 50 measurements steps
Baseline
Linear 3
NN 3
Linear(L) 3
NN(L) 3
Table 3: Reconstruction loss on CelebA test data using different measurement functions. All rows except the first are from our models. “±” shows the standard deviation across test samples. (L) indicates learned measurement functions. Lower is better.

Figure 2: Reconstructions using 10 measurements from random linear projection (top), trained linear projection (middle), and trained neural network (bottom).

4.2 CS-GANs

To evaluate our proposed CS-GANs, we first trained a small model on MNIST to demonstrate intuitively the advantage of latent optimisation. For quantitative evaluation, we trained larger and more standard models on CIFAR10 (Krizhevsky & Hinton, 2009), and evaluated them using the Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017). To our knowledge, latent optimisation has not previously been used to improve GANs, so our approach is orthogonal to existing methods such as Arjovsky et al. (2017); Miyato et al. (2018). We first compare our model with vanilla GANs, which are a special case of the CS-GAN (see section 3.2.2).

We use the same MNIST architectures as in section 4.1, but change the measurement function to a GAN discriminator (section 3.2.2). We use the alternating update option in Algorithm 2 in this setting. All other hyper-parameters are the same as in previous experiments. We use this relatively weak model to reveal failure modes as well as advantages of the CS-GAN. Figure 3 shows samples from models with the same setting but different numbers of latent optimisation iterations. The three panels show samples from models using 0, 3 and 5 gradient descent steps respectively. The model using 0 iterations was equivalent to a vanilla GAN. Models that optimise latent variables exhibit no mode collapse, one of the common failure modes of GAN training.

Figure 3: Samples from CS-GANs using 0 (left), 3 (centre) and 5 (right) gradient descent steps in latent optimisation. The CS-GAN using 0 steps was equivalent to a vanilla GAN.

To confirm this advantage, we more systematically evaluated our method across a range of 144 hyper-parameter settings (similar to Kurach et al. (2018)). We use the CIFAR dataset, which contains various categories of natural images, whose features from an Inception Network (Ioffe & Szegedy, 2015) are meaningful for evaluating the IS and FID. Other than the number of gradient descent steps (0 vs. 3), the model architectures and training procedures were identical. The changes in IS and FID during training are plotted in Figure 4. CS-GANs achieved better performance in both IS and FID, and had less variance across the range of hyper-parameters. The blue horizontal lines at the bottom of Fig. 4 (left) and the top of Fig. 4 (right) show failed vanilla GANs, but none of the CS-GANs diverged in training.


SN-GAN SN-GAN (ours) CS+SN-GAN
IS
FID
Table 4: Comparison with Spectral Normalised GANs.

We also applied our latent optimisation method to Spectral-Normalised GANs (SN-GANs) (Miyato et al., 2018), which use Batch Normalisation (Ioffe & Szegedy, 2015) for the generator and Spectral Normalisation for the discriminator. We compare our model with SN-GAN in Table 4: the SN-GAN column reproduces the numbers from Miyato et al. (2018), and the next column gives the numbers from our replication of the same baseline. Our results demonstrate that deeper architectures, Batch Normalisation and Spectral Normalisation can further improve CS-GAN, and that CS-GAN can improve upon a competitive baseline, SN-GAN.

Figure 4: Inception Score (higher is better) and FID (lower is better) during CIFAR training.

4.3 CS-SGANs

We now experimentally assess our approach of using latent optimisation in semi-supervised GANs, CS-SGAN. We illustrate this extension with the MNIST dataset, and leave it to future work to study other applications. We keep all the hyper-parameters the same as in section 4.2, except for changing the number of measurements to 11: 10 for the MNIST classes and 1 class reserved for generated samples. Samples from CS-SGAN can be seen in Figure 5 (left). Figure 5 (right) illustrates the latent space with t-SNE (Maaten & Hinton, 2008) computed from 2000 random samples, where class labels are colour-coded. The latent space formed separated regions representing different digits. It is impossible to obtain such a clustered latent space in typical conditional GANs (Mirza & Osindero, 2014; Miyato & Koyama, 2018), where labels are supplied as separate inputs while the random source only provides label-independent variations. In contrast, in our model the labels are distilled into the latent representation via optimisation, leading to a more interpretable latent space.

Figure 5: Left: samples from the generative classifier. Right: t-SNE illustration of the generator’s latent space.

5 Discussion

We present a novel framework for combining compressed sensing and deep neural networks. In this framework we trained both the measurement and generation functions, as well as the latent optimisation (i.e., reconstruction) procedure itself via meta-learning. Inspired by Bora et al. (2017), our approach significantly improves upon the performance and speed of reconstruction obtained in that work. In addition, we derived a family of models, including a novel GAN model, by expanding the set of properties we consider for the measurement function (Table 1).

Our method differs from existing algorithms that aim to combine compressed sensing with deep networks in that our approach preserves the online minimisation of measurement errors in generic neural networks. Previous attempts to combine CS and deep learning generally fall into two categories. One category of methods interprets taking compressed measurements and reconstructing from these measurements as an encoding-decoding problem and formulates the model as an autoencoder (Mousavi et al., 2015; Kulkarni et al., 2016; Mousavi et al., 2017; Grover & Ermon, 2018; Mousavi et al., 2018; Lu et al., 2018). Another category of methods is designed to mimic principled iterative CS algorithms using specialised network architectures (Metzler et al., 2017; Sun et al., 2016). In contrast, our framework maintains the separation of measurements from generation but still uses generic neural networks. Therefore, both the measurements and the latent representation of the generator can be flexibly optimised for different, even adversarial, objectives, while taking advantage of powerful neural network architectures.

Moreover, we can train measurement functions with properties that are difficult or impossible to obtain from random or hand-crafted projections, thus broadening the range of problems that can be solved by minimising measurement errors online. In other words, learning the measurement can serve as a useful stepping stone for learning complex tasks where the cost function is difficult to design directly. Our approach can also be interpreted as training implicit generative models, where explicit minimisation of divergences is replaced by statistical tests (Mohamed & Lakshminarayanan, 2016). We have illustrated this idea in the context of relatively simple tasks, but anticipate that complex tasks such as style transfer (Zhu et al., 2017), areas that have already seen applications of CS such as MRI (Lustig et al., 2007), and unsupervised anomaly detection (Schlegl et al., 2017) may further benefit from our approach.

Acknowledgments

We thank Shakir Mohamed and Jonathan Hunt for insightful discussions. We also appreciate the feedback from the anonymous reviewers.

Appendix

Appendix A Experiments Detail

Unless otherwise specified, we used the following default configuration for all experiments. We used the Adam optimiser (Kingma & Ba, 2014) with the learning rate and the parameters , . We trained all the models for steps with the batch size of . Generators used 100-dimensional latent representations. We used 3 gradient-descent steps for latent optimisation, and the initial step size of .
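The latent optimisation step mentioned above can be sketched as a few gradient-descent updates on the measurement error with respect to the latent vector. The sketch below is a toy illustration only, assuming a linear generator G(z) = Wz and a linear measurement function F(x) = Ax so that the gradient has a closed form; the matrices, the step size of 0.05 and all function names are hypothetical. In the paper the step size is learned and gradients are propagated through these updates during training, which this sketch does not do.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def measurement_error(W, A, z, m_target):
    """Squared error ||F(G(z)) - m_target||^2 with G(z) = W z and F(x) = A x."""
    m = matvec(A, matvec(W, z))
    return sum((a - b) ** 2 for a, b in zip(m, m_target))


def latent_optimisation(W, A, m_target, z0, num_steps=3, step_size=0.05):
    """Run a fixed number of gradient-descent steps on the measurement error.

    num_steps=3 follows the default configuration in the text; step_size is a
    hypothetical placeholder, not the value used in the paper.
    """
    z = list(z0)
    W_T, A_T = transpose(W), transpose(A)
    for _ in range(num_steps):
        residual = [a - b for a, b in zip(matvec(A, matvec(W, z)), m_target)]
        # Gradient of the squared error w.r.t. z: 2 * W^T A^T (A W z - m_target).
        grad = [2.0 * g for g in matvec(W_T, matvec(A_T, residual))]
        z = [z_i - step_size * g_i for z_i, g_i in zip(z, grad)]
    return z


# Toy problem: 2-D latent, 3-D signal, 2 measurements (all values hypothetical).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
z_true = [1.0, -2.0]
m_target = matvec(A, matvec(W, z_true))  # measurements of the true signal

z0 = [0.0, 0.0]
z_hat = latent_optimisation(W, A, m_target, z0)
print(measurement_error(W, A, z0, m_target), measurement_error(W, A, z_hat, m_target))
```

With neural-network generators and measurement functions the same loop applies, but the gradient would be obtained by automatic differentiation rather than the closed form used here.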

A.1 Reconstruction Experiments

Following Bora et al. (2017), we used a 2-layer multi-layer perceptron (MLP), with 500 units in each hidden layer and leaky ReLU non-linearity, as the generator for MNIST images; for CelebA, we used the DCGAN generator (Radford et al., 2015). In addition to random linear projections, we tested the following neural networks as the measurement functions: a 2-layer MLP with 500 units in each layer and leaky ReLU non-linearity for MNIST, and the DCGAN discriminator for CelebA.
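As a rough illustration of the MNIST generator described above, the following sketch builds an MLP with two hidden layers of 500 units and leaky ReLU non-linearities, mapping a 100-dimensional latent vector to a 784-dimensional (28×28) image. The leaky ReLU slope of 0.2, the Gaussian initialisation scale, the linear output layer and all function names are assumptions; the text does not specify them.

```python
import random


def leaky_relu(v, slope=0.2):
    """Element-wise leaky ReLU; the slope value is an assumption."""
    return [x if x > 0 else slope * x for x in v]


def init_layer(n_in, n_out, rng, scale=0.05):
    """Random Gaussian weights and zero biases (initialisation is hypothetical)."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, v):
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, biases)]


def init_mlp_generator(latent_dim=100, hidden=500, out_dim=784, seed=0):
    """Two hidden layers of `hidden` units followed by an output layer."""
    rng = random.Random(seed)
    sizes = [latent_dim, hidden, hidden, out_dim]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def generate(layers, z):
    h = z
    for layer in layers[:-1]:
        h = leaky_relu(dense(layer, h))
    # Linear output; the output non-linearity is not stated in the text.
    return dense(layers[-1], h)


layers = init_mlp_generator()
rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(100)]
x = generate(layers, z)
print(len(x))
```

The MNIST measurement function described in the text has the same shape of network, with the image as input and the measurements as output.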

A.2 GAN Experiments

We used the same MLP generator and discriminator (i.e., measurement function) as described in the previous section for MNIST experiments. We also use the same architecture for the semi-supervised GAN.

For the CIFAR dataset, we used the DCGAN architecture with its recommended Adam parameters , (Radford et al., 2015). We tested a number of hyper-parameters as the cross product of the following: generator learning rates , discriminator learning rates , latent variable sizes , mini-batch sizes . Additionally, 2 replicas for each combination were trained to account for the effect of random seeds.
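The cross-product search with 2 replicas per combination can be enumerated as in the sketch below. The grid values are hypothetical placeholders and are not the values used in the paper; the key names and the `enumerate_runs` helper are likewise illustrative.

```python
from itertools import product

# Placeholder grids: hypothetical values, NOT those used in the paper.
grid = {
    "generator_lr": [1e-4, 2e-4],
    "discriminator_lr": [1e-4, 2e-4],
    "latent_size": [64, 128],
    "batch_size": [32, 64],
}
NUM_REPLICAS = 2  # from the text: 2 replicas per combination


def enumerate_runs(grid, num_replicas):
    """Return one config dict per (hyper-parameter combination, seed) pair."""
    names = list(grid)
    runs = []
    for values in product(*(grid[name] for name in names)):
        for seed in range(num_replicas):
            config = dict(zip(names, values))
            config["seed"] = seed
            runs.append(config)
    return runs


runs = enumerate_runs(grid, NUM_REPLICAS)
print(len(runs))  # 2 * 2 * 2 * 2 combinations, times 2 replicas
```

Keeping the seed inside each configuration makes it straightforward to group replicas of the same combination when comparing results.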

To reproduce the Spectral Normalised GANs, we used the same discriminator as in Miyato et al. (2018), which is deeper than the DCGAN discriminator. A grid search over optimisation parameters found that the learning rate of and Adam’s of most stably achieved the best results.

Inception Scores and Fréchet Inception Distances were reported as the averages of 10 evaluations each based on random samples (Salimans et al., 2016; Heusel et al., 2017).
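The averaging protocol above can be sketched as follows: each evaluation draws a fresh set of random samples, computes a score, and the reported number is the mean over 10 evaluations. The metric here is only a toy stand-in, namely the closed-form Fréchet distance between two one-dimensional Gaussians, whereas the actual FID compares Gaussians fitted to Inception network features. The sample count of 1000, the toy sampler and all function names are hypothetical.

```python
import random
import statistics


def frechet_1d(samples, ref_mean=0.0, ref_std=1.0):
    """Frechet distance between a 1-D Gaussian fitted to `samples` and a reference.

    For 1-D Gaussians this reduces to (mu - mu_ref)^2 + (sigma - sigma_ref)^2.
    This is a toy stand-in, not the Inception-feature-based FID.
    """
    mu = statistics.mean(samples)
    sigma = statistics.pstdev(samples)
    return (mu - ref_mean) ** 2 + (sigma - ref_std) ** 2


def averaged_score(sample_fn, score_fn, num_samples, num_evaluations=10, seed=0):
    """Average a score over several evaluations, each on fresh random samples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_evaluations):
        samples = [sample_fn(rng) for _ in range(num_samples)]
        scores.append(score_fn(samples))
    return statistics.mean(scores), statistics.stdev(scores)


def toy_model_sample(rng):
    """Hypothetical 'model' whose samples are shifted from the reference N(0, 1)."""
    return rng.gauss(0.5, 1.0)


mean_score, spread = averaged_score(toy_model_sample, frechet_1d, num_samples=1000)
print(mean_score, spread)
```

Reporting the spread alongside the mean gives a sense of how much of a difference between models could be explained by sampling noise alone.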

References