Degeneration in VAE: in the Light of Fisher Information Loss

Variational Autoencoder (VAE) is one of the most popular generative models, and enormous advances have been made in recent years. Due to the increasing complexity of the raw data and the model architecture, deep networks are needed in VAE models, yet few works discuss their impact. According to our observation, VAE does not always benefit from a deeper architecture: 1) a deeper encoder makes VAE learn more comprehensible latent representations, but results in blurry reconstruction samples; 2) a deeper decoder ensures higher-quality generations, but the latent representations become abstruse; 3) when the encoder and decoder both go deeper, abstruse latent representations occur together with blurry reconstruction samples. In this paper, we derive a Fisher information measure for the corresponding analysis. With this measure, we demonstrate that information loss is ineluctable in feed-forward networks and causes the previous three types of degeneration, especially when the network goes deeper. We also demonstrate that skip connections help preserve the amount of information, and thus propose a VAE enhanced by skip connections, named SCVAE. In the experiments, SCVAE is shown to mitigate the information loss and to achieve promising performance in both encoding and decoding tasks. Moreover, SCVAE can be adapted to other state-of-the-art variants of VAE for further improvement.



1 Introduction

Variational Autoencoder (VAE) (Kingma and Welling, 2013) is a representative generative model that combines variational inference with deep learning, and it has shown great promise in recent years. This is not only because of its strong ability to reason about raw data with meaningful representations, which opens possibilities for downstream tasks such as classification and generation, but also because of its automatic feature learning and inference process built on deep learning techniques.

Nowadays, many variants of VAE have been proposed for unsupervised latent representation learning, adapted to various tasks. One line of research breaks the simple likelihood assumption of the original VAE and introduces autoregressive densities to model the generation sequentially. For example, Gulrajani et al. (2016); van den Oord et al. (2016a, b) propose PixelCNN and PixelRNN to reconstruct the image pixel by pixel, utilizing contextual information to constrain the generation. Another line of research pays attention to the expressiveness of the posterior modeled with VAE. To improve its power, Burda et al. (2015); Tomczak and Welling (2017) introduce more complex hierarchical priors, or transform simple priors into complex ones with normalizing flows and their variants (Rezende and Mohamed, 2015; Kingma et al., 2016; Sønderby et al., 2016). Although previous works make VAE more flexible in adapting to various tasks, they also raise new problems. A typical one is the degeneration that appears when VAE adopts a deeper architecture.

Figure 1: Deeper network’s impacts in VAE models. Upper: representation of latent variable; Lower: ground truth (odd columns) and reconstruction samples (even columns).

As expected, a deeper encoder should be conducive to learning a useful latent code that summarizes the observations well, because of the powerful feature learning capacity of deeper feed-forward networks (Zeiler and Fergus, 2014); meanwhile, a deeper decoder should enable generations of higher quality thanks to the better distribution modeling of deep networks (Gulrajani et al., 2016). However, degeneration is reported to limit the capacity of deep networks (Saxe et al., 2013; He et al., 2016). Unlike in ordinary deep networks, degeneration in VAE occurs in three forms, as shown in Figure 1. Compared to a VAE with a shallow architecture: 1) a deeper decoder enables VAE to produce generations of higher quality, while the latent representation (visualized with t-SNE (Maaten and Hinton, 2008)) fails to provide a high-level summary of the observations; 2) a deeper encoder helps the latent representation capture more global information, while the generation is of low quality; 3) when the encoder and decoder both go deeper, VAE fails in both latent representation and generation. It seems that degeneration does not affect VAE in the same way as it affects ordinary deep networks; rather, it harms the correlation between the input and the latent code.

The above phenomena motivate us to trace back to the connection between the input and the latent code. In this paper, we view the autoencoding process as a process of information transmission. Although previous work such as Zhao et al. (2017) proposes mutual information to enhance the connection between data and latent code, this mutual information becomes more difficult to maintain as the encoder and decoder go deeper. A natural solution is to investigate the information propagation layer by layer, which is, however, difficult for the mutual information measure. Therefore, considering the output of hidden layers as a parametric implicit distribution, we propose a Fisher Information (Brunel and Nadal, 1998) measure to quantify the information loss layer by layer. With this measure, we demonstrate that information loss generally exists in the encoder and decoder, which results in a poor connection between latent code and data and thus leads to the previous three types of degeneration. In addition, we demonstrate that skip connections (He et al., 2016; Huang et al., 2016) can serve as a complementary information flow that helps mitigate the degeneration without increasing the model complexity. Thus a variant, SCVAE, i.e., VAE with skip connections, is proposed to preserve information when the encoder and decoder go deeper. Finally, we conduct a series of experiments on the widely used MNIST dataset. Comprehensive results indicate that our model performs well in information preservation, and thus ensures promising performance in latent representation learning and reconstruction at the same time. Moreover, our model can be adapted to other state-of-the-art VAE models for further improvement.

2 Related work

In this section, we first review recent progress on the Variational Autoencoder, which improves VAE from two perspectives: the expressiveness of the likelihood and the expressiveness of the posterior. Then we review relevant works that address the degeneration of deep networks and its similarity to the degeneration in VAE.

2.1 Expressiveness of Likelihood

Recently, research on more expressive decoders has been conducted to improve the generative performance of VAE. In research that applies VAE to sequence modeling, a powerful decoder is shown to be more expressive (Chung et al., 2015). Numerous works combine recurrent and autoregressive models to achieve a powerful decoder: Germain et al. (2015) proposed MADE, which masks the autoencoder’s parameters to respect autoregressive constraints; Gregor et al. (2015) proposed a recurrent structure to gradually reconstruct observations, focusing on regions of interest. Gulrajani et al. (2016); Salimans et al. (2017) model the dependencies among pixels with autoregressive density estimators, e.g., PixelCNN and PixelRNN (van den Oord et al., 2016b, a), to serve as an expressive conditional distribution.

2.2 Expressiveness of Posterior

Meanwhile, some works focus on augmenting the expressiveness of VAEs to model complex posteriors. To avoid overly simplistic priors, many works propose multimodal distributions such as Gaussian mixtures (Dilokthanakul et al., 2016). An alternative is to first apply a simple prior and then complicate it gradually: normalizing flows (Rezende and Mohamed, 2015; Kingma et al., 2016) are introduced to transform the variational distribution into more complex ones by applying successive invertible smooth transformations. Other methods design hierarchical latent variable structures (Sønderby et al., 2016) to approximate a complex posterior, or deploy auxiliary information such as labels, “Maximum Mean Discrepancy”, pseudo-inputs, etc. to gradually increase the flexibility of the posterior (Kingma et al., 2014; Louizos et al., 2015; Tomczak and Welling, 2017).

2.3 Degeneration in Networks

It is noteworthy that ordinary neural networks and VAE show both similarities and differences when going deeper. He et al. (2016); Huang et al. (2016) point out the vanishing-gradient problem that prevents neural networks from going deeper, and introduce residual connections to help the training of very deep neural networks.

Saxe et al. (2013); Orhan and Pitkow (2018) point out that deep neural networks suffer from degeneration, and claim that degeneration occurs when networks lack sufficient information to support the learning dynamics, which corresponds to the third type of degeneration observed in Figure 1. Apart from this phenomenon, the other two degeneration problems are similar to information preference (Chen et al., 2016; Zhao et al., 2017). In contrast to that setting, a deeper encoder or decoder is supposed to be a more powerful approximator of distributions, yet its expressiveness does not meet our expectation. Fisher Information can be applied to measure the quality of parameters in neural networks, as mentioned in Desjardins et al. (2015); Ollivier (2015). Inspired by these works, we investigate the observed problems from the perspective of a Fisher Information measure and demonstrate the existence of information loss in deep VAE. Skip connections are applied here as a solution for information preservation, not for avoiding the vanishing-gradient problem (He et al., 2016).

3 Degeneration in VAE

In this section, we first give a brief review of VAE. Then we present the observed degeneration problems shown in Figure 1, which obstruct VAE from going deeper.

As we know, the goal of VAE is to explain the data $x$ with the latent variables $z$ by marginalization (Kingma and Welling, 2013):

$$\log p_\theta\big(x^{(1)}, \dots, x^{(N)}\big) = \sum_{i=1}^{N} \log \int p_\theta\big(x^{(i)} \mid z\big)\, p(z)\, dz \qquad (1)$$

where $\theta$ is the parameter of the model and $N$ is the number of datapoints. However, Eq. (1) is usually intractable due to the lack of an analytical form for the integral. The common way to solve this problem is to introduce an evidence lower bound (ELBO):

$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \qquad (2)$$

which is applied as an optimization objective so as to maximize the log-likelihood by introducing an inference model $q_\phi(z \mid x)$ (also called the recognition model) parameterized by $\phi$. $\mathcal{L}$ consists of two terms: the first term fits the data and is called the reconstruction term; the remaining term fits the prior and is called the KL-divergence term. When this lower bound is sufficiently optimized, the log-likelihood is approximately maximized.
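The two ELBO terms described above can be sketched numerically as follows. This is a minimal illustration under assumptions not stated in the text: a diagonal Gaussian posterior, a standard normal prior, a Bernoulli likelihood over binary pixels, and a single sampled latent code. The function names and toy values are hypothetical.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def bernoulli_log_likelihood(x, probs, eps=1e-7):
    """log p(x | z) for binary pixels x, given the decoder's output probabilities."""
    total = 0.0
    for xi, pi in zip(x, probs):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total

def elbo(x, probs, mu, log_var):
    """Reconstruction term minus KL term, for one datapoint and one sample of z."""
    return bernoulli_log_likelihood(x, probs) - gaussian_kl(mu, log_var)

# Toy values (assumed for illustration): 4 binary pixels, 2 latent dimensions.
x = [1.0, 0.0, 1.0, 1.0]
probs = [0.9, 0.2, 0.8, 0.7]            # decoder output for one sampled z
mu, log_var = [0.5, -0.3], [0.0, -1.0]  # encoder output
print("reconstruction:", bernoulli_log_likelihood(x, probs))
print("KL:", gaussian_kl(mu, log_var))
print("ELBO:", elbo(x, probs, mu, log_var))
```

The KL term vanishes exactly when the posterior equals the prior, which is why an encoder that carries no information about the data can still score well on that term alone.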

The advantage of VAE lies in combining variational inference with deep learning. Networks are applied to model the posterior $q_\phi(z \mid x)$ and the conditional likelihood $p_\theta(x \mid z)$, named the encoder and the decoder respectively. When a network goes deeper, its modeling capacity is supposed to be more powerful. Hence, the latent representation and the generation quality are supposed to improve when VAE goes deeper.

However, this conjecture does not hold in light of the three types of degeneration shown in Figure 1. We observe the latent code (visualized with t-SNE (Maaten and Hinton, 2008)) and the generation respectively. The shallow model is a typical vanilla VAE (Kingma and Welling, 2013), and we extend the depth of the encoder and/or decoder of this reference model. When the encoder is deepened, we observe that the visualization of the latent code becomes more compact, which suggests that the model summarizes high-level information well, while the generation is of worse quality. When we only extend the decoder depth, the result is reversed: we observe generation of higher quality, while the latent code seems more abstruse. We thus expect that extending both sides would avoid this imbalance. Unfortunately, both the latent code and the generation then become worse.

Concretely, the third type of degeneration is equivalent to the degeneration in neural networks, which occurs due to the lack of information for learning in networks (Saxe et al., 2013). When VAE degenerates in this way, we observe that it is hardly optimized during training. The other two degeneration problems remind us of the information preference problem (Chen et al., 2016; Zhao et al., 2017). Unlike our work, those works use the mutual information between data and latent code to make the latent code more meaningful. However, referring only to the mutual information between both ends is not enough as the architecture depth increases; the information transmission through the intermediate layers also deserves attention. Moreover, in a deep architecture, the mutual information is intractable for layer-wise computation, since the distribution modeled by hidden layers is implicit, though parametric in most cases. To address these concerns, we propose Fisher Information as a parametric measure, which is discussed in the next section.

4 Fisher Information Loss Analysis

In this section, we first introduce the notion of Fisher Information, which is useful in information theory. Then we illustrate the VAE as an information transmission process and analyze the above degeneration in the light of the Fisher Information.

4.1 Review of Fisher Information

The Fisher Information is an important quantity in information theory and can be applied to measure the quality of parametric estimation of distributions (Brunel and Nadal, 1998).

When we consider a stochastic variable $X$ whose probability density function $p(x \mid \theta)$ is parameterized by $\theta$, we need to estimate the parameter $\theta$ from the measured values of $X$ (named observations). Suppose that the true value of the parameter is $\theta^*$. The estimation corresponds to choosing the density $p(x \mid \theta)$ that minimizes the relative entropy w.r.t. the true distribution $p(x \mid \theta^*)$, measured by a divergence:

$$D(\theta^* \,\|\, \theta) = \int p(x \mid \theta^*) \log \frac{p(x \mid \theta^*)}{p(x \mid \theta)}\, dx \qquad (3)$$

In information theory, the divergence in form (3) is non-negative, convex, and becomes zero when $\theta = \theta^*$. Suppose its second derivative exists. When the divergence reaches its optimum, the first-order derivative is zero, and the second derivative at $\theta^*$ is defined as the Fisher Information (Brunel and Nadal, 1998):

$$F(\theta^*) = \left. \frac{\partial^2 D(\theta^* \,\|\, \theta)}{\partial \theta^2} \right|_{\theta = \theta^*} = -\int p(x \mid \theta^*) \left. \frac{\partial^2 \log p(x \mid \theta)}{\partial \theta^2} \right|_{\theta = \theta^*} dx \qquad (4)$$

Fisher Information is thus not a function of the stochastic variable $X$, but a function of the probability density $p(x \mid \theta)$, and it is useful for parametric estimation of distributions.

One important characteristic of Fisher Information is that larger Fisher Information implies a better understanding of the parameter, which facilitates parameter estimation. Consider the curve of the divergence $D(\theta^* \,\|\, \theta)$ as a function of $\theta$, which is convex: larger Fisher Information makes the curve “steeper” (e.g. when $F(\theta^*) \to \infty$, it becomes a Dirac centered on $\theta^*$), so the optimum $\theta^*$ becomes easier to reach. In this way, Fisher Information reflects the quality of the parameters regarding the approximation between the modeled distribution $p(x \mid \theta)$ and the true distribution $p(x \mid \theta^*)$.
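The link between the curvature of the divergence and Fisher Information can be checked on a case with a known answer. The sketch below assumes a one-dimensional Gaussian with unknown mean and fixed standard deviation, for which the Fisher Information about the mean is $1/\sigma^2$; the function names are hypothetical and the second derivative is taken by finite differences.

```python
def kl_gaussian_mean(theta_true, theta, sigma):
    """KL( N(theta_true, sigma^2) || N(theta, sigma^2) ) in closed form."""
    return (theta - theta_true) ** 2 / (2.0 * sigma ** 2)

def fisher_from_curvature(theta_true, sigma, h=1e-3):
    """Central finite-difference estimate of d^2 KL / d theta^2 at theta = theta_true."""
    def d(t):
        return kl_gaussian_mean(theta_true, t, sigma)
    return (d(theta_true + h) - 2.0 * d(theta_true) + d(theta_true - h)) / (h * h)

# A smaller sigma gives a steeper divergence curve and larger Fisher Information.
for sigma in (0.5, 1.0, 2.0):
    print(sigma, fisher_from_curvature(0.0, sigma), 1.0 / sigma ** 2)
```

The two printed columns should agree up to floating-point error, and the sharper distributions (small `sigma`) are the ones whose mean is easiest to pin down from observations.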

4.2 ELBO as Information Transmission

For simplicity and clarity of the formulation, we make the following two assumptions. First, we suppose that all the stochastic variables are continuous and have a probability density. Second, we suppose that the probability density functions are sufficiently regular, i.e. they are continuously differentiable and tend to zero at infinity (as do their first-order derivatives). (Footnote 1: These hypotheses can be relaxed by mathematical techniques to meet the requirements of real-world situations.)

Suppose that our data is a set of $N$ samples $x^{(1)}, \dots, x^{(N)}$, each a real-valued vector whose length is the dimension of one data sample, and that the latent variable $z$ takes values in a measurable space with its own dimension. According to the objective of the ELBO in Eq. (2), the VAE models and represents the data samples with the following process:

$$x \;\xrightarrow{\;q_\phi(z \mid x)\;}\; z \;\xrightarrow{\;p_\theta(x \mid z)\;}\; \tilde{x}$$

where $\tilde{x}$ is the sample reconstructed from the latent code $z$. This process is implemented as an autoencoder in Kingma and Welling (2013), but that formulation ignores the modeling of hidden layers.

To be more detailed, the impact of the intermediate layers of the encoder and decoder is naturally introduced. Denoting the output of the $l$-th hidden layer as $h_l$ ($l = 1, \dots, L$, where $L$ is the depth of the network), the encoding (resp. decoding) process can be illustrated as:

$$x \to h_1 \to \dots \to h_L \to z \qquad (\text{resp. } z \to h_1 \to \dots \to h_L \to \tilde{x})$$

Since neural networks possess a probabilistic interpretation (Bishop, 2006) (for example, the output of an MLP with linear activation can be interpreted as the mean of a conditional Gaussian distribution with a fixed variance (Pascanu and Bengio, 2013)), we model the output of one hidden layer (e.g. the $l$-th layer) with a stochastic variable $h_l$. Note that since the network is not always an MLP, nor always equipped with linear activations, the distribution of $h_l$ is not necessarily Gaussian, but it can be regarded as an implicit distribution.
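The Gaussian reading of a linear output mentioned above can be made concrete: if the layer output is taken as the mean of a Gaussian with fixed variance, the negative log-likelihood of a target differs from the squared error only by a scale and an additive constant. The sketch below assumes unit variance by default; the helper names and toy numbers are hypothetical.

```python
import math

def linear_layer(x, weights, bias):
    """Output of a linear layer: one value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def gaussian_nll(target, mean, sigma=1.0):
    """-log N(target | mean, sigma^2 I), summed over output dimensions."""
    const = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const + (t - m) ** 2 / (2.0 * sigma ** 2) for t, m in zip(target, mean))

def squared_error(target, mean):
    return sum((t - m) ** 2 for t, m in zip(target, mean))

# Two candidate parameter settings for the same input and target (toy values).
x, target = [1.0, 2.0], [0.5]
mean_a = linear_layer(x, [[0.1, 0.2]], [0.0])
mean_b = linear_layer(x, [[0.3, -0.1]], [0.2])

# With sigma = 1, a difference in NLL is half the difference in squared error.
print(gaussian_nll(target, mean_a) - gaussian_nll(target, mean_b))
print(0.5 * (squared_error(target, mean_a) - squared_error(target, mean_b)))
```

With non-linear activations or other architectures this closed form no longer applies, which is the situation the text describes as an implicit distribution.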

Therefore, based on the detailed auto-encoding process and the probabilistic view of hidden layers, the variational distribution $q_\phi(z \mid x)$ (resp. the generative distribution $p_\theta(x \mid z)$) can be reformulated as:


From the form (5), we can see that the information transmission depends not only on the data $x$ and the latent code $z$, but also on the intermediate layers. In previous work, such as Zhao et al. (2017), mutual information is measured to reinforce the connection between $x$ and $z$. However, the relation between $x$ and $z$ only partially reflects how information evolves through the hidden layers during transmission. In addition, as the architecture becomes deeper, the measure between $x$ and $z$ becomes more complex to compute. These issues make the degeneration even harder to address.

To address these concerns, we move from a non-parametric measure to a parametric one in order to investigate the layer-wise information. Using Fisher Information, we can evaluate the quality of information propagation through the encoder and decoder. Since Fisher Information is a parameter-wise measure, we can evaluate the quality of the variational distribution and the generative distribution w.r.t. the network parameters $\phi$ and $\theta$. In the light of Fisher Information, we analyze the degeneration as a phenomenon of information loss, which is discussed in the next part.

4.3 Degeneration Analysis with Fisher Information

Fisher Information has been applied for efficient gradient backpropagation, and exact computation over layer-wise parameters can be achieved in neural networks (Ollivier, 2015; Desjardins et al., 2015). In VAE models, encoding and decoding networks are applied to compute $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$ (Kingma and Welling, 2013). We further generalize these distributions to take the hidden-layer outputs into account. To investigate the quality of the parameters $\phi$ and $\theta$ in these distributions, Fisher Information is computed over the parameters of the network, and we discover the information loss that causes the degeneration phenomena, as shown in the following proposition.

Proposition 4.3. Suppose we apply a feed-forward network with $L$ layers to approximate a parameterized distribution, so that the network represents an approximated distribution. To evaluate the evolution of information through the network, the Fisher Information in two consecutive layers can be computed and deduced as:


where the gradient factor is the gradient backpropagated through the non-linearity. Note that the network can be either the encoder or the decoder; the corresponding input (resp. parameter) is then the data $x$ or the latent code $z$ (resp. $\phi$ or $\theta$). By the definition in Eq. (4), the Fisher Information passed through one layer can be written as:

Similarly, for the adjacent layer, we compute the Fisher Information by definition and have:


In Eq. (7), one term is proved to be zero in many works of information theory (Brunel and Nadal, 1998; Ly et al., 2017), so we need only consider the first term. Additionally, between successive epochs of gradient descent optimization, the parameter changes by the gradient scaled by the learning rate. Combining these, we finally have:

Using Proposition 4.3, we can interpret the information transmission through the network by evaluating the gradient propagated through the network, as stated in the following remark.

Many works such as Saxe et al. (2013) have reported that the gradient tends to get smaller as we move backward through the hidden layers.

Using Eq. (6), we thus have:

which indicates that deeper layers tend to obtain less information layer by layer.
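The shrinking of backpropagated gradients that this remark relies on can be reproduced in a toy setting. The sketch below is not the measure used in the paper: it assumes a chain of scalar sigmoid units with a squared loss, and it uses the squared per-layer gradient as a crude single-sample proxy for layer-wise Fisher Information. All names and values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, weights):
    """Scalar chain h_l = sigmoid(w_l * h_{l-1}); returns [h_0, ..., h_L]."""
    hs = [x]
    for w in weights:
        hs.append(sigmoid(w * hs[-1]))
    return hs

def weight_gradients(x, weights, target):
    """Gradients of 0.5 * (h_L - target)^2 w.r.t. each weight, by backpropagation."""
    hs = forward(x, weights)
    grads = [0.0] * len(weights)
    delta = hs[-1] - target                      # dLoss / dh_L
    for l in range(len(weights) - 1, -1, -1):
        local = hs[l + 1] * (1.0 - hs[l + 1])    # sigmoid derivative via its output
        grads[l] = delta * local * hs[l]         # dLoss / dw_l
        delta = delta * local * weights[l]       # dLoss / dh_l, passed further back
    return grads

weights = [1.0] * 8
grads = weight_gradients(x=0.5, weights=weights, target=1.0)
squared = [g * g for g in grads]   # crude per-layer proxy, first layer to last layer
for layer, value in enumerate(squared, start=1):
    print(layer, value)
```

In this chain each backward step multiplies the gradient by a sigmoid derivative of at most 0.25, so the proxy decays geometrically as one moves away from the output. Real networks with wide layers and other activations may behave less uniformly.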

It is interesting to notice the difference between a typical feed-forward neural network and VAE from the perspective of information loss. Information loss widely exists in deep neural networks, but it is often ignored. In fact, deep networks are powerful in learning hierarchical features (Zeiler and Fergus, 2014), with a learning process that tends to make features compact and discard superfluous information. This process is widely tolerated in the tasks of ordinary networks, though it risks degeneration in some cases.

However, VAE cannot simply go deeper as ordinary networks do. The loss of information makes it difficult for VAE to reach the true parameters $\theta$ and $\phi$. Recall the ELBO in Eq. (2): both terms show the dependency between the generative model and the inference model. The reconstruction term requires computing the expectation of $\log p_\theta(x \mid z)$ w.r.t. $q_\phi(z \mid x)$; meanwhile, the KL divergence ties $q_\phi(z \mid x)$ to the prior over $z$. Therefore, an inaccurate estimate of either $\theta$ or $\phi$ will upset the model’s learning balance between the two. As a result, when facing information loss, VAE has to choose between a useful latent code and high-quality generation.

5 Fisher Information Preservation

As discussed, information loss is ineluctable in VAE and causes degeneration in deep architectures. A natural solution is thus to preserve information without changing the parameter structure. In this section, we propose a simple but effective way to preserve information in VAE.

Skip connections (He et al., 2016) can skip one or more layers of nonlinear mapping without changing the parameter dimension. Moreover, we show in this section that they act as complementary information flows. Thus, we propose a class of VAE equipped with skip connections, named SCVAE, to preserve information in this way.

As discussed, the output of a hidden layer in the neural network can be regarded as a stochastic variable with a probability distribution. Formally, we use the stochastic variable $h_l$ to model the output of the $l$-th hidden layer, whose probability density function is an implicit density function represented by the network:


When the layer is equipped with a skip connection (represented by a second stochastic variable) that skips one or more layers, its output contains information from both the former layer and the skip connection. The output of this layer is thus represented by a pair of jointly distributed random variables:

We analyze the Fisher Information of this output in order to find out whether skip connections contribute to information preservation; the quantity to evaluate thus becomes the Fisher Information of the joint variables. Since Fisher Information is always greater than or equal to zero, we can expand it as follows by its chain rule (Zegers, 2015) and deduce Proposition 5:


Proposition 5. Suppose the output of a hidden layer receives information from the former layer and outputs the distribution:

Modeled with Fisher Information, when connected with skip connections, this layer receives more information than in the non-skip architecture:

where the skip connection passes information from an earlier layer by skipping one or more layers. We only need to prove that the additional term contributed by the skip connection in Eq. (9) is not zero. According to the chain rule theorem (Zegers, 2015), the equality sign in Eq. (9) holds if and only if the two quantities involved are independent:

From Eq. (8), the two are not independent because they both depend on the same quantity, and we have:

Thanks to Proposition 5, the skip connection can be regarded as a complementary information flow between layers. Following this idea, we propose a VAE model equipped with skip connections, named SCVAE. We let SCVAE skip one or more layers to keep the amount of information as rich as possible. Our model with one-layer skip connections can be described by the following equations:


where each layer applies its own mapping in the neural network, the skipped signal passes through a down-sampling or up-sampling function, and the layer index ranges up to the depth of the inference network. The model could also include long-skipping-distance connections, which skip multiple layers to strengthen the sharing between low-level and high-level features, described as:


where the network in question is either the encoder or the decoder.
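The forward pass of such an architecture can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the skipped signal is combined by addition, all hidden layers share one width so that the resampling function defaults to the identity, and the activation is `tanh`. Every function name here is hypothetical.

```python
import math
import random

def dense(x, weights, bias):
    """One fully connected layer with tanh activation."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward_plain(x, layers):
    h = x
    for weights, bias in layers:
        h = dense(h, weights, bias)
    return h

def forward_skip(x, layers, resample=lambda h: h):
    """One-layer skips: each layer's output is added to its (resampled) input."""
    h = x
    for weights, bias in layers:
        h = [a + s for a, s in zip(dense(h, weights, bias), resample(h))]
    return h

def forward_long_skip(x, layers, resample=lambda h: h):
    """A single long skip: the first hidden output is added to the last one."""
    h = dense(x, *layers[0])
    first = h
    for weights, bias in layers[1:]:
        h = dense(h, weights, bias)
    return [a + s for a, s in zip(h, resample(first))]

rng = random.Random(0)
layers = [init_layer(4, 4, rng) for _ in range(11)]   # 11 hidden layers, width 4
x = [0.2, -0.1, 0.7, 0.4]
print(forward_plain(x, layers))
print(forward_skip(x, layers))
print(forward_long_skip(x, layers))
```

A useful sanity check on the design: if every layer's weights are zero, `forward_plain` outputs zeros while `forward_skip` returns its input unchanged, so the skip path carries the signal even when the nonlinear path contributes nothing. Neither variant adds parameters relative to the plain network.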

In this way, SCVAE is essentially designed for information preservation. Skip connections, as a simple method to preserve information flow, do not increase computational complexity and are compatible with many models, as shown in the experiments.
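The inequality behind Proposition 5, that a pair of variables carries at least as much Fisher Information as one of them alone, can be checked on a toy model. The sketch below is an assumption made for illustration and not the network setting of the paper: two jointly Gaussian variables share the unknown mean `theta`, have equal variance, and have correlation `rho`. The Fisher Information about `theta` is then the sum of the entries of the inverse covariance matrix.

```python
def joint_fisher(rho, sigma=1.0):
    """Fisher Information about theta from (X, Y) ~ N([theta, theta], Sigma),
    with Sigma = sigma^2 * [[1, rho], [rho, 1]] and |rho| < 1."""
    a = sigma ** 2           # variances
    b = rho * sigma ** 2     # covariance
    det = a * a - b * b
    inverse = [[a / det, -b / det], [-b / det, a / det]]
    # 1^T Sigma^{-1} 1: the mean vector has derivative [1, 1] w.r.t. theta.
    return sum(inverse[i][j] for i in range(2) for j in range(2))

def marginal_fisher(sigma=1.0):
    """Fisher Information about theta from X ~ N(theta, sigma^2) alone."""
    return 1.0 / sigma ** 2

for rho in (-0.5, 0.0, 0.5, 0.9):
    print(rho, joint_fisher(rho), marginal_fisher())
```

In this toy model the gain from the second variable shrinks as the two variables become more redundant (`rho` close to 1), which matches the experimental observation that the preservation capacity of a single long skip connection is finite.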

6 Experimental Results

In the following, we conduct experiments to answer three questions:

  • Whether skip connections contribute to information preservation in VAE models, especially in deep VAE architectures.

  • Whether the observed degeneration is mitigated as information is preserved, thus improving the performance of VAE models when going deeper.

  • Whether SCVAE is compatible with other methods to reach further improvement.

The experiments are conducted on the MNIST dataset, which consists of ten categories of 28×28 hand-written digits. We follow the standard 50,000/10,000/10,000 split to partition the dataset into training, validation and test parts.

Plain VAE and SCVAE are implemented using MLPs with layers of 500 hidden units. The shallow VAE model has one hidden layer. When we make only the model’s encoder (resp. decoder) deeper, we denote the model as q++ (resp. p++). Otherwise the encoder has the same depth as the decoder. We also use a model, SCVAE-L, which modifies SCVAE to contain only one long-skipping-distance connection in the encoder, in order to demonstrate the effect of long-skipping-distance connections. For all experiments, the dimension of the latent space is set to 50. Quantitative results are presented as averaged values, and Fisher Information is computed as described in Desjardins et al. (2015).

6.1 Fisher Information Preservation

As discussed in Section 4, information loss is one principal factor that obstructs VAE from going deeper. In this part, we evaluate how the amount of information decays as VAE goes deeper, and the corresponding changes in the encoder and decoder. From this perspective, we present the problems that deep VAE faces and how SCVAE overcomes these limitations to make VAE models deeper.

In the first experiment, we investigate the impact of depth on the amount of information in VAE. We make the encoder and the decoder deeper respectively and compute the mean Fisher Information over layers. Figure 2 presents how the amount of information varies w.r.t. VAE depth. When VAE extends either the encoder or the decoder, the average amount of information keeps decreasing. In contrast to plain VAE, SCVAE generally maintains the amount of information at about the same level. Although the two models have a similar amount of information in the shallow architecture, SCVAE retains a much richer amount of information than plain VAE as the model goes deeper.

Figure 2: Mean Fisher Information w.r.t depth in VAE.
Figure 3: Mean Fisher Information in different VAE models.

In the next experiment, we fix the depth of the deepened part to 11 hidden layers. SCVAE-L keeps the same architecture as SCVAE, except that only one long-skipping-distance connection exists in its encoder, connecting the first and last hidden layers. We compute the average Fisher Information in the encoder and decoder respectively, as shown in Figure 3. Deep VAE retains little information in the model, which corresponds to the third type of degeneration mentioned in Section 4. When the encoder goes deeper, the amount of information mainly decays in the decoder; conversely, the amount of information decays more in the encoder when the decoder goes deeper. This relates to the other two types of degeneration, indicating that blurry samples are caused by the lack of information in the decoder, while abstruse latent representations are caused by the lack of information in the encoder. For SCVAE-L, it is interesting that the amount of information in the encoder is less than in SCVAE but richer than in VAE(p++), which implies that the model leverages the long-skipping-distance connection to augment the amount of information in the encoder, but that its preservation capacity is finite. The advantages of skip connections for information preservation are shown by SCVAE, which maintains the mean amount of information closest to that of the shallow model, as well as the closest ratio between the amount of information in the encoder and in the decoder.

In these two experiments, we verify our claims in Sections 4 and 5. When going deeper, VAE models tend to lose more information. We can thus associate the amount of information with the phenomena in Figure 1. Information loss in the decoder leads to blurry reconstruction samples, while loss in the encoder leads to abstruse latent representations.

6.2 Degeneration mitigation

The previous experiments demonstrate that VAE should go deeper with care because of information loss. In this part, we return to the tasks of VAE, i.e., representation learning and generation, in order to verify whether SCVAE mitigates the degeneration. We evaluate representation learning with classification accuracy: a simple SVM (Support Vector Machine) is trained and tested on the learned latent representation. As for generation, we evaluate with the negative log-likelihood (NLL). All models keep the same architecture as in the previous experiment.

Table 1 suggests that going deeper results in an improvement in one specific task, while the performance on the other task risks degeneration: VAE(q++) achieves a better classification result but sacrifices NLL performance; VAE(p++) performs better in NLL but under-performs in classification. Compared to plain VAE, models with skip connections (SCVAE, SCVAE-L) achieve good performance in both tasks. In particular, SCVAE achieves the best result in both tasks.

Model                                 NLL      Acc
VAE(1L) (Kingma and Welling, 2013)    87.89    0.8421
VAE(11L)                              206.09   0.1135
VAE(q++)                              91.13    0.9352
VAE(p++)                              81.59    0.7120
SCVAE                                 80.19    0.9588
SCVAE-L                               84.02    0.9216
Table 1: Test negative log-likelihood (NLL) and classification accuracy (Acc) on MNIST
Figure 4: Left: representation visualization of raw data, first layer output, intermediate layer output, last layer output of encoder, and latent space (from left to right). Right: ground truth (odd columns) and reconstruction (even columns).

In Figure 4, we present qualitative results of five deep models to give an intuitive understanding of the corresponding performance. As analyzed in the previous part, models suffer from degeneration due to the lack of information in the encoder or decoder. When degeneration occurs in the encoder (VAE, VAE(p++)), the latent representation degenerates layer by layer in the encoder; when it occurs in the decoder (VAE, VAE(q++)), reconstructions are not only blurry but also contain incorrect digits. These phenomena indicate that the degenerated VAE fails to connect global and detailed information. When free from degeneration, SCVAE benefits from the deep architecture to produce clearer reconstructions and to learn a compact representation. Recalling that SCVAE-L contains less information in the encoder than SCVAE and VAE(q++) (Figure 3), we notice that its intermediate layers show abstruse representations, implying that information decays in these layers.

In this experiment, we demonstrate that going deeper can benefit VAE models in specific tasks, provided that information is not lost. Specifically, SCVAE is free from all three types of degeneration, and thus achieves the best performance among the previous models in both representation learning and generation.

6.3 Combination with state-of-the-art

To test the compatibility of deep SCVAE with other advancements in VAE, we combine SCVAE with PixelVAE (Gulrajani et al., 2016) (with 8 pixel layers here) by concatenation, and with VampPrior (Tomczak and Welling, 2017) by substituting VampPrior for the Gaussian prior. SCVAE keeps the same architecture as in the previous part.

In Figure 5, we notice that our Fisher Information measure also reflects the characteristics of the state-of-the-art models: for PixelVAE, we observe that the information in the encoder is negligible compared with the decoder, which relates to the problem of the latent code being ignored in PixelVAE (Chen et al., 2016). VampVAE has a more expressive posterior, and thus performs better in latent coding and maintains more information in the encoder. When combined with these methods, SCVAE improves their strengths, since SCVAE can be regarded as a powerful approximator for the posterior and the likelihood; SCVAE also remedies their shortcomings to a certain extent by providing a larger amount of information.

Table 2 suggests that SCVAE has performance comparable with the state-of-the-art. When combined with VampPrior and PixelVAE, SCVAE reinforces the power of these models, achieving better performance than before. Notably, when combining all these methods, the final model becomes promising and competitive in both representation learning and generation.

Figure 5: Fisher Information measure in advanced models.

Table 2: Combination of SCVAE with state-of-the-art: negative log-likelihood (NLL) and classification accuracy (Acc) on MNIST
Model                           NLL      Acc
SCVAE                           80.19    0.9588
PixelVAE                        79.48    0.5148
VAE(1L) + VampPrior             82.32    0.9628
SCVAE + VampPrior               81.63    0.9839
SCVAE + PixelVAE                79.35    0.7776
SCVAE + PixelVAE + VampPrior    79.26    0.9784

7 Conclusions

In this paper, we investigate how deep architectures affect VAE models. Our observation shows that a deeper architecture does not always benefit VAE performance, due to three types of degeneration. In further analysis with our Fisher Information measure, we discover that information loss is ineluctable in feed-forward networks and harms deep VAE through these degeneration problems. Moreover, skip connections are shown to contribute to information preservation without changing the parameter structure. We thus propose a class of VAEs enhanced by skip connections, named SCVAE, for information preservation and degeneration mitigation.

The experiments demonstrate the following advantages of SCVAE: 1) SCVAE maintains richer information to avoid degeneration when going deeper; 2) SCVAE takes advantage of going deeper and achieves better performance in the tasks of VAE, such as representation learning and generation; 3) SCVAE is compatible with other advanced VAE models for further improvement. Hence, SCVAE is promising for deep VAE and can be regarded as an appropriate design of the model.