
Controllable Variational Autoencoder

Variational Autoencoders (VAEs) and their variants have been widely used in a variety of applications, such as dialog generation, image generation, and disentangled representation learning. However, existing VAE models have limitations in different applications. For example, a VAE easily suffers from KL vanishing in language modeling and from low reconstruction quality in disentangling. To address these issues, we propose a novel controllable variational autoencoder framework, ControlVAE, that combines a controller, inspired by automatic control theory, with the basic VAE to improve the performance of the resulting generative models. Specifically, we design a new non-linear PI controller, a variant of proportional-integral-derivative (PID) control, to automatically tune the hyperparameter (weight) added to the VAE objective, using the output KL-divergence as feedback during model training. The framework is evaluated on three applications: language modeling, disentangled representation learning, and image generation. The results show that ControlVAE can achieve better disentangling and reconstruction quality than existing methods. For language modeling, it not only averts KL vanishing but also improves the diversity of generated text. Finally, we also demonstrate that ControlVAE improves the reconstruction quality of generated images compared to the original VAE.



1 Introduction

This paper proposes a novel controllable variational autoencoder, ControlVAE, that leverages automatic control to precisely control the trade-off between data reconstruction accuracy bounds (from a learned latent representation) and application-specific constraints, such as output diversity or disentangled latent factor representation. Specifically, a controller is designed that stabilizes the value of KL-divergence (between the learned approximate distribution of the latent variables and their true distribution) in the VAE’s objective function to achieve the desired trade-off, thereby improving application-specific performance metrics of several existing VAE models.

The work is motivated by the increasing popularity of VAEs as an unsupervised generative modeling framework that learns an approximate mapping between Gaussian latent variables and data samples when the true latent variables have an intractable posterior distribution  Sohn et al. (2015); Kingma and Welling (2013). Since VAEs can directly work with both continuous and discrete input data Kingma and Welling (2013), they have been widely adopted in various applications, such as image generation Yan et al. (2016); Liu et al. (2017), dialog generation Wang et al. (2019); Hu et al. (2017), and disentangled representation learning Higgins et al. (2017); Kim and Mnih (2018).

Popular VAE applications often involve a trade-off between reconstruction accuracy bounds and some other application-specific goal, effectively manipulated through KL-divergence. For example, in (synthetic) text or image generation, a goal is to produce new original text or images, as opposed to reproducing one of the samples in training data. In text generation, if KL-divergence is too low, output diversity is reduced Bowman et al. (2015), which is known as the KL-vanishing problem. To increase output diversity, it becomes advantageous to artificially increase KL-divergence. The resulting approximation was shown to produce more diverse, yet still authentic-looking outputs. Conversely, disentangled representation learning Denton and others (2017) leverages the observation that KL-divergence in the VAE constitutes an upper bound on information transfer through the latent channels per data sample Burgess et al. (2018). Artificially decreasing KL-divergence (e.g., by increasing its weight in a VAE's objective function, which is known as the $\beta$-VAE) therefore imposes a stricter information bottleneck, which was shown to force the learned latent factors to become more independent (i.e., non-redundant), leading to better disentangling. The above examples suggest that a useful extension of VAEs is one that allows users to exercise explicit control over KL-divergence in the objective function. ControlVAE realizes this extension.

We apply ControlVAE to three different applications: language modeling, disentangling, and image generation. Evaluation results on real-world datasets demonstrate that ControlVAE is able to achieve an adjustable trade-off between reconstruction error and KL-divergence. It can discover more disentangled factors and significantly reduce the reconstruction error compared to the $\beta$-VAE Burgess et al. (2018) for disentangling. For language modeling, it can not only completely avert the KL vanishing problem, but also improve the diversity of generated data. Finally, we also show that ControlVAE improves the quality of synthetic image generation by slightly increasing the value of KL-divergence compared with the original VAE.

2 Preliminaries

The objective function of VAEs consists of two terms: log-likelihood and KL-divergence. The first term tries to reconstruct the input data, while KL-divergence has the desirable effect of keeping the representation of input data sufficiently diverse. In particular, KL-divergence can affect both the reconstruction quality and diversity of generated data. If the KL-divergence is too high, it would affect the accuracy of generated samples. If it is too low, output diversity is reduced, which may be a problem in some applications such as language modeling Bowman et al. (2015) (where it is known as the KL-vanishing problem).

To mitigate KL vanishing, one promising approach is to add an extra hyperparameter $\beta$ to the VAE objective function and to control the KL-divergence by increasing $\beta$ from 0 to 1 with a sigmoid function Bowman et al. (2015) or a cyclic function Liu et al. (2019). These methods, however, blindly change $\beta$ without sampling the actual KL-divergence during model training. Using a similar methodology, researchers recently developed the $\beta$-VAE Higgins et al. (2017); Burgess et al. (2018) to learn disentangled representations by controlling the value of KL-divergence. However, $\beta$-VAE suffers from high reconstruction errors Kim and Mnih (2018), because it adds a very large $\beta$ to the VAE objective, so the model tends to focus disproportionately on optimizing the KL term. In addition, its hyperparameter $\beta$ is fixed during model training, missing the chance to balance the reconstruction error and KL-divergence.

The core technical challenge behind the above application problems lies in the difficulty of tuning the weight of the KL-divergence term during model training. Inspired by control systems, we address this problem using feedback control. Our controllable variational autoencoder is illustrated in Fig. 1. It samples the output KL-divergence at each training step $t$ and feeds it into an algorithm that tunes the hyperparameter, $\beta(t)$, accordingly, aiming to stabilize KL-divergence at a desired value, called the set point.

We further design a non-linear PI controller, a variant of the PID control algorithm Åström et al. (2006), to tune the hyperparameter $\beta(t)$. PID control is the basic and most prevalent form of feedback control in a large variety of industrial Åström et al. (2006) and software performance control Hellerstein et al. (2004) applications. The basic idea of the PID algorithm is to calculate an error, $e(t)$, between a set point (in this case, the desired KL-divergence) and the current value of the controlled variable (in this case, the actual KL-divergence), then apply a correction in a direction that reduces that error. The correction is applied to some intermediate, directly accessible variable (in our case, $\beta(t)$) that influences the value of the variable we ultimately want to control (KL-divergence). In general, the correction computed by the controller is the weighted sum of three terms: one changes with the error (P), one changes with the integral of the error (I), and one changes with the derivative of the error (D). In a nonlinear controller, the changes can be described by nonlinear functions. Note that, since derivatives essentially compute the slope of a signal, when the signal is noisy, the slope often responds more to variations induced by noise. Hence, following established best practices in control of noisy systems, we do not use the derivative (D) term in our specific controller. Next, we introduce VAEs and our objective in more detail.

Figure 1: Framework of ControlVAE. It combines a controller with the basic VAE framework to stabilize the KL-divergence at a specified value by automatically tuning the weight $\beta(t)$ in the objective.

2.1 The Variational Autoencoder (VAE)

Suppose that we have a dataset $\mathbf{x}$ of i.i.d. samples that are generated by the ground-truth latent variable $\mathbf{z}$, interpreted as the representation of the data. Let $p_\theta(\mathbf{x}|\mathbf{z})$ denote a probabilistic decoder with a neural network to generate data $\mathbf{x}$ given the latent variable $\mathbf{z}$. The distribution of the representation corresponding to the dataset $\mathbf{x}$ is approximated by the variational posterior, $q_\phi(\mathbf{z}|\mathbf{x})$, which is produced by an encoder with a neural network. The Variational Autoencoder (VAE) Rezende et al. (2014); Kingma and Welling (2013) has been one of the most popular generative models. The basic idea of the VAE can be summarized as follows: (1) the VAE encodes the input data samples $\mathbf{x}$ into a latent variable $\mathbf{z}$ as its distribution of representation via a probabilistic encoder, which is parameterized by a neural network; (2) it then adopts the decoder to reconstruct the original input data based on samples from $\mathbf{z}$. The VAE tries to maximize the marginal likelihood of the reconstructed data, but this involves intractable posterior inference. Thus, researchers adopt backpropagation and stochastic gradient descent Kingma and Welling (2013) to optimize its variational lower bound on the log likelihood Kingma and Welling (2013):

$$\mathcal{L}_{vae} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) \tag{1}$$

where $p(\mathbf{z})$ is the prior distribution, e.g., a standard Gaussian, and $p_\theta(\mathbf{x}|\mathbf{z})$ and $q_\phi(\mathbf{z}|\mathbf{x})$ denote the distributions parameterized by neural networks with the corresponding parameters $\theta$ and $\phi$, respectively. The first term in (1) is the reconstruction term, while the latter term is called the KL-divergence. In addition, a reparameterization trick is used to calculate the gradient of the lower bound with respect to $\phi$ Ivanov et al. (2018). It is defined by $\mathbf{z} = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\mu$ and $\sigma$ are the mean and standard deviation produced by the encoder.
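
As a rough illustration of the two terms in (1), the following sketch evaluates a single-sample estimate of the lower bound for one data point. It assumes a diagonal Gaussian posterior, a standard Gaussian prior, and a Bernoulli decoder for binary data; the function names and the `log_var` parameterization are illustrative choices, not taken from the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), one value per latent dim."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def gaussian_kl(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, x_prob, tiny=1e-7):
    """log p(x|z) for binary data, summed over data dims (assumed decoder type)."""
    total = 0.0
    for xi, pi in zip(x, x_prob):
        pi = min(max(pi, tiny), 1.0 - tiny)  # keep log() finite
        total += xi * math.log(pi) + (1 - xi) * math.log(1.0 - pi)
    return total


def elbo_estimate(x, x_prob, mu, log_var):
    """Reconstruction term minus KL term, as in (1), for one data point."""
    return bernoulli_log_likelihood(x, x_prob) - gaussian_kl(mu, log_var)
```

In a real model, `x_prob` would come from running the decoder on a sample returned by `reparameterize`, so that gradients can flow back to the encoder outputs.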

However, the basic VAE models cannot explicitly control the KL-divergence to a specified value. They also often suffer from KL vanishing (in language modeling Bowman et al. (2015); Liu et al. (2019)), which means the KL-divergence becomes zero during optimization. To remedy this issue, one popular approach is to add a hyperparameter $\beta$ on the KL term Bowman et al. (2015); Liu et al. (2019), and then gradually increase it from 0 to 1. However, existing methods, such as KL cost annealing and cyclical annealing Bowman et al. (2015); Liu et al. (2019), cannot totally avert KL vanishing because they blindly vary the hyperparameter during model training.
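
To make the contrast with feedback control concrete, the two open-loop schedules mentioned above can be sketched as plain functions of the training step. This is a minimal sketch: the parameter names (`midpoint`, `steepness`, `ramp_fraction`) are hypothetical, and the cited papers may parameterize their schedules differently. Neither function looks at the KL-divergence actually produced by the model, which is the sense in which the weight is varied blindly.

```python
import math


def sigmoid_annealing(step, midpoint, steepness):
    """Cost-annealing style weight: rises monotonically from ~0 to ~1."""
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))


def cyclical_annealing(step, total_steps, n_cycles, ramp_fraction=0.5):
    """Cyclical weight: a linear 0 -> 1 ramp repeated n_cycles times.

    Within each cycle the weight ramps up during the first `ramp_fraction`
    of the cycle and is then held at 1 until the next cycle restarts at 0.
    """
    cycle_len = total_steps / n_cycles
    position = (step % cycle_len) / cycle_len  # progress within the cycle, in [0, 1)
    return min(1.0, position / ramp_fraction)
```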

2.2 $\beta$-VAE

$\beta$-VAE Higgins et al. (2017); Chen et al. (2018a) is an extension to the basic VAE framework, often used as an unsupervised method for learning a disentangled representation of the data generative factors. A disentangled representation, according to the literature Bengio et al. (2013), is defined as one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors. Compared to the original VAE, $\beta$-VAE adds an extra hyperparameter $\beta$ as a weight on the KL-divergence in the original VAE objective (1). It can be expressed by

$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta\, D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) \tag{2}$$

In order to discover more disentangled factors, researchers further put a constraint on the total information capacity, $C$, to control the capacity of the information bottleneck (KL-divergence) Burgess et al. (2018). The Lagrangian method is then adopted to solve the following optimization problem:

$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta\, \left| D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) - C \right| \tag{3}$$

where $\beta$ is a large hyperparameter (e.g., 100).

However, one drawback of $\beta$-VAE is that it obtains good disentangling at the cost of reconstruction quality. When the weight $\beta$ is large, the optimization algorithm tends to optimize the second term in (3), leading to a high reconstruction error.
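
The effect of a large weight can be seen with a small numerical sketch. The two functions below evaluate the weighted objective and its capacity-constrained variant for given scalar values of the reconstruction log-likelihood and the KL-divergence; the numbers used in the example are made up for illustration and are not results from the paper.

```python
def beta_vae_objective(log_likelihood, kl, beta):
    """Weighted lower bound: reconstruction term minus beta times the KL term."""
    return log_likelihood - beta * kl


def capacity_objective(log_likelihood, kl, beta, capacity):
    """Capacity-constrained variant: penalizes the gap between KL and C."""
    return log_likelihood - beta * abs(kl - capacity)


# Hypothetical values: a KL that is 2 nats away from the capacity C = 18.
penalty_small_beta = capacity_objective(-100.0, 20.0, beta=1.0, capacity=18.0)
penalty_large_beta = capacity_objective(-100.0, 20.0, beta=100.0, capacity=18.0)
```

With the weight set to 100, the 2-nat gap contributes 200 to the objective, twice the size of the reconstruction term in this made-up example, so an optimizer gains more from closing the KL gap than from improving reconstruction.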

The above background suggests that a common challenge in applying VAEs (and their extensions) lies in appropriate weight allocation between the reconstruction accuracy and the KL-divergence in the VAE objective function. As mentioned earlier, we solve this using a nonlinear PI controller that manipulates the value of the non-negative hyperparameter, $\beta(t)$. This algorithm is described next.

3 The ControlVAE Algorithm

During model training, we sample the output KL-divergence, which we denote by $\hat{v}_{kl}(t)$, at training step $t$. The sampled KL-divergence is then compared to the set point, $v_{kl}$, and the difference, $e(t) = v_{kl} - \hat{v}_{kl}(t)$, is then used as the feedback to a controller to calculate the hyperparameter $\beta(t)$. ControlVAE can be expressed by the following variational lower bound:

$$\mathcal{L} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta(t)\, D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) \tag{4}$$

When KL-divergence drops below the set point, the controller counteracts this change by reducing the hyperparameter $\beta(t)$ (to reduce the penalty for KL-divergence in the objective function (4)). The reduced weight, $\beta(t)$, allows KL-divergence to grow, thus approaching the set point again. Conversely, when KL-divergence grows above the set point, the controller increases $\beta(t)$ (up to a certain value), thereby increasing the penalty for KL-divergence and forcing it to decrease. This effect is achieved by computing $\beta(t)$ using Equation (5), below, which is an instance of nonlinear PI control:

$$\beta(t) = \frac{K_p}{1 + \exp(e(t))} - K_i \sum_{j=0}^{t} e(j) + \beta_{min} \tag{5}$$

where $K_p$ and $K_i$ are constants. The first term (on the right-hand side) ranges between 0 and $K_p$ thanks to the exponential function $\exp(\cdot)$. Note that when the error is large and positive (KL-divergence is below the set point), the first term approaches 0, leading to a lower $\beta(t)$ that encourages KL-divergence to grow. Conversely, when the error is large and negative (KL-divergence is above the set point), the first term approaches its maximum (which is $K_p$), leading to a higher $\beta(t)$ that encourages KL-divergence to shrink.

The second term of the controller sums (integrates) past errors with a fixed sampling period (one training step in this paper). This creates a progressively stronger correction (until the sign of the error changes). The negative sign ensures that while errors remain positive (i.e., when KL-divergence is below the set point), this term continues to decrease, whereas while errors remain negative (i.e., when KL-divergence is above the set point), this term continues to increase. In both cases, the change forces $\beta(t)$ in a direction that helps KL-divergence approach the set point. In particular, note that when the error becomes zero, the second term (and thus the entire right-hand side) stops changing, allowing the controller output, $\beta(t)$, to stay at the same value that hopefully caused the zero error in the first place. This allows the controller to "lock in" the value of $\beta(t)$ that meets the KL-divergence set point. Finally, $\beta_{min}$ is an application-specific constant. It effectively shifts the range within which $\beta(t)$ is allowed to vary. This PI controller is illustrated in Fig. 2.

Figure 2: PI controller. It uses the output KL-divergence at training step $t$ as the feedback to the PI algorithm to compute $\beta(t)$.

3.1 PI Parameter Tuning for ControlVAE

One challenge of applying the PI control algorithm lies in how to tune its parameters, $K_p$ and $K_i$, effectively. While optimal tuning of nonlinear controllers is non-trivial, in this paper we follow a very simple rule: tune these constants to ensure that reactions to errors are sufficiently smooth to allow gradual convergence. Let us first consider the coefficient $K_p$. Observe that the maximum (positive) error occurs when the actual KL-divergence is close to zero. In this case, if $v_{kl}$ is the set point on KL-divergence, then the error, $e(t)$, is approximated by $e(t) \approx v_{kl}$. When KL-divergence is too small, the VAE does not learn useful information from input data Liu et al. (2019). We need to assign $\beta(t)$ a very small non-negative value, so that KL-divergence is encouraged to grow (when the resulting objective function is optimized). In other words, temporarily ignoring other terms in Equation (5), the contribution of the first term alone should be sufficiently small:

$$\frac{K_p}{1 + \exp(v_{kl})} \le \epsilon \tag{6}$$

where $\epsilon$ is a small constant (e.g., in our implementation). The above (6) can also be rewritten as $K_p \le (1 + \exp(v_{kl}))\,\epsilon$. Empirically, we find that leads to good performance and satisfies the above constraint.
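
As a quick worked example of the bound on the P coefficient, the helper below evaluates the right-hand side of the rewritten constraint for a given set point. The tolerance value passed in the example is a placeholder chosen only for illustration, not a value taken from the paper; the set point of 3 matches one of the set points used later in the language-modeling experiments.

```python
import math


def kp_upper_bound(kl_set_point, epsilon):
    """Largest P coefficient for which the P term stays below epsilon
    when the sampled KL-divergence is near zero (error ~ set point)."""
    return (1.0 + math.exp(kl_set_point)) * epsilon


# Hypothetical tolerance of 1e-3 with a KL set point of 3.
bound = kp_upper_bound(3.0, 1e-3)
```

Because the bound grows exponentially with the set point, larger set points leave much more room for the P coefficient under the same tolerance.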

Conversely, when the actual KL-divergence is much larger than the desired value $v_{kl}$, the error $e(t)$ becomes a large negative value. As a result, the first term in (5) becomes close to a constant, $K_p$. If the resulting larger value of $\beta(t)$ is not enough to cause KL-divergence to shrink, one needs to gradually continue to increase $\beta(t)$. This is the job of the second term. The negative sign in front of that term ensures that when negative errors continue to accumulate, the positive output $\beta(t)$ continues to increase. Since it takes many steps to train deep VAE models, the increase per step should be very small, favoring smaller values of $K_i$. Empirically we found that a value of $K_i$ between and stabilizes the training. Note that $K_i$ should not be too small either, because it would then unnecessarily slow down the convergence.

3.2 Set Point Guidelines for ControlVAE

For generative models, human graders are often needed to evaluate the generated samples, so it is very hard to determine the optimal set point of KL-divergence for ControlVAE. Nevertheless, rules of thumb may apply from a controllability perspective. Note that, as we alluded to earlier, $\beta_{min}$ is application-specific. In general, when $\beta(t) \in [\beta_{min}, \beta_{max}]$, the upper bound of the expected KL-divergence is the value of KL-divergence as ControlVAE converges when $\beta(t) = \beta_{min}$, denoted by $V_{max}$. Similarly, its lower bound, $V_{min}$, can be defined as the KL-divergence produced by ControlVAE when $\beta(t) = \beta_{max}$. For feedback control to be most effective (i.e., not run against the above limits), the KL-divergence set point should be somewhere in the middle between these extremes. The closer it is to an extreme, the worse the controllability in one of the directions. Finally, if the set point is outside the interval $[V_{min}, V_{max}]$, then manipulating $\beta(t)$ within the interval $[\beta_{min}, \beta_{max}]$ will be ineffective at maintaining KL-divergence at that set point.

3.3 Summary of the PI Control Algorithm

We summarize the proposed PI control algorithm in Algorithm 1. Our PI algorithm updates the hyperparameter, $\beta(t)$, with the feedback from the sampled KL-divergence, $\hat{v}_{kl}(t)$, at training step $t$. Line 6 computes the error between the desired KL-divergence, $v_{kl}$, and the sampled $\hat{v}_{kl}(t)$. Lines 7 to 9 calculate the P term and the I term of the PI algorithm, respectively. Note that Lines 10 and 11 implement a popular constraint in PID/PI design, called anti-windup (Azar and Serrano, 2015; Peng et al., 1996). It effectively disables the integral term of the controller when the controller output gets out of range, so as not to exacerbate the out-of-range deviation. Line 13 computes the hyperparameter $\beta(t)$ with the PI algorithm in (5). Finally, Lines 14 to 19 limit $\beta(t)$ to a certain range, $[\beta_{min}, \beta_{max}]$.

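1:  Input: desired KL $v_{kl}$, coefficients $K_p$, $K_i$, max/min value $\beta_{max}$, $\beta_{min}$, iterations $N$
2:  Output: hyperparameter $\beta(t)$ at training step $t$
3:  Initialization: $I(0) = 0$, $\beta(0) = 0$
4:  for $t = 1$ to $N$ do
5:     Sample KL-divergence, $\hat{v}_{kl}(t)$
6:     $e(t) \leftarrow v_{kl} - \hat{v}_{kl}(t)$
7:     $P(t) \leftarrow K_p / (1 + \exp(e(t)))$
8:     if $\beta_{min} \le \beta(t-1) \le \beta_{max}$ then
9:           $I(t) \leftarrow I(t-1) - K_i\, e(t)$
10:     else
11:           $I(t) \leftarrow I(t-1)$ // Anti-windup
12:     end if
13:     $\beta(t) \leftarrow P(t) + I(t) + \beta_{min}$
14:     if $\beta(t) > \beta_{max}$ then
15:           $\beta(t) \leftarrow \beta_{max}$
16:     end if
17:     if $\beta(t) < \beta_{min}$ then
18:           $\beta(t) \leftarrow \beta_{min}$
19:     end if
20:     Return $\beta(t)$
21:  end for
Algorithm 1 PI algorithm.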
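
The control loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' implementation: the anti-windup test uses the pre-clamp controller output and still lets the integral move back toward the valid range (one common variant), the starting output is assumed to be the lower limit, and `toy_kl` is a hypothetical stand-in for a VAE whose KL-divergence shrinks as the weight grows. The coefficient values in `simulate` are placeholders for the toy problem.

```python
import math


class PIController:
    """Nonlinear PI controller for the KL weight, in the spirit of Eq. (5)."""

    def __init__(self, kl_set_point, k_p, k_i, beta_min, beta_max):
        self.kl_set_point = kl_set_point
        self.k_p = k_p
        self.k_i = k_i
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.integral = 0.0       # accumulated I term
        self.raw_beta = beta_min  # pre-clamp output of the previous step (assumed start)

    def step(self, sampled_kl):
        """Return the weight for the next training step given the sampled KL."""
        error = self.kl_set_point - sampled_kl
        # P term lies in (0, k_p); the exponent is capped to avoid float overflow.
        p_term = self.k_p / (1.0 + math.exp(min(error, 50.0)))
        delta = -self.k_i * error
        # Anti-windup: skip updates that would push a saturated output further out.
        too_high = self.raw_beta > self.beta_max and delta > 0
        too_low = self.raw_beta < self.beta_min and delta < 0
        if not (too_high or too_low):
            self.integral += delta
        self.raw_beta = p_term + self.integral + self.beta_min
        # Clamp the returned weight to [beta_min, beta_max].
        return min(max(self.raw_beta, self.beta_min), self.beta_max)


def toy_kl(beta):
    """Hypothetical plant: the KL-divergence decreases as the weight increases."""
    return 8.0 * math.exp(-2.0 * beta)


def simulate(steps=3000):
    """Run the controller against the toy plant; return final (weight, KL)."""
    controller = PIController(kl_set_point=3.0, k_p=0.01, k_i=0.001,
                              beta_min=0.0, beta_max=1.0)
    beta = 0.0
    for _ in range(steps):
        beta = controller.step(toy_kl(beta))
    return beta, toy_kl(beta)
```

In a training loop, `step` would be called once per iteration with the KL-divergence measured on the current mini-batch, and the returned weight would multiply the KL term of the loss for that iteration.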

3.4 Applications of ControlVAE

As a preliminary demonstration of the general applicability of the above approach and as an illustration of its customizability, we apply ControlVAE to three different applications stated below.

  • Language modeling: We first apply ControlVAE to solve the KL vanishing problem while also improving the diversity of generated data. As mentioned in Section 2.1, VAE models often suffer from KL vanishing in language modeling. The existing methods cannot completely solve the KL vanishing problem because they blindly change $\beta$ in the VAE objective without monitoring the output KL-divergence during model training. In this paper, we adopt ControlVAE to control KL-divergence to a specified value, using the output KL-divergence as feedback, to avoid KL vanishing. Following the PI tuning strategy in Section 3.1, we set $K_p$, $K_i$ of the PI algorithm in (5) to and , respectively. In addition, $\beta_{min}$ is set to and the maximum value of $\beta(t)$ is limited to .

  • Disentangling: We then apply the ControlVAE model to achieve a better trade-off between reconstruction quality and disentangling. As mentioned in Section 2.2, $\beta$-VAE () assigns a large hyperparameter to the objective function to control the KL-divergence (information bottleneck), which, however, leads to a large reconstruction error. To mitigate this issue, we adopt ControlVAE to automatically adjust the hyperparameter $\beta(t)$ based on the output KL-divergence during model training. Using a methodology similar to that of Burgess et al. (2018), we train a single model by gradually increasing KL-divergence from to a desired value $C$. However, different from $\beta$-VAE, which linearly increases $C$, we adopt a step function to increase $C$ for every training steps in order to stabilize model training. Since , we set $\beta_{min}$ to for the PI algorithm in (5). Following the PI tuning method above, the coefficients $K_p$ and $K_i$ are set to and , respectively.

  • Image generation: The basic VAE models tend to produce blurry and unrealistic samples for image generation Zhao et al. (2017a). In this paper, we try to leverage ControlVAE to manipulate (slightly increase) the value of KL-divergence to improve the reconstruction quality of generated images. Different from the original VAE ($\beta(t) = 1$), we extend the range of the hyperparameter, $\beta(t)$, from to in our ControlVAE model. Given a desired KL-divergence, ControlVAE can automatically tune $\beta(t)$ within that range. For this task, we use the same PI control algorithm and hyperparameters as in the language modeling task above.
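
The step-function schedule for the KL set point described in the disentangling item can be sketched as follows. The parameter names and the numbers in the example call are hypothetical placeholders for illustration; only the target of 18 corresponds to a set point that appears in the experiments.

```python
def stepped_set_point(step, start, target, increment, interval):
    """Raise the KL set point by `increment` every `interval` training steps,
    holding it constant in between and never exceeding `target`."""
    return min(target, start + increment * (step // interval))


# Hypothetical schedule: start at 1.0, add 0.5 every 1000 steps, stop at 18.
schedule = [stepped_set_point(s, 1.0, 18.0, 0.5, 1000) for s in (0, 999, 1000, 50000)]
```

Holding the set point constant between increments gives the controller time to settle at each level before the target moves again, which is the stabilizing effect the step function is meant to provide.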

4 Experiments

We evaluate the performance of ControlVAE on real-world datasets in the three different applications mentioned above. Source code will be publicly available upon publication.

Figure 3: Performance comparison for different methods on the PTB data. Panels: (a) KL divergence; (b) Reconstruction loss. (a) shows that ControlVAE and Cyclical annealing ( cycles) can avert KL vanishing, while Cost annealing still suffers from KL vanishing after and training steps. Moreover, ControlVAE can control the KL-divergence and also has lower reconstruction errors than the other methods in (b).
| Methods/metric | Dis-1 | Dis-2 | Dis-3 | Dis-4 |
|---|---|---|---|---|
| ControlVAE-KL-35 | 6.27K ± 41 | 95.86K ± 1.02K | 274.36K ± 3.02K | 405.65K ± 3.94K |
| ControlVAE-KL-25 | 6.10K ± 60 | 83.15K ± 4.00K | 244.29K ± 12.11K | 385.46K ± 15.79K |
| Cost annealing (KL = 17) | 5.71K ± 87 | 69.60K ± 1.53K | 208.62K ± 4.04K | 347.65K ± 5.85K |
| Cyclical (KL = 21.5) | 5.79K ± 81 | 71.63K ± 2.04K | 211.29K ± 6.38K | 345.17K ± 11.65K |

Table 1: Performance comparison for different methods on dialog generation using SW data. We use Dis-$n$ to measure the diversity of generated dialog, averaged over random seeds. Higher is better.

4.1 Datasets

The datasets used for our experiments are introduced below.

  • Language modeling: 1) Penn Tree Bank (PTB) Marcus et al. (1993): it consists of training sentences, validation sentences and testing sentences. 2) Switchboard (SW) Godfrey and Holliman (1997): it has two-sided telephone conversations with manually transcribed speech and alignment. The data is randomly split into , and dialogs for training, validation and testing.

  • Disentangling: 1) 2D Shapes Matthey et al. (2017): it has binary images of 2D shapes with five ground-truth factors (number of values): shape (3), scale (6), orientation (40), x-position (32), y-position (32) Kim and Mnih (2018).

  • Image generation: 1) CelebA (cropped version) Liu et al. (2015): it has RGB images of celebrity faces. The data is split into and images for training and testing.

4.2 Model Configurations

The detailed model configurations and hyperparameter settings for each model are presented in Appendix A.

4.3 Evaluation on Language Modeling

First, we compare the performance of ControlVAE with the following baselines for mitigating KL vanishing in text generation Bowman et al. (2015). Cost annealing Bowman et al. (2015): This method gradually increases the hyperparameter on KL-divergence from 0 to 1 after a given number of training steps using a sigmoid function. Cyclical annealing Liu et al. (2019): This method splits the training process into cycles, each of which increases the hyperparameter from 0 to 1 using a linear function.

Fig. 3 illustrates the comparison results of KL-divergence, reconstruction loss and the hyperparameter $\beta(t)$ for different methods on the PTB dataset. Note that here ControlVAE-KL-$v$ means we set the KL-divergence to a desired value $v$ (e.g., 3) for our PI controller, following the set point guidelines in Section 3.2. Cost-annealing-$v$ means we increase the hyperparameter, $\beta$, from 0 to 1 after $v$ steps. We observe from Fig. 3(a) that ControlVAE (KL=1.5, 3) and Cyclical annealing ( cycles) can avert KL vanishing. However, ControlVAE is able to stabilize the KL-divergence while cyclical annealing cannot. Moreover, our method has a lower reconstruction loss than cyclical annealing in Fig. 3(b). The cost annealing method still suffers from KL vanishing, because we use the Transformer Vaswani et al. (2017) as the decoder, which can predict the current data based on previous ground-truth data. Fig. 3(c) illustrates the tuning result of $\beta(t)$ by ControlVAE compared with the other methods. We can see that our $\beta(t)$ gradually converges to around a certain value. Note that here $\beta(t)$ of ControlVAE does not converge to 1 because we slightly increase the value of KL-divergence (produced by the original VAE) in order to improve the diversity of generated data.

In order to further demonstrate that ControlVAE can improve the diversity of generated text, we apply it to dialog-response generation using the Switchboard (SW) dataset. Following Zhao et al. (2017b), we adopt a conditional VAE Zhao et al. (2017b) that generates dialog conditioned on the previous response. According to the literature Xu et al. (2018), the metric Dis-$n$, the number of distinct $n$-grams, is used to measure the diversity of generated data. Table 1 illustrates the comparison results for different approaches. We can see that ControlVAE has more distinct $n$-grams than Cost annealing and Cyclical annealing when the desired KL-divergence is set to 35 and 25. Thus, we can conclude that ControlVAE can improve the diversity of generated data by slightly increasing the KL-divergence of the original VAE. We also illustrate some examples of dialog generated by ControlVAE in Appendix B.
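
The diversity metric described above, counting distinct n-grams across generated responses, can be sketched as follows. This is a minimal sketch that assumes each response is already tokenized into a list of strings; the exact tokenization and any normalization used in the cited work are not specified here and may differ.

```python
def distinct_ngrams(responses, n):
    """Count distinct n-grams over a collection of tokenized responses."""
    seen = set()
    for tokens in responses:
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
    return len(seen)


# Toy usage with two short made-up responses.
responses = [["i", "see", "i", "see"], ["see", "you"]]
unigram_count = distinct_ngrams(responses, 1)
bigram_count = distinct_ngrams(responses, 2)
```

A model that repeats the same phrases yields few distinct n-grams regardless of how much text it generates, which is why a higher count is read as higher diversity.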

Figure 4: Panels: (a) Reconstruction loss; (c) Disentangled factors. (a) and (b) show the comparison of reconstruction error and $\beta(t)$ using 2D Shapes data over random seeds. ControlVAE (KL=16, 18) has lower reconstruction errors and variance compared to $\beta$-VAE. (c) shows an example of the disentangled factors in the latent variable as the total KL-divergence increases from to for ControlVAE (KL=18). Each curve with positive KL-divergence (except the black one) represents one factor disentangled by ControlVAE.

4.4 Evaluation on Disentangled Representations

We then evaluate the performance of ControlVAE on the learning of disentangled representations using 2D Shapes data. We compare it with two baselines: FactorVAE Kim and Mnih (2018) and $\beta$-VAE Burgess et al. (2018).

| Metric | ControlVAE (KL=16) | ControlVAE (KL=18) | $\beta$-VAE () | FactorVAE () |
|---|---|---|---|---|
| MIG | 0.5519 ± 0.0323 | 0.5146 ± 0.0199 | 0.5084 ± 0.0476 | 0.5139 ± 0.0428 |

Table 2: Performance comparison of different methods using the disentanglement metric, MIG score, averaged over 5 random seeds. Higher is better. ControlVAE (KL=16) has a higher MIG score and lower variance than the baselines with the default parameters.

Figure 5: Rows: latent traversals ordered by the value of KL-divergence with the prior, in descending order. Following Burgess et al. (2018), we initialize the latent representation from a seed image, and then traverse a single latent dimension in a range of , while keeping the remaining latent dimensions fixed. ControlVAE can disentangle all five generative factors for 2D Shapes data, while $\beta$-VAE entangles the scale and shape (in the 3rd row) and FactorVAE does not disentangle orientation (in the 4th row) very well.

Fig. 4(a) and (b) show the comparison of reconstruction error and the hyperparameter $\beta(t)$ (using random seeds) for different models. We can observe from Fig. 4(a) that ControlVAE (KL=16, 18) has lower reconstruction error and variance than the baselines. This is because ControlVAE automatically adjusts the hyperparameter, $\beta(t)$, to stabilize the KL-divergence, while the other two methods keep the hyperparameter unchanged during model training. Specifically, for ControlVAE (KL=18), the hyperparameter $\beta(t)$ is high in the beginning in order to obtain good disentangling, and then it gradually drops to around as the training converges, as shown in Fig. 4(b). In contrast, $\beta$-VAE () has a large and fixed weight on the KL-divergence so that its optimization algorithm tends to optimize the KL-divergence term, leading to a large reconstruction error. In addition, Fig. 4(c) illustrates an example of KL-divergence per factor in the latent code as training progresses and the total information capacity (KL-divergence) increases from to . We can see that ControlVAE disentangles all five generative factors, starting from the positional latents ($x$ and $y$), then scale, followed by orientation and then shape.

To further demonstrate that ControlVAE can achieve better disentangling, we use a disentanglement metric, mutual information gap (MIG) Chen et al. (2018a), to compare the performance of the models, as shown in Table 2. It can be observed that ControlVAE (KL=16) has a higher MIG score and lower variance than the other methods. Besides, we show the qualitative results of different models in Fig. 5. We can observe that ControlVAE can discover all five generative factors: positional latents ($x$ and $y$), scale, orientation and shape. However, $\beta$-VAE () disentangles only four generative factors, entangling scale and shape together (in the third row), while FactorVAE () does not disentangle the orientation factor very well in the fourth row in Fig. 5. Thus, ControlVAE achieves better reconstruction quality and disentangling than the baselines.

4.5 Evaluation on Image Generation

Finally, we compare the reconstruction quality of image generation for ControlVAE and the original VAE. Fig. 6 shows the comparison of reconstruction error and KL-divergence under different desired values of KL-divergence for random seeds. We can see from Fig. 6(a) that ControlVAE-KL-200 (KL=200) has the lowest reconstruction error among them. In addition, when we set the desired KL-divergence to 170 (same as the basic VAE in Fig. 6(b)), ControlVAE has the same reconstruction error as the original VAE. At that point, ControlVAE becomes the original VAE as $\beta(t)$ finally converges to 1, as shown in Fig. 7 in Appendix C.

We further adopt two commonly used metrics for image generation, FID Lucic et al. (2018) and SSIM Chen et al. (2018b), to evaluate the performance of ControlVAE in Table 3. It can be observed that ControlVAE-KL-200 outperforms the other methods in terms of FID and SSIM. Therefore, ControlVAE can improve the reconstruction quality of generated images by controlling the value of KL-divergence. We also show some generated images in Appendix D to verify that ControlVAE has better reconstruction quality than the basic VAE.

Figure 6: Performance comparison for different methods on the CelebA data averaged over random seeds. (a) Reconstruction loss. (b) KL-divergence.

| Methods/metric | FID | SSIM |
|---|---|---|
| ControlVAE-KL-200 | 55.16 ± 0.187 | 0.687 ± 0.0002 |
| ControlVAE-KL-180 | 57.57 ± 0.236 | 0.679 ± 0.0003 |
| ControlVAE-KL-170 | 58.75 ± 0.286 | 0.675 ± 0.0001 |
| Original VAE | 58.71 ± 0.207 | 0.675 ± 0.0001 |

Table 3: Performance comparison for different methods on CelebA data over random seeds. FID: lower is better. SSIM: higher is better.

5 Related Work

In language modeling, VAEs often suffer from KL vanishing because of a powerful decoder, such as a Transformer Vaswani et al. (2017) or an LSTM. In recent years, researchers have developed many methods to tackle this problem, such as KL cost annealing Bowman et al. (2015), cyclical annealing, and a dilated CNN decoder Yang et al. (2017). However, these methods can neither fully solve the KL vanishing issue nor explicitly control the KL divergence. In contrast, our approach averts KL vanishing by using a PI control algorithm to automatically tune the hyperparameter in the objective based on the output KL divergence.
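
The feedback idea can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the controller form (a sigmoid-shaped P term plus an accumulated I term with simple anti-windup), the class name `PIController`, and the gains used in the example are all illustrative choices. The controller lowers the KL weight when the observed KL falls below the set point and raises it when the KL overshoots.

```python
import math


class PIController:
    """Sketch of a non-linear PI controller for the KL weight (assumed form)."""

    def __init__(self, kp, ki, beta_min=0.0, beta_max=1.0):
        self.kp, self.ki = kp, ki
        self.beta_min, self.beta_max = beta_min, beta_max
        self.integral = 0.0  # accumulated I term

    def update(self, kl_target, kl_observed):
        error = kl_target - kl_observed
        # P term: bounded in (0, kp); the error is capped to avoid overflow in exp.
        p_term = self.kp / (1.0 + math.exp(min(error, 50.0)))
        # I term: positive error (KL too small) pushes the weight down.
        i_candidate = self.integral - self.ki * error
        beta = p_term + i_candidate + self.beta_min
        # Anti-windup: only keep integrating while the output is not saturated.
        if self.beta_min <= beta <= self.beta_max:
            self.integral = i_candidate
        return min(max(beta, self.beta_min), self.beta_max)


# Hypothetical gains and KL values, for illustration only.
controller = PIController(kp=0.01, ki=0.0001)
for step in range(3):
    beta = controller.update(kl_target=3.0, kl_observed=10.0)
    print(step, round(beta, 6))  # the weight grows while KL exceeds the target
```

In a training loop, the returned weight would multiply the KL term of the loss at each step, with the batch KL divergence fed back as `kl_observed`.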

Recently, researchers proposed a modification of VAE, called β-VAE Higgins et al. (2017), to learn disentangled representations. They assigned a large value to the hyperparameter β to disentangle the generative factors. However, β-VAE sacrifices reconstruction quality in order to obtain better disentangling. Researchers then developed other models, such as FactorVAE Kim and Mnih (2018); Kim et al. (2019) and TCVAE Chen et al. (2018a), to improve the reconstruction quality. The drawback of these methods is that they assign a fixed hyperparameter to the KL term or to the decomposed terms in the objective. In contrast, ControlVAE automatically tunes the hyperparameter during optimization to stabilize the KL divergence, and it can also be used as a plug-in replacement for existing methods.

VAE and its variants are also applied to image generation, but the generated samples are blurry and unrealistic Zhao et al. (2017a). To improve performance, researchers developed a variational lossy autoencoder (VLAE) by borrowing ideas from autoregressive flow. However, the computational cost of VLAE is high. In addition, researchers adopted constrained optimization for reconstruction Rezende and Viola (2018); Klushyn et al. (2019) to achieve a trade-off between reconstruction error and KL-divergence. However, these methods may suffer from posterior collapse if the inference network fails to cover the latent space. Recent studies mainly adopt generative adversarial networks (GANs) Goodfellow et al. (2014); Zhu et al. (2017); Radford et al. (2015); Arjovsky et al. (2017) to improve the quality of generated images. However, GANs are very difficult to train because they easily suffer from mode collapse. Unlike existing methods, we add a hyperparameter on the KL term and then leverage ControlVAE to manipulate the KL divergence in order to reduce the reconstruction error.
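
Written out, the weighted objective referred to here takes the following form. The notation (encoder q_φ, decoder p_θ, prior p(z), and time-varying weight β(t)) is the standard VAE notation and is assumed here, since this section does not restate it.

```latex
\mathcal{L}(t) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - \beta(t)\, D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\Vert\, p(z) \right)
```

Setting β(t) = 1 recovers the original VAE objective, and a fixed β > 1 corresponds to β-VAE.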

6 Conclusion

In this paper, we proposed a general controllable VAE framework, ControlVAE, that combines automatic control with the basic VAE framework to improve the performance of VAE models. We designed a new non-linear PI controller to control the value of the KL divergence during model training. We then evaluated ControlVAE on three different tasks. The results show that ControlVAE improves the ability to disentangle latent factors, averts KL vanishing in language modeling, and improves the reconstruction quality for image generation. Other applications are a topic of the authors’ future research.


References

  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • K. J. Åström and T. Hägglund (2006) Advanced PID control. Vol. 461, ISA-The Instrumentation, Systems, and Automation Society Research Triangle ….
  • A. T. Azar and F. E. Serrano (2015) Design and modeling of anti wind up PID controllers. In Complex system modelling and control through intelligent soft computations, pp. 1–44.
  • Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
  • C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner (2018) Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599.
  • T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018a) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
  • X. Chen, C. Xu, X. Yang, and D. Tao (2018b) Attention-GAN for object transfiguration in wild images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 164–180.
  • E. L. Denton et al. (2017) Unsupervised learning of disentangled representations from video. In Advances in neural information processing systems, pp. 4414–4423.
  • J. Godfrey and E. Holliman (1997) Switchboard-1 release 2-linguistic data consortium. SWITCHBOARD: A User’s Manual.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
  • J. L. Hellerstein, Y. Diao, S. Parekh, and D. M. Tilbury (2004) Feedback control of computing systems. John Wiley & Sons.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-VAE: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6.
  • Z. Hu, H. Shi, B. Tan, W. Wang, Z. Yang, T. Zhao, J. He, L. Qin, D. Wang, et al. (2019) Texar: a modularized, versatile, and extensible toolkit for text generation. In ACL 2019, System Demonstrations.
  • Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1587–1596.
  • O. Ivanov, M. Figurnov, and D. Vetrov (2018) Variational autoencoder with arbitrary conditioning. arXiv preprint arXiv:1806.02382.
  • H. Kim and A. Mnih (2018) Disentangling by factorising. In International Conference on Machine Learning, pp. 2654–2663.
  • M. Kim, Y. Wang, P. Sahu, and V. Pavlovic (2019) Relevance factor VAE: learning and identifying disentangled factors. arXiv preprint arXiv:1902.01568.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • A. Klushyn, N. Chen, R. Kurle, B. Cseke, and P. van der Smagt (2019) Learning hierarchical priors in VAEs. In Advances in Neural Information Processing Systems, pp. 2866–2875.
  • M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708.
  • X. Liu, J. Gao, A. Celikyilmaz, L. Carin, et al. (2019) Cyclical annealing schedule: a simple approach to mitigating KL vanishing. arXiv preprint arXiv:1903.10145.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738.
  • M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018) Are GANs created equal? A large-scale study. In Advances in neural information processing systems, pp. 700–709.
  • M. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of English: the Penn Treebank.
  • L. Matthey, I. Higgins, D. Hassabis, and A. Lerchner (2017) dSprites: disentanglement testing sprites dataset. URL https://github.com/deepmind/dsprites-dataset/. [Accessed on: 2018-05-08].
  • Y. Peng, D. Vrancic, and R. Hanus (1996) Anti-windup, bumpless, and conditioned transfer techniques for PID controllers. IEEE Control systems magazine 16 (4), pp. 48–57.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
  • D. J. Rezende and F. Viola (2018) Taming VAEs. arXiv preprint arXiv:1810.00597.
  • K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
  • W. Wang, Z. Gan, H. Xu, R. Zhang, G. Wang, D. Shen, C. Chen, and L. Carin (2019) Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137.
  • J. Xu, X. Ren, J. Lin, and X. Sun (2018) DP-GAN: diversity-promoting generative adversarial network for generating informative and diversified text. arXiv preprint arXiv:1802.01345.
  • X. Yan, J. Yang, K. Sohn, and H. Lee (2016) Attribute2Image: conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791.
  • Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick (2017) Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3881–3890.
  • S. Zhao, J. Song, and S. Ermon (2017a) Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658.
  • T. Zhao, R. Zhao, and M. Eskenazi (2017b) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.

Appendix A Model Configurations and Hyperparameter Settings

We summarize the detailed model configurations and hyperparameter settings for ControlVAE in the three applications below. Our source code has already been submitted to the review system.

A.1 Experimental Details for Language Modeling

For text generation on PTB data, we build the ControlVAE model on the basic VAE model, as in Bowman et al. [2015]. We use a one-layer LSTM as the encoder, a three-layer Transformer with eight heads as the decoder, and a Multi-Layer Perceptron (MLP) to learn the latent variable. The maximum sequence length for LSTM and Transformer is set to , respectively. The size of the latent variable is set to . We set the dimension of the word embedding to and the batch size to . In addition, the dropout is for LSTM and Transformer. Adam optimization with the learning rate is used during training. Following the tuning guidelines above, we set the coefficients K_p and K_i of the P term and I term to and , respectively. Finally, we use the source code on the Texar platform to implement the experiments Hu et al. [2019].

For dialog-response generation, we follow the model architecture and hyperparameters of the basic conditional VAE in Zhao et al. [2017b]. We use a one-layer bi-directional GRU as the encoder, a one-layer GRU as the decoder, and two fully-connected layers to learn the latent variable. In the experiment, the size of both the latent variable and the word embeddings is set to . The maximum length of the input/output sequence for GRU is set to with batch size . Adam with initial learning rate is used. We use the same K_p and K_i for the PI algorithm as in text generation above. The model architectures of ControlVAE for these two NLP tasks are illustrated in Tables 4 and 5.

| Encoder | Decoder |
|---|---|
| Input words | Input , |
| 1-layer LSTM | FC |
| FC | 3-layer Transformer 8 heads |

Table 4: Encoder and decoder architecture for text generation on PTB data.

| Encoder | Decoder |
|---|---|
| Input words | Input |
| 1-layer bi-GRU | FC |
| FC | 1-layer GRU |

Table 5: Encoder and decoder architecture for dialog generation on Switchboard (SW) data.

A.2 Experimental Details for Disentangling

Following the model architecture of β-VAE Higgins et al. [2017], we adopt a convolutional layer and deconvolutional layer for our experiments. We use Adam optimizer with , and a learning rate tuned from . We set K_p and K_i for the PI algorithm to and , respectively. For the step function, we set the step, , to per training steps as the information capacity (desired KL-divergence) increases from until for 2D Shape data. ControlVAE uses the same encoder and decoder architecture as β-VAE, except that it plugs in the PI control algorithm, as illustrated in Table 6.
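
The step schedule for the desired KL-divergence can be sketched as a small helper. This is a minimal sketch: the function name `kl_setpoint` and every numeric value in the example are hypothetical placeholders, because the actual start value, end value, step size, and interval are not legible in this copy of the text.

```python
def kl_setpoint(step, start, end, increment, every):
    """Desired KL at a given training step.

    Starts at `start`, rises by `increment` once per `every` training
    steps, and is capped at `end`.
    """
    return min(end, start + increment * (step // every))


# Hypothetical schedule: all four numbers are placeholders, not paper values.
for step in (0, 999, 1000, 2500, 10**6):
    print(step, kl_setpoint(step, start=1.0, end=16.0, increment=0.5, every=1000))
```

The value returned at each step would be passed to the PI controller as its set point, so that the information capacity grows gradually rather than jumping to its final value.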

| Encoder | Decoder |
|---|---|
| Input binary image | Input |
| conv. ReLU. stride 2 | FC. 256 ReLU. |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| FC . FC. | upconv. ReLU. stride 2 |

Table 6: Encoder and decoder architecture for disentangled representation learning on 2D Shapes data.
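
To see how a stack of five stride-2 convolutions shrinks the feature map, the sketch below applies the standard convolution output-size formula layer by layer. The 64×64 input, the 4×4 kernel, and the padding of 1 are assumptions chosen for illustration, since the kernel sizes and input dimensions are not legible in the table above; only the stride of 2 and the number of layers come from the table.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_sizes(input_size, n_layers=5, kernel=4, stride=2, padding=1):
    """Feature-map side length after each of `n_layers` identical convs."""
    sizes = [input_size]
    for _ in range(n_layers):
        sizes.append(conv_out(sizes[-1], kernel, stride, padding))
    return sizes


# Assumed 64x64 input; each stride-2 layer halves the side length.
print(encoder_sizes(64))  # [64, 32, 16, 8, 4, 2]
```

Under these assumptions, the decoder's five stride-2 transposed convolutions mirror this progression in reverse.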

A.3 Experimental Details for Image Generation

Similar to the architecture of β-VAE, we use a convolutional layer with batch normalization as the encoder and a deconvolutional layer with batch normalization as the decoder for our experiments. We use Adam optimizer with , and a learning rate for CelebA data. The size of the latent variable is set to , because we find it has better reconstruction quality than and . In addition, we set the desired value of KL-divergence to 170 (same as the original VAE), 180, and 200. For the PI control algorithm, we set K_p and K_i to and , respectively. We also use the same encoder and decoder architecture as β-VAE above, except that we add batch normalization to improve the stability of model training, as shown in Table 7.

| Encoder | Decoder |
|---|---|
| Input RGB image | Input |
| conv. ReLU. stride 2 | FC. 256 ReLU. |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| FC . FC. | upconv. ReLU. stride 2 |

Table 7: Encoder and decoder architecture for image generation on CelebA data.

Appendix B Examples of Generated Dialog by ControlVAE

In this section, we show an example to compare the diversity and relevance of dialog generated by different methods, as illustrated in Table 8. Alice begins with an open-ended conversation on choosing a college. Our model tries to predict the response from Bob. The ground-truth response is “um - hum”. We can observe from Table 8 that ControlVAE (KL=25, 35) generates diverse responses that are relevant to the ground truth. In addition, while cyclical annealing can generate diverse text, some of its responses are not very relevant to the ground-truth response.

Context: (Alice) and a lot of the students in that home town sometimes unk the idea of staying and going to school across the street so to speak
Topic: Choosing a college  Target: (Bob) um - hum
ControlVAE-KL-25 ControlVAE-KL-35 Cost annealing (KL=17) Cyclical anneal (KL=21.5)
yeah uh - huh oh yeah yeah that’s true do you do you do it
um - hum yeah uh - huh yeah
oh that’s right um - hum oh yeah oh absolutely right um - hum
yes right uh - huh and i think we have to be together yeah that’s a good idea
right um - hum oh well that’s neat yeah well yeah i see it too,it’s a neat place
Table 8: Examples of generated dialog for different methods. Our model tries to predict the response from Bob. The responses generated by ControlVAE (KL=25, 35) are relevant and diverse compared with the ground truth. However, some of the responses generated by cost annealing and cyclical annealing are not very relevant to the ground-truth data.

Appendix C β(t) of ControlVAE for Image Generation on CelebA Data

Fig. 7 compares β(t) for different methods during model training. We can observe that β(t) finally converges to 1 when the desired value of KL-divergence is set to 170, the same as the original VAE. At this point, ControlVAE becomes the original VAE. Thus, ControlVAE can be customized by users for different applications.

Figure 7: Hyperparameter β(t) of ControlVAE for image generation on CelebA data for 3 random seeds. If we set the desired value of KL-divergence to 170, the hyperparameter β(t) gradually approaches 1, which means that ControlVAE becomes the original VAE.

Appendix D Examples of Generated Images by VAE and ControlVAE

We also show some images generated by ControlVAE and the original VAE in Fig. 8. The images generated by ControlVAE-KL-200 (KL = 200) have better reconstruction quality than those of the original VAE. Take the woman in the first row, last column as an example. The woman does not show her teeth in the ground-truth image. However, the woman generated by the original VAE smiles with her mouth open. In contrast, the woman generated by ControlVAE-KL-200 hardly shows her teeth when smiling. In addition, the other two examples marked with blue rectangles show that ControlVAE-KL-200 better reconstructs the “smile” of the man and the “ear” of the woman than the original VAE. Therefore, we conclude that ControlVAE can improve the reconstruction quality by slightly increasing (controlling) the KL-divergence compared to the original VAE. It should be pointed out that the differences are not very pronounced because we use a simple VAE model in the experiments. In future work, we will adopt more advanced VAE models to improve the performance.

Figure 8: Examples of images generated by different methods, together with the ground truth. Panels: (a) Ground truth. (b) Original VAE. (c) ControlVAE-KL-200. (d) ControlVAE-KL-170. From the images marked with blue rectangles, we can see that ControlVAE-KL-200 (KL=200) reconstructs the woman’s mouth opening (first row, last column), the man’s smile with teeth (second row, fourth column), and the woman’s ear (third row, fourth column) better than the original VAE, based on the ground-truth data in (a).