1 Introduction
This paper proposes a novel controllable variational autoencoder, ControlVAE, that leverages automatic control to precisely control the tradeoff between data reconstruction accuracy bounds (from a learned latent representation) and application-specific constraints, such as output diversity or disentangled latent factor representation. Specifically, a controller is designed that stabilizes the value of KL-divergence (between the learned approximate distribution of the latent variables and their true distribution) in the VAE's objective function to achieve the desired tradeoff, thereby improving application-specific performance metrics of several existing VAE models.
The work is motivated by the increasing popularity of VAEs as an unsupervised generative modeling framework that learns an approximate mapping between Gaussian latent variables and data samples when the true latent variables have an intractable posterior distribution Sohn et al. (2015); Kingma and Welling (2013). Since VAEs can directly work with both continuous and discrete input data Kingma and Welling (2013), they have been widely adopted in various applications, such as image generation Yan et al. (2016); Liu et al. (2017), dialog generation Wang et al. (2019); Hu et al. (2017), and disentangled representation learning Higgins et al. (2017); Kim and Mnih (2018).
Popular VAE applications often involve a tradeoff between reconstruction accuracy bounds and some other application-specific goal, effectively manipulated through KL-divergence. For example, in (synthetic) text or image generation, a goal is to produce new original text or images, as opposed to reproducing one of the samples in training data. In text generation, if KL-divergence is too low, output diversity is reduced Bowman et al. (2015), which is known as the KL-vanishing problem. To increase output diversity, it becomes advantageous to artificially increase KL-divergence. The resulting approximation was shown to produce more diverse, yet still authentic-looking outputs. Conversely, disentangled representation learning Denton and others (2017) leverages the observation that KL-divergence in the VAE constitutes an upper bound on information transfer through the latent channels per data sample Burgess et al. (2018). Artificially decreasing KL-divergence (e.g., by increasing its weight in a VAE's objective function, which is known as the $\beta$-VAE) therefore imposes a stricter information bottleneck, which was shown to force the learned latent factors to become more independent (i.e., non-redundant), leading to better disentangling. The above examples suggest that a useful extension of VAEs is one that allows users to exercise explicit control over KL-divergence in the objective function. ControlVAE realizes this extension.
We apply ControlVAE to three different applications: language modeling, disentangling, and image generation. Evaluation results on real-world datasets demonstrate that ControlVAE is able to achieve an adjustable tradeoff between reconstruction error and KL-divergence. It can discover more disentangled factors and significantly reduce the reconstruction error compared to the $\beta$-VAE Burgess et al. (2018) for disentangling. For language modeling, it can not only completely avert the KL-vanishing problem, but also improve the diversity of generated data. Finally, we also show that ControlVAE improves the quality of synthetic image generation by slightly increasing the value of KL-divergence compared with the original VAE.
2 Preliminaries
The objective function of VAEs consists of two terms: log-likelihood and KL-divergence. The first term tries to reconstruct the input data, while KL-divergence has the desirable effect of keeping the representation of input data sufficiently diverse. In particular, KL-divergence can affect both the reconstruction quality and diversity of generated data. If the KL-divergence is too high, it would affect the accuracy of generated samples. If it is too low, output diversity is reduced, which may be a problem in some applications such as language modeling Bowman et al. (2015) (where it is known as the KL-vanishing problem).
To mitigate KL vanishing, one promising way is to add an extra hyperparameter $\beta$ in the VAE objective function to control the KL-divergence by increasing $\beta$ from 0 to 1 with a sigmoid function or a cyclic function Liu et al. (2019). These methods, however, blindly change $\beta$ without sampling the actual KL-divergence during model training. Using a similar methodology, researchers recently developed a new $\beta$-VAE Higgins et al. (2017); Burgess et al. (2018) to learn disentangled representations by controlling the value of KL-divergence. However, $\beta$-VAE suffers from high reconstruction errors Kim and Mnih (2018), because it adds a very large $\beta$ in the VAE objective so the model tends to focus disproportionately on optimizing the KL term. In addition, its hyperparameter $\beta$ is fixed during model training, missing the chance of balancing the reconstruction error and KL-divergence.
The core technical challenge responsible for the above application problems lies in the difficulty of tuning the weight of the KL-divergence term during model training. Inspired by control systems, we fix this problem using feedback control. Our controllable variational autoencoder is illustrated in Fig. 1. It samples the output KL-divergence at each training step $t$, and feeds it into an algorithm that tunes the hyperparameter, $\beta(t)$, accordingly, aiming to stabilize KL-divergence at a desired value, called the set point.
We further design a nonlinear PI controller, a variant of the PID control algorithm Åström et al. (2006), to tune the hyperparameter $\beta(t)$. PID control is the basic and most prevalent form of feedback control in a large variety of industrial Åström et al. (2006) and software performance control Hellerstein et al. (2004) applications. The basic idea of the PID algorithm is to calculate an error, $e(t)$, between a set point (in this case, the desired KL-divergence) and the current value of the controlled variable (in this case, the actual KL-divergence), then apply a correction in a direction that reduces that error. The correction is applied to some intermediate directly accessible variable (in our case, $\beta(t)$) that influences the value of the variable we ultimately want to control (KL-divergence). In general, the correction computed by the controller is the weighted sum of three terms: one changes with error (P), one changes with the integral of error (I), and one changes with the derivative of error (D). In a nonlinear controller, the changes can be described by nonlinear functions. Note that, since derivatives essentially compute the slope of a signal, when the signal is noisy, the slope often responds more to variations induced by noise. Hence, following established best practices in control of noisy systems, we do not use the derivative (D) term in our specific controller. Next, we introduce VAEs and our objective in more detail.
2.1 The Variational Autoencoder (VAE)
Suppose that we have a dataset $\mathbf{x}$ of i.i.d. samples that are generated by the ground-truth latent variable $\mathbf{z}$, interpreted as the representation of the data. Let $p_\theta(\mathbf{x}|\mathbf{z})$ denote a probabilistic decoder with a neural network to generate data $\mathbf{x}$ given the latent variable $\mathbf{z}$. The distribution of representation corresponding to the dataset $\mathbf{x}$ is approximated by the variational posterior, $q_\phi(\mathbf{z}|\mathbf{x})$, which is produced by an encoder with a neural network. The Variational Autoencoder (VAE) Rezende et al. (2014); Kingma and Welling (2013) has been one of the most popular generative models. The basic idea of VAE can be summarized as follows: (1) VAE encodes the input data samples $\mathbf{x}$ into a latent variable $\mathbf{z}$ as its distribution of representation via a probabilistic encoder, which is parameterised by a neural network. (2) It then adopts the decoder to reconstruct the original input data based on samples of $\mathbf{z}$. VAE tries to maximize the marginal likelihood of the reconstructed data, but this involves intractable posterior inference. Thus, researchers adopt backpropagation and stochastic gradient descent Kingma and Welling (2013) to optimize its variational lower bound of log likelihood Kingma and Welling (2013):
$$\mathcal{L}_{vae} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) \tag{1}$$
where $p(\mathbf{z})$ is the prior distribution, e.g., standard Gaussian. $p_\theta(\mathbf{x}|\mathbf{z})$ and $q_\phi(\mathbf{z}|\mathbf{x})$ denote the distributions parameterized by neural networks with the corresponding parameters $\theta$ and $\phi$, respectively. The first term in (1) is the reconstruction term, while the second term is the KL-divergence. In addition, a reparameterization trick is used to calculate the gradient of the lower bound with respect to $\phi$ Ivanov et al. (2018). It is defined by $\mathbf{z} = \mu_\phi(\mathbf{x}) + \sigma_\phi(\mathbf{x}) \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
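As an illustration of the two terms of the variational lower bound and of the reparameterization trick, the following is a minimal sketch in plain Python. It assumes a diagonal Gaussian posterior, a standard Gaussian prior, and a unit-variance Gaussian decoder (so the log-likelihood reduces to a negative squared error up to a constant). The function names are hypothetical and are not taken from the authors' implementation.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def elbo(x, x_recon, mu, log_var):
    """Lower bound for one sample: reconstruction term minus KL term.

    Assumption: unit-variance Gaussian decoder, additive constants dropped.
    """
    log_lik = -0.5 * sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return log_lik - gaussian_kl(mu, log_var)

# Usage: one stochastic latent sample for a two-dimensional posterior.
z = reparameterize([0.0, 1.0], [0.0, -2.0], random.Random(0))
```

Because the noise is drawn independently of the encoder outputs, gradients can flow through `mu` and `log_var` when the same computation is expressed in an automatic differentiation framework.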
However, the basic VAE models cannot explicitly control the KL-divergence to a specified value. They also often suffer from KL vanishing (in language modeling Bowman et al. (2015); Liu et al. (2019)), which means the KL-divergence becomes zero during optimization. To remedy this issue, one popular way is to add a hyperparameter $\beta$ on the KL term Bowman et al. (2015); Liu et al. (2019), and then gradually increase it from 0 to 1. However, the existing methods, such as KL cost annealing and cyclical annealing Bowman et al. (2015); Liu et al. (2019), cannot totally avert KL vanishing because they blindly vary the hyperparameter during model training.
2.2 $\beta$-VAE
$\beta$-VAE Higgins et al. (2017); Chen et al. (2018a) is an extension to the basic VAE framework, often used as an unsupervised method for learning a disentangled representation of the data generative factors. A disentangled representation, according to the literature Bengio et al. (2013), is defined as one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors. Compared to the original VAE, $\beta$-VAE adds an extra hyperparameter $\beta$ as a weight of KL-divergence in the original VAE objective (1). It can be expressed by
$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \beta\, D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) \tag{2}$$
In order to discover more disentangled factors, researchers further put a constraint on total information capacity, $C$, to control the capacity of the information bottleneck (KL-divergence) Burgess et al. (2018). Then the Lagrangian method is adopted to solve the following optimization problem:
$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \beta\, \left| D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) - C \right| \tag{3}$$
where $\beta$ is a large hyperparameter (e.g., 100).
However, one drawback of $\beta$-VAE is that it obtains good disentangling at the cost of reconstruction quality. When the weight $\beta$ is large, the optimization algorithm tends to optimize the second term in (3), leading to a high reconstruction error.
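To make the effect of a large weight concrete, the sketch below writes the two objectives as losses to be minimized (the negated lower bounds), assuming the weighted form $\text{recon} + \beta \cdot KL$ for (2) and the capacity-constrained form $\text{recon} + \beta \cdot |KL - C|$ for (3). The function names and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
def beta_vae_loss(recon_loss, kl, beta):
    """Negated objective (2): reconstruction loss plus beta-weighted KL."""
    return recon_loss + beta * kl

def capacity_loss(recon_loss, kl, beta, capacity):
    """Negated objective (3): penalize the distance of KL from the capacity C."""
    return recon_loss + beta * abs(kl - capacity)

# Usage (illustrative numbers): with beta = 100, a deviation of 2 nats from the
# capacity contributes 200 to the loss and dwarfs a reconstruction loss of 50.
loss = capacity_loss(recon_loss=50.0, kl=20.0, beta=100.0, capacity=18.0)
```

With a fixed large weight, the penalty term dominates the gradient whenever the KL-divergence is away from its target, which is consistent with the high reconstruction error described above.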
The above background suggests that a common challenge in applying VAEs (and their extensions) lies in appropriate weight allocation between the reconstruction accuracy and KL-divergence in the VAE objective function. As mentioned earlier, we solve this using a nonlinear PI controller that manipulates the value of the non-negative hyperparameter, $\beta(t)$. This algorithm is described next.
3 The ControlVAE Algorithm
During model training, we sample the output KL-divergence, which we denote by $\hat{v}_{kl}(t)$, at training step $t$. The sampled KL-divergence is then compared to the set point, $v_{kl}$, and the difference, $e(t) = v_{kl} - \hat{v}_{kl}(t)$, is then used as the feedback to a controller to calculate the hyperparameter $\beta(t)$. ControlVAE can be expressed by the following variational lower bound:
$$\mathcal{L} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \beta(t)\, D_{KL}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) \tag{4}$$
When KL-divergence drops below the set point, the controller counteracts this change by reducing the hyperparameter $\beta(t)$ (to reduce the penalty for KL-divergence in the objective function (4)). The reduced weight, $\beta(t)$, allows KL-divergence to grow, thus approaching the set point again. Conversely, when KL-divergence grows above the set point, the controller increases $\beta(t)$ (up to a certain value), thereby increasing the penalty for KL-divergence and forcing it to decrease. This effect is achieved by computing $\beta(t)$ using Equation (5), below, which is an instance of nonlinear PI control:
$$\beta(t) = \frac{K_p}{1 + \exp(e(t))} - K_i \sum_{j=0}^{t} e(j) + \beta_{\min} \tag{5}$$
where $K_p$ and $K_i$ are constants. The first term (on the right hand side) ranges between $0$ and $K_p$ thanks to the exponential function $\exp(\cdot)$. Note that when the error is large and positive (KL-divergence is below the set point), the first term approaches 0, leading to a lower $\beta(t)$ that encourages KL-divergence to grow. Conversely, when the error is large and negative (KL-divergence is above the set point), the first term approaches its maximum (which is $K_p$), leading to a higher $\beta(t)$ that encourages KL-divergence to shrink.
The second term of the controller sums (integrates) past errors with a sampling period $T$ (one training step in this paper). This creates a progressively stronger correction (until the sign of the error changes). The negative sign ensures that while errors remain positive (i.e., when KL-divergence is below the set point), this term continues to decrease, whereas while errors remain negative (i.e., when KL-divergence is above the set point), this term continues to increase. In both cases, the change forces $\beta(t)$ in a direction that helps KL-divergence approach the set point. In particular, note that when the error becomes zero, the second term (and thus the entire right hand side) stops changing, allowing the controller output, $\beta(t)$, to stay at the same value that hopefully caused the zero error in the first place. This allows the controller to "lock in" the value of $\beta(t)$ that meets the KL-divergence set point. Finally, $\beta_{\min}$ is an application-specific constant. It effectively shifts the range within which $\beta(t)$ is allowed to vary. This PI controller is illustrated in Fig. 2.
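The behavior of the two controller terms can be checked numerically. The sketch below assumes the controller output has the form $\beta(t) = K_p / (1 + \exp(e(t))) - K_i \sum_j e(j) + \beta_{\min}$, which matches the description above; the function name `pi_beta` and the argument cap are hypothetical choices made for this illustration.

```python
import math

def pi_beta(errors, k_p, k_i, beta_min):
    """Controller output for the error history errors = [e(0), ..., e(t)].

    Assumed form: K_p / (1 + exp(e(t))) - K_i * sum of past errors + beta_min.
    """
    e_t = errors[-1]
    # Cap the exponent so math.exp cannot overflow for very large positive errors.
    p_term = k_p / (1.0 + math.exp(min(e_t, 700.0)))
    i_term = -k_i * sum(errors)
    return p_term + i_term + beta_min

# Usage: KL below the set point (positive error) yields a smaller weight than
# KL above the set point (negative error).
low = pi_beta([5.0], k_p=0.01, k_i=0.0, beta_min=0.0)
high = pi_beta([-5.0], k_p=0.01, k_i=0.0, beta_min=0.0)
```

Recomputing the sum at every call keeps the sketch close to the formula; a training loop would normally keep a running integral instead.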
3.1 PI Parameter Tuning for ControlVAE
One challenge of applying the PI control algorithm lies in how to tune its parameters, $K_p$ and $K_i$, effectively. While optimal tuning of nonlinear controllers is non-trivial, in this paper we follow a very simple rule: tune these constants to ensure that reactions to errors are sufficiently smooth to allow gradual convergence. Let us first consider the coefficient $K_p$. Observe that the maximum (positive) error occurs when the actual KL-divergence is close to zero. In this case, if $v_{kl}$ is the set point on KL-divergence, then the error, $e(t)$, is approximated by $e(t) \approx v_{kl}$. When KL-divergence is too small, the VAE does not learn useful information from input data Liu et al. (2019). We need to assign $\beta(t)$ a very small non-negative value, so that KL-divergence is encouraged to grow (when the resulting objective function is optimized). In other words, temporarily ignoring other terms in Equation (5), the contribution of the first term alone should be sufficiently small:
$$\frac{K_p}{1 + \exp(v_{kl})} \le \epsilon \tag{6}$$
where $\epsilon$ is a small constant. The above (6) can also be rewritten as $K_p \le (1 + \exp(v_{kl}))\,\epsilon$. Empirically, we choose a value of $K_p$ that satisfies this constraint and leads to good performance.
Conversely, when the actual KL-divergence is much larger than the desired value $v_{kl}$, the error $e(t)$ becomes a large negative value. As a result, the first term in (5) becomes close to a constant, $K_p$. If the resulting larger value of $\beta(t)$ is not enough to cause KL-divergence to shrink, one needs to gradually continue to increase $\beta(t)$. This is the job of the second term. The negative sign in front of that term ensures that when negative errors continue to accumulate, the positive output $\beta(t)$ continues to increase. Since it takes many steps to train deep VAE models, the increase per step should be very small, favoring smaller values of $K_i$. Empirically, we found that a sufficiently small value of $K_i$ stabilizes the training. Note that $K_i$ should not be too small either, because it would then unnecessarily slow down the convergence.
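As a worked example of the tuning rule for the proportional coefficient, the sketch below computes the largest $K_p$ for which the first controller term stays below a tolerance when the error is approximately equal to the set point, assuming the constraint has the form $K_p / (1 + \exp(\text{set point})) \le \epsilon$. The helper name and the numeric values (set point 3, tolerance $10^{-3}$) are hypothetical and used only for illustration.

```python
import math

def max_k_p(set_point, eps):
    """Largest K_p satisfying K_p / (1 + exp(set_point)) <= eps."""
    return (1.0 + math.exp(set_point)) * eps

# Usage (illustrative values): a set point of 3 nats and a tolerance of 1e-3
# give an upper bound of roughly 0.021 on K_p.
bound = max_k_p(set_point=3.0, eps=1e-3)
```

Because the bound grows exponentially with the set point, the constraint is most restrictive for small desired KL-divergence values.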
3.2 Set Point Guidelines for ControlVAE
For generative models, human graders are often needed to evaluate the generated samples, so it is very hard to get the optimal set point of KL-divergence for ControlVAE. Nevertheless, rules of thumb may apply from a controllability perspective. Note that, as we alluded to earlier, $\beta_{\min}$ is application-specific. In general, when $\beta(t) \in [\beta_{\min}, \beta_{\max}]$, the upper bound of the expected KL-divergence is the value of KL-divergence as ControlVAE converges when $\beta(t) = \beta_{\min}$, denoted by $v_{\max}$. Similarly, its lower bound, $v_{\min}$, can be defined as the KL-divergence produced by ControlVAE when $\beta(t) = \beta_{\max}$. For feedback control to be most effective (i.e., not run against the above limits), the KL-divergence set point should be somewhere in the middle between these extremes. The closer it is to an extreme, the worse the controllability is in one of the directions. Finally, if the set point is outside the interval $[v_{\min}, v_{\max}]$, then manipulating $\beta(t)$ within the interval $[\beta_{\min}, \beta_{\max}]$ will be ineffective at maintaining KL-divergence at that set point.
3.3 Summary of the PI Control Algorithm
We summarize the proposed PI control algorithm in Algorithm 1. Our PI algorithm updates the hyperparameter, $\beta(t)$, with the feedback from the sampled KL-divergence at training step $t$. The algorithm first computes the error between the desired KL-divergence, $v_{kl}$, and the sampled $\hat{v}_{kl}(t)$. It then calculates the P term and the I term of the PI algorithm, respectively. Note that Lines 10 and 11 implement a popular constraint in PID/PI design, called anti-windup (Azar and Serrano, 2015; Peng et al., 1996). It effectively disables the integral term of the controller when the controller output gets out of range, so as not to exacerbate the out-of-range deviation. The hyperparameter $\beta(t)$ is then calculated by the PI algorithm in (5). Finally, the last lines limit $\beta(t)$ to a certain range, $[\beta_{\min}, \beta_{\max}]$.
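The update summarized above can be sketched as a small stateful controller. This is an illustrative sketch under stated assumptions, not the authors' Algorithm 1: the class name is hypothetical, and anti-windup is implemented here as conditional integration (the integral is frozen only when the previous unclamped output was out of range and the new increment would push it further out), which is one common variant of the technique.

```python
import math

class PIController:
    """Nonlinear PI controller for the KL weight, with anti-windup and clamping."""

    def __init__(self, k_p, k_i, beta_min, beta_max):
        self.k_p, self.k_i = k_p, k_i
        self.beta_min, self.beta_max = beta_min, beta_max
        self.i_term = 0.0          # running integral term
        self.last_raw = beta_min   # previous output before clamping

    def step(self, set_point, sampled_kl):
        error = set_point - sampled_kl
        # P term lies in (0, k_p); the exponent is capped to avoid overflow.
        p_term = self.k_p / (1.0 + math.exp(min(error, 700.0)))
        delta = -self.k_i * error
        # Anti-windup (assumed variant): skip integration only if it would
        # deepen an existing out-of-range condition.
        too_high = self.last_raw > self.beta_max and delta > 0
        too_low = self.last_raw < self.beta_min and delta < 0
        if not (too_high or too_low):
            self.i_term += delta
        self.last_raw = p_term + self.i_term + self.beta_min
        # Limit the output to [beta_min, beta_max].
        return min(max(self.last_raw, self.beta_min), self.beta_max)

# Usage (illustrative values): call once per training step with the sampled KL.
controller = PIController(k_p=0.01, k_i=0.001, beta_min=0.0, beta_max=1.0)
beta_t = controller.step(set_point=3.0, sampled_kl=10.0)
```

Freezing the integral while the output is saturated keeps it from accumulating a large backlog, so the weight can leave the limit promptly once the sign of the error changes.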
3.4 Applications of ControlVAE
As a preliminary demonstration of the general applicability of the above approach and as an illustration of its customizability, we apply ControlVAE to three different applications stated below.

Language modeling: We first apply ControlVAE to solve the KL vanishing problem while also improving the diversity of generated data. As mentioned in Section 2.1, VAE models often suffer from KL vanishing in language modeling. The existing methods cannot completely solve the KL vanishing problem because they blindly change $\beta$ in the VAE objective without monitoring the output KL-divergence during model training. In this paper, we adopt ControlVAE to control KL-divergence to a specified value to avoid KL vanishing using the output KL-divergence. We set the coefficients $K_p$ and $K_i$ of the PI algorithm in (5) following the PI tuning strategy in Section 3.1. In addition, we fix $\beta_{\min}$ and limit the maximum value of $\beta(t)$.

Disentangling: We then apply the ControlVAE model to achieve a better tradeoff between reconstruction quality and disentangling. As mentioned in Section 2.2, $\beta$-VAE assigns a large hyperparameter to the objective function to control the KL-divergence (information bottleneck), which, however, leads to a large reconstruction error. To mitigate this issue, we adopt ControlVAE to automatically adjust the hyperparameter $\beta(t)$ based on the output KL-divergence during model training. Using a methodology similar to Burgess et al. (2018), we train a single model by gradually increasing the KL-divergence from a small initial value to a desired value $C$. However, different from $\beta$-VAE, which increases $C$ linearly, we adopt a step function that increases $C$ after every fixed number of training steps in order to stabilize model training. We set $\beta_{\min}$ for the PI algorithm in (5) accordingly, and choose the coefficients $K_p$ and $K_i$ following the PI tuning method above.

Image generation: The basic VAE models tend to produce blurry and unrealistic samples for image generation Zhao et al. (2017a). In this paper, we try to leverage ControlVAE to manipulate (slightly increase) the value of KL-divergence to improve the reconstruction quality of generated images. Different from the original VAE ($\beta = 1$), we let the hyperparameter, $\beta(t)$, vary over a range in our ControlVAE model. Given a desired KL-divergence, ControlVAE can automatically tune $\beta(t)$ within that range. For this task, we use the same PI control algorithm and hyperparameters as in language modeling above.
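The step-wise set point schedule described for the disentangling application can be sketched as follows. The function name, argument names, and the numeric values in the usage line are hypothetical; the source does not state the starting value, the increment, or the number of steps between increases in this section.

```python
def step_set_point(step, start, target, increment, period):
    """Piecewise-constant KL set point: raise it by `increment` every `period`
    training steps, starting at `start` and never exceeding `target`."""
    value = start + increment * (step // period)
    return min(value, target)

# Usage (illustrative values): the set point stays at 1.0 for the first 1000
# steps, then rises in steps of 0.5 until it reaches the desired value of 18.
set_point = step_set_point(step=2500, start=1.0, target=18.0, increment=0.5, period=1000)
```

Holding the set point constant between increases gives the controller time to settle before the target moves again, which is the stated motivation for preferring a step function over a linear ramp.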
4 Experiments
We evaluate the performance of ControlVAE on real-world datasets in the three different applications mentioned above. Source code will be publicly available upon publication.
Methods/metric  Dis-1  Dis-2  Dis-3  Dis-4
ControlVAE-KL-35  6.27K ± 41  95.86K ± 1.02K  274.36K ± 3.02K  405.65K ± 3.94K
ControlVAE-KL-25  6.10K ± 60  83.15K ± 4.00K  244.29K ± 12.11K  385.46K ± 15.79K
Cost annealing (KL = 17)  5.71K ± 87  69.60K ± 1.53K  208.62K ± 4.04K  347.65K ± 5.85K
Cyclical (KL = 21.5)  5.79K ± 81  71.63K ± 2.04K  211.29K ± 6.38K  345.17K ± 11.65K
4.1 Datasets
The datasets used for our experiments are introduced below.


Language modelling: 1) Penn Tree Bank (PTB) Marcus et al. (1993): it is split into training, validation, and testing sentences. 2) Switchboard (SW) Godfrey and Holliman (1997): it consists of two-sided telephone conversations with manually transcribed speech and alignment. The dialogs are randomly split into training, validation, and testing sets.

Image generation: 1) CelebA (cropped version) Liu et al. (2015): it consists of RGB images of celebrity faces. The data is split into training and testing images.
4.2 Model Configurations
The detailed model configurations and hyperparameter settings for each model are presented in Appendix A.
4.3 Evaluation on Language Modeling
First, we compare the performance of ControlVAE with the following baselines for mitigating KL vanishing in text generation Bowman et al. (2015). Cost annealing Bowman et al. (2015): This method gradually increases the hyperparameter on KL-divergence from 0 to 1 over a fixed number of training steps using a sigmoid function. Cyclical annealing Liu et al. (2019): This method splits the training process into several cycles, each of which increases the hyperparameter from 0 to 1 using a linear function.
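For reference, the two open-loop baseline schedules can be sketched as below. Both compute the KL weight from the step counter alone and never look at the measured KL-divergence. The function names, the sigmoid parameters, and the `ramp_fraction` argument (the share of each cycle spent ramping before the weight is held at 1) are assumptions made for this illustration rather than settings reported here.

```python
import math

def sigmoid_annealing(step, midpoint, steepness):
    """Cost annealing: weight rises from about 0 to about 1 along a sigmoid in `step`."""
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))

def cyclical_annealing(step, total_steps, n_cycles, ramp_fraction=0.5):
    """Cyclical annealing: within each cycle the weight ramps linearly from 0 to 1
    over the first `ramp_fraction` of the cycle, then stays at 1."""
    cycle_len = total_steps / n_cycles
    position = (step % cycle_len) / cycle_len  # position within the cycle, in [0, 1)
    return min(position / ramp_fraction, 1.0)

# Usage (illustrative values): weight at step 5 of a 100-step run with 4 cycles.
beta_cyc = cyclical_annealing(step=5, total_steps=100, n_cycles=4)
```

Neither schedule uses feedback, which is the sense in which these methods change the weight blindly.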
Fig. 3 illustrates the comparison results of KL-divergence, reconstruction loss and hyperparameter $\beta(t)$ for different methods on the PTB dataset. Note that here ControlVAE-KL-$v$ means we set the KL-divergence to a desired value $v$ (e.g., 3) for our PI controller following the set point guidelines in Section 3.2. Cost-annealing-$v$ means we increase the hyperparameter, $\beta$, from 0 to 1 after $v$ steps. We observe from Fig. 3(a) that ControlVAE (KL=1.5, 3) and cyclical annealing can avert KL vanishing. However, ControlVAE is able to stabilize the KL-divergence while cyclical annealing cannot. Moreover, our method has a lower reconstruction loss than cyclical annealing in Fig. 3(b). The cost annealing method still suffers from KL vanishing, because we use the Transformer Vaswani et al. (2017) as the decoder, which can predict the current data based on previous ground-truth data. Fig. 3(c) illustrates the tuning result of $\beta(t)$ by ControlVAE compared with other methods. We can see that our $\beta(t)$ gradually converges to around a certain value. Note that here $\beta(t)$ of ControlVAE does not converge to 1 because we slightly increase the value of KL-divergence (produced by the original VAE) in order to improve the diversity of generated data.
To further demonstrate that ControlVAE can improve the diversity of generated text, we apply it to dialog-response generation using the Switchboard (SW) dataset. Following Zhao et al. (2017b), we adopt a conditional VAE Zhao et al. (2017b) that generates dialog conditioned on the previous response. According to the literature Xu et al. (2018), the metric Dis-$n$, the number of distinct $n$-grams, is used to measure the diversity of generated data. Table 1 illustrates the comparison results for different approaches. We can see that ControlVAE has more distinct $n$-grams than cost annealing and cyclical annealing when the desired KL-divergence is set to 35 and 25. Thus, we can conclude that ControlVAE can improve the diversity of generated data by slightly increasing the KL-divergence of the original VAE. We also illustrate some examples of generated dialog by ControlVAE in Appendix B.
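The diversity metric described above, the number of distinct $n$-grams across generated outputs, can be computed as in the following sketch. The function name is hypothetical and the input is assumed to be a list of already tokenized sentences; any tokenization or normalization used in the actual evaluation is not specified here.

```python
def distinct_n(sentences, n):
    """Count the distinct n-grams over a collection of tokenized sentences."""
    grams = set()
    for tokens in sentences:
        # Slide a window of length n over the sentence; short sentences yield nothing.
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return len(grams)

# Usage: two toy responses share the bigram ("i", "am"), which is counted once.
responses = [["i", "am", "fine"], ["i", "am", "here"]]
dis_2 = distinct_n(responses, 2)
```

A larger count for the same number of generated responses indicates less repetition across outputs.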
Figure 4 (caption): ControlVAE (KL=16, 18) has lower reconstruction errors and variance compared to $\beta$-VAE. (c) shows an example of the disentangled factors in the latent variable as the total KL-divergence increases for ControlVAE (KL=18). Each curve with positive KL-divergence (except the black one) represents one factor disentangled by ControlVAE.

4.4 Evaluation on Disentangled Representations
We then evaluate the performance of ControlVAE on the learning of disentangled representations using 2D Shapes data. We compare it with two baselines: FactorVAE Kim and Mnih (2018) and $\beta$-VAE Burgess et al. (2018).
Metric  ControlVAE (KL=16)  ControlVAE (KL=18)  $\beta$-VAE  FactorVAE
MIG  0.5519 ± 0.0323  0.5146 ± 0.0199  0.5084 ± 0.0476  0.5139 ± 0.0428
Fig. 4(a) and (b) show the comparison of reconstruction error and the hyperparameter $\beta(t)$ (over multiple random seeds) for different models. We can observe from Fig. 4(a) that ControlVAE (KL=16, 18) has lower reconstruction error and variance than the baselines. This is because ControlVAE automatically adjusts the hyperparameter, $\beta(t)$, to stabilize the KL-divergence, while the other two methods keep the hyperparameter unchanged during model training. Specifically, for ControlVAE (KL=18), the hyperparameter $\beta(t)$ is high in the beginning in order to obtain good disentangling, and then it gradually drops as the training converges, as shown in Fig. 4(b). In contrast, $\beta$-VAE has a large and fixed weight on the KL-divergence so that its optimization algorithm tends to optimize the KL-divergence term, leading to a large reconstruction error. In addition, Fig. 4(c) illustrates an example of KL-divergence per factor in the latent code as training progresses and the total information capacity (KL-divergence) increases. We can see that ControlVAE disentangles all five generative factors, starting from the positional latents ($x$ and $y$) to scale, followed by orientation and then shape.
To further demonstrate that ControlVAE can achieve better disentangling, we use a disentanglement metric, mutual information gap (MIG) Chen et al. (2018a), to compare the models, as shown in Table 2. It can be observed that ControlVAE (KL=16) has a higher MIG score and lower variance than the two baselines. Besides, we show the qualitative results of different models in Fig. 5. We can observe that ControlVAE can discover all five generative factors: positional latents ($x$ and $y$), scale, orientation and shape. However, $\beta$-VAE disentangles four generative factors but entangles scale and shape together (in the third row), while FactorVAE does not disentangle the orientation factor very well in the fourth row in Fig. 5. Thus, ControlVAE achieves better reconstruction quality and disentangling than the baselines.
4.5 Evaluation on Image Generation
Finally, we compare the reconstruction quality of image generation for ControlVAE and the original VAE. Fig. 6 shows the comparison of reconstruction error and KL-divergence under different desired values of KL-divergence over multiple random seeds. We can see from Fig. 6(a) that ControlVAE-KL-200 (KL=200) has the lowest reconstruction error among them. In addition, when we set the desired KL-divergence to 170 (same as the basic VAE in Fig. 6(b)), ControlVAE has the same reconstruction error as the original VAE. At that point, ControlVAE becomes the original VAE as $\beta(t)$ finally converges to 1, as shown in Fig. 7 in Appendix C.
We further adopt two commonly used metrics for image generation, FID Lucic et al. (2018) and SSIM Chen et al. (2018b), to evaluate the performance of ControlVAE in Table 3. It can be observed that ControlVAE-KL-200 outperforms the other methods in terms of FID and SSIM. Therefore, ControlVAE can improve the reconstruction quality of generated images by controlling the value of KL-divergence. We also show some generated images in Appendix D to verify that ControlVAE has better reconstruction quality than the basic VAE.
Methods/metric  FID  SSIM
ControlVAE-KL-200  55.16 ± 0.187  0.687 ± 0.0002
ControlVAE-KL-180  57.57 ± 0.236  0.679 ± 0.0003
ControlVAE-KL-170  58.75 ± 0.286  0.675 ± 0.0001
Original VAE  58.71 ± 0.207  0.675 ± 0.0001
5 Related Work
In language modeling, VAE often suffers from KL vanishing, due to a powerful decoder, such as Transformer Vaswani et al. (2017) and LSTM. In recent years, researchers have developed many methods, such as the KL cost annealing method Bowman et al. (2015), cyclical annealing and the dilated CNN decoder Yang et al. (2017), to tackle this problem. However, these methods cannot totally solve the KL vanishing issue or explicitly control the KL-divergence. In contrast, our approach can avert KL vanishing by using a PI control algorithm to automatically tune the hyperparameter in the objective based on the output KL-divergence.
Recently, researchers proposed a novel modification of VAE, called $\beta$-VAE Higgins et al. (2017), to learn disentangled representations. They assigned a large value to the hyperparameter $\beta$ to disentangle the generative factors. However, $\beta$-VAE sacrifices reconstruction quality in order to obtain better disentangling. Researchers then developed other models, such as FactorVAE Kim and Mnih (2018); Kim et al. (2019) and $\beta$-TCVAE Chen et al. (2018a), to improve the reconstruction quality. However, the drawback of these methods is that they assign a fixed hyperparameter to the KL term or the decomposed terms in the objective. In contrast, ControlVAE can automatically tune the hyperparameter $\beta(t)$ during optimization to stabilize the KL-divergence, and it can also be used as a plug-in replacement of existing methods.
VAE and its variants are also applied to generate fake images, but the generated samples are blurry and unrealistic Zhao et al. (2017a). In order to improve its performance, researchers developed a new variational lossy autoencoder (VLAE) by borrowing the idea from autoregressive flow. However, VLAE is computationally expensive. Besides, researchers adopted constrained optimization for reconstruction Rezende and Viola (2018); Klushyn et al. (2019) to achieve the tradeoff between reconstruction error and KL-divergence. However, these methods may suffer from posterior collapse if the inference network fails to cover the latent space. Recent studies mainly adopt generative adversarial networks (GANs) Goodfellow et al. (2014); Zhu et al. (2017); Radford et al. (2015); Arjovsky et al. (2017) to improve the quality of generated images. However, it is very difficult to train GANs because they easily suffer from mode collapse. Different from existing methods, we add an additional hyperparameter on the KL term, and then leverage ControlVAE to manipulate the KL-divergence to reduce the reconstruction error.
6 Conclusion
In this paper, we proposed a general controllable VAE framework, ControlVAE, that combines automatic control with the basic VAE framework to improve the performance of VAE models. We designed a new nonlinear PI controller to control the value of KL-divergence during model training. We then evaluated ControlVAE on three different tasks. The results show that ControlVAE attains better performance: it improves the ability to disentangle latent factors, averts KL vanishing in language modeling, and improves the reconstruction quality for image generation. Other applications are a topic of the authors' future research.
References
 Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §5.
 Advanced pid control. Vol. 461, ISAThe Instrumentation, Systems, and Automation Society Research Triangle …. Cited by: §2.
 Design and modeling of anti wind up pid controllers. In Complex system modelling and control through intelligent soft computations, pp. 1–44. Cited by: §3.3.
 Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §2.2.
 Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349. Cited by: §A.1, §1, §2.1, §2, §4.3, §5.
 Understanding disentangling in betavae. arXiv preprint arXiv:1804.03599. Cited by: §1, §1, §2.2, §2, 2nd item, Figure 5, §4.4.
 Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §2.2, §4.4, §5.

Attentiongan for object transfiguration in wild images.
In
Proceedings of the European Conference on Computer Vision (ECCV)
, pp. 164–180. Cited by: §4.5.  Unsupervised learning of disentangled representations from video. In Advances in neural information processing systems, pp. 4414–4423. Cited by: §1.
 Switchboard1 release 2linguistic data consortium. SWITCHBOARD: A User’s Manual. Cited by: 1st item.
 Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §5.
 Feedback control of computing systems. John Wiley & Sons. Cited by: §2.
 Beta-vae: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6. Cited by: §A.2, §1, §2.2, §2, §5.
 Texar: a modularized, versatile, and extensible toolkit for text generation. In ACL 2019, System Demonstrations, Cited by: §A.1.

 Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1587–1596. Cited by: §1.
 Variational autoencoder with arbitrary conditioning. arXiv preprint arXiv:1806.02382. Cited by: §2.1.
 Disentangling by factorising. In International Conference on Machine Learning, pp. 2654–2663. Cited by: §1, §2, 2nd item, §4.4, §5.
 Relevance factor vae: learning and identifying disentangled factors. arXiv preprint arXiv:1902.01568. Cited by: §5.
 Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2.1.
 Learning hierarchical priors in vaes. In Advances in Neural Information Processing Systems, pp. 2866–2875. Cited by: §5.

 Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708. Cited by: §1.
 Cyclical annealing schedule: a simple approach to mitigating kl vanishing. arXiv preprint arXiv:1903.10145. Cited by: §2.1, §2, §3.1, §4.3.
 Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738. Cited by: 3rd item.
 Are gans created equal? a large-scale study. In Advances in neural information processing systems, pp. 700–709. Cited by: §4.5.
 Building a large annotated corpus of english: the penn treebank. Cited by: 1st item.
 Dsprites: disentanglement testing sprites dataset. URL https://github.com/deepmind/dsprites-dataset/. [Accessed on: 2018-05-08]. Cited by: 2nd item.
 Anti-windup, bumpless, and conditioned transfer techniques for pid controllers. IEEE Control systems magazine 16 (4), pp. 48–57. Cited by: §3.3.
 Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §5.
 Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: §2.1.
 Taming vaes. arXiv preprint arXiv:1810.00597. Cited by: §5.
 Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491. Cited by: §1.
 Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.3, §5.
 Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137. Cited by: §1.
 Dp-gan: diversity-promoting generative adversarial network for generating informative and diversified text. arXiv preprint arXiv:1802.01345. Cited by: §4.3.
 Attribute2image: conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791. Cited by: §1.
 Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3881–3890. Cited by: §5.
 Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658. Cited by: 3rd item, §5.
 Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960. Cited by: §A.1, §4.3.
 Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §5.
Appendix A Model Configurations and hyperparameter settings
We summarize the detailed model configurations and hyperparameter settings for ControlVAE in the three applications below. Our source code has been submitted to the review system.
A.1 Experimental Details for Language Modeling
For text generation on PTB data, we build the ControlVAE model on the basic VAE model, as in Bowman et al. [2015]. We use a one-layer LSTM as the encoder, a three-layer Transformer with eight heads as the decoder, and a Multi-Layer Perceptron (MLP) to learn the latent variable. The maximum sequence length for LSTM and Transformer is set to , respectively. The size of the latent variable is set to . We set the dimension of the word embedding to and the batch size to . In addition, the dropout is for LSTM and Transformer. Adam optimization with the learning rate is used during training. Following the tuning guidelines above, we set the coefficients K_p and K_i of the P term and I term to and , respectively. Finally, we adopt the source code on the Texar platform to implement the experiments Hu et al. [2019].
For dialog-response generation, we follow the model architecture and hyperparameters of the basic conditional VAE in Zhao et al. [2017b]. We use a one-layer bidirectional GRU as the encoder, a one-layer GRU as the decoder, and two fully-connected layers to learn the latent variable. In the experiment, the size of both the latent variable and the word embeddings is set to . The maximum length of the input/output sequence for GRU is set to with batch size . In addition, Adam with initial learning rate is used, and we set the same K_p and K_i for the PI algorithm as in text generation above. The model architectures of ControlVAE for these two NLP tasks are illustrated in Tables 4 and 5.
Table 4 (language modeling)
Encoder | Decoder
Input words | Input ,
1-layer LSTM | FC
FC | 3-layer Transformer, 8 heads

Table 5 (dialog-response generation)
Encoder | Decoder
Input words | Input
1-layer biGRU | FC
FC | 1-layer GRU
FC
A.2 Experimental Details for Disentangling
Following the same model architecture as β-VAE Higgins et al. [2017], we adopt a convolutional layer and a deconvolutional layer for our experiments. We use the Adam optimizer with , and a learning rate tuned from . We set K_p and K_i for the PI algorithm to and , respectively. For the step function, we set the step, , to per training steps as the information capacity (desired KL divergence) increases from until for the 2D Shape data. ControlVAE uses the same encoder and decoder architecture as β-VAE except for plugging in the PI control algorithm, as illustrated in Table 6.
Table 6 (disentangling)
Encoder | Decoder
Input binary image | Input
conv. ReLU. stride 2 | FC. 256 ReLU.
conv. ReLU. stride 2 | upconv. ReLU. stride 2
conv. ReLU. stride 2 | upconv. ReLU. stride 2.
conv. ReLU. stride 2 | upconv. ReLU. stride 2
conv. ReLU. stride 2 | upconv. ReLU. stride 2
FC . FC. | upconv. ReLU. stride 2
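As a rough illustration of the step function described in this subsection, the desired KL divergence can be written as a staircase that rises by a fixed increment at regular intervals until it reaches its final value. The sketch below is not the authors' implementation; the function name `desired_kl`, its parameter names, and the numbers in the usage note are hypothetical placeholders rather than the settings used in the experiments.

```python
def desired_kl(step, start, stop, increment, steps_per_increment):
    """Staircase set point for the KL divergence (hypothetical helper).

    The set point begins at `start`, rises by `increment` once every
    `steps_per_increment` training steps, and is capped at `stop`.
    """
    if steps_per_increment <= 0:
        raise ValueError("steps_per_increment must be positive")
    if increment < 0 or stop < start:
        raise ValueError("the schedule is assumed to be non-decreasing")
    # Number of increments that have been fully completed by this step.
    completed_increments = step // steps_per_increment
    return min(stop, start + increment * completed_increments)
```

For example, with the placeholder values `start=1.0`, `stop=4.0`, `increment=0.5`, and `steps_per_increment=100`, the set point stays at 1.0 for the first 100 steps, moves to 1.5 at step 100, and remains at 4.0 once the cap is reached.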
A.3 Experimental Details for Image Generation
Similar to the architecture of β-VAE, we use a convolutional layer with batch normalization as the encoder and a deconvolutional layer with batch normalization as the decoder for our experiments. We use the Adam optimizer with , and a learning rate for CelebA data. The size of the latent variable is set to , because we find it gives better reconstruction quality than and . In addition, we set the desired value of KL-divergence to (same as the original VAE), , and . For the PI control algorithm, we set K_p and K_i to and , respectively. We also use the same encoder and decoder architecture as β-VAE above except that we add batch normalization to improve the stability of model training, as shown in Table 7.
Table 7 (image generation)
Encoder | Decoder
Input RGB image | Input
conv. ReLU. stride 2 | FC. 256 ReLU.
conv. ReLU. stride 2 | upconv. ReLU. stride 2
conv. ReLU. stride 2 | upconv. ReLU. stride 2.
conv. ReLU. stride 2 | upconv. ReLU. stride 2
conv. ReLU. stride 2 | upconv. ReLU. stride 2
FC . FC. | upconv. ReLU. stride 2
Appendix B Examples of Generated Dialog by ControlVAE
In this section, we show an example to compare the diversity and relevance of dialog generated by different methods, as illustrated in Table 8. Alice begins with an open-ended conversation on choosing a college. Our model tries to predict the response from Bob. The ground-truth response is “um - hum”. We can observe from Table 8 that ControlVAE (KL=25, 35) can generate diverse responses that remain relevant to the ground truth. In addition, while cyclical annealing can generate diverse text, some of its responses are not very relevant to the ground-truth response.
Table 8
Context: (Alice) and a lot of the students in that home town sometimes unk the idea of staying and going to school across the street so to speak
Topic: Choosing a college
Target: (Bob) um - hum
ControlVAE-KL-25 | ControlVAE-KL-35 | Cost annealing (KL=17) | Cyclical anneal (KL=21.5)
yeah | uh - huh | oh yeah | yeah that’s true do you do you do it
um - hum | yeah | uh - huh | yeah
oh that’s right um - hum | oh yeah oh absolutely | right | um - hum
yes | right | uh - huh and i think we have to be together | yeah that’s a good idea
right | um - hum | oh well that’s neat yeah well | yeah i see it too,it’s a neat place
Appendix C β(t) of ControlVAE for Image Generation on CelebA data
Fig. 7 compares β(t) for different methods during model training. We can observe that β(t) finally converges to 1 when the desired value of KL-divergence is set to , same as the original VAE. At this point, ControlVAE becomes the original VAE. Thus, ControlVAE can be customized by users for different applications.
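To make the role of the KL weight β(t) more concrete, the following is a minimal sketch of a PI-style update for that weight. It assumes a sigmoid-shaped P term, an accumulated I term, and a conditional-integration form of anti-windup that freezes the I term whenever the weight saturates; the precise controller used in ControlVAE is the one defined in the main text and may differ in its details. The class name `PIController`, the bounds `beta_min` and `beta_max`, and all numeric values are hypothetical.

```python
import math


class PIController:
    """Hypothetical nonlinear PI controller for the KL weight beta(t)."""

    def __init__(self, k_p, k_i, beta_min=0.0, beta_max=1.0):
        self.k_p = k_p            # gain of the P term
        self.k_i = k_i            # gain of the I term
        self.beta_min = beta_min  # lower bound on the KL weight (assumed)
        self.beta_max = beta_max  # upper bound on the KL weight (assumed)
        self.i_term = 0.0         # accumulated I term
        self.beta = beta_min      # current KL weight

    def update(self, desired_kl, observed_kl):
        """Return the next KL weight given the latest observed KL divergence."""
        error = desired_kl - observed_kl
        # Sigmoid-shaped P term, bounded in (0, k_p). The exponent is clipped
        # only to keep math.exp from overflowing on very large errors.
        exponent = max(-50.0, min(50.0, error))
        p_term = self.k_p / (1.0 + math.exp(exponent))
        # Candidate I term: it grows while the observed KL exceeds the set point.
        candidate_i = self.i_term - self.k_i * error
        raw_beta = p_term + candidate_i + self.beta_min
        if self.beta_min <= raw_beta <= self.beta_max:
            # Not saturated: accept the integral update.
            self.i_term = candidate_i
        # If saturated, the I term stays frozen (anti-windup) and the
        # weight is clamped to its bounds.
        self.beta = min(self.beta_max, max(self.beta_min, raw_beta))
        return self.beta
```

In this sketch, an observed KL divergence above the set point pushes the weight up, which penalizes the KL term more strongly, and an observed value below the set point pushes it down. With `beta_max=1.0`, a weight that settles at 1 corresponds to the unweighted KL term of the original VAE objective.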
Appendix D Examples of Generated Images by VAE and ControlVAE
We also show some images generated by ControlVAE and the original VAE in Fig. 8. It can be observed that images generated by ControlVAE-KL-200 (KL = 200) have the best reconstruction quality, improving on the original VAE. Take the woman in the first row, last column as an example. The woman does not show her teeth in the ground-truth image. However, the woman generated by the original VAE smiles with her mouth open. In contrast, the woman generated by ControlVAE-KL-200 hardly shows her teeth when smiling. In addition, the other two examples marked with blue rectangles show that ControlVAE-KL-200 better reconstructs the “smile” of the man and the “ear” of the woman than the original VAE does. Therefore, we can conclude that ControlVAE can improve reconstruction quality by slightly increasing (controlling) the KL-divergence relative to the original VAE. It should be pointed out that the differences are not very pronounced because we use a simple VAE model in the experiments. For future work, we are going to adopt advanced VAE models to improve the performance.