This paper proposes a novel controllable variational autoencoder, ControlVAE, that leverages automatic control to precisely control the trade-off between data reconstruction accuracy bounds (from a learned latent representation) and application-specific constraints, such as output diversity or disentangled latent factor representation. Specifically, a controller is designed that stabilizes the value of KL-divergence (between the learned approximate distribution of the latent variables and their true distribution) in the VAE’s objective function to achieve the desired trade-off, thereby improving application-specific performance metrics of several existing VAE models.
The work is motivated by the increasing popularity of VAEs as an unsupervised generative modeling framework that learns an approximate mapping between Gaussian latent variables and data samples when the true latent variables have an intractable posterior distribution Sohn et al. (2015); Kingma and Welling (2013). Since VAEs can directly work with both continuous and discrete input data Kingma and Welling (2013), they have been widely adopted in various applications, such as image generation Yan et al. (2016); Liu et al. (2017), dialog generation Wang et al. (2019); Hu et al. (2017), and disentangled representation learning Higgins et al. (2017); Kim and Mnih (2018).
Popular VAE applications often involve a trade-off between reconstruction accuracy bounds and some other application-specific goal, effectively manipulated through KL-divergence. For example, in (synthetic) text or image generation, a goal is to produce new original text or images, as opposed to reproducing one of the samples in the training data. In text generation, if KL-divergence is too low, output diversity is reduced Bowman et al. (2015), which is known as the KL-vanishing problem. To increase output diversity, it becomes advantageous to artificially increase KL-divergence. The resulting approximation was shown to produce more diverse, yet still authentic-looking outputs. Conversely, disentangled representation learning Denton and others (2017) leverages the observation that KL-divergence in the VAE constitutes an upper bound on information transfer through the latent channels per data sample Burgess et al. (2018). Artificially decreasing KL-divergence (e.g., by increasing its weight in a VAE’s objective function, which is known as the $\beta$-VAE) therefore imposes a stricter information bottleneck, which was shown to force the learned latent factors to become more independent (i.e., non-redundant), leading to better disentangling. The above examples suggest that a useful extension of VAEs is one that allows users to exercise explicit control over KL-divergence in the objective function. ControlVAE realizes this extension.
We apply ControlVAE to three different applications: language modeling, disentangling, and image generation. Evaluation results on real-world datasets demonstrate that ControlVAE is able to achieve an adjustable trade-off between reconstruction error and KL-divergence. For disentangling, it can discover more disentangled factors and significantly reduce the reconstruction error compared to the $\beta$-VAE Burgess et al. (2018). For language modeling, it can not only completely avert the KL-vanishing problem, but also improve the diversity of generated data. Finally, we also show that ControlVAE improves the quality of synthetic image generation by slightly increasing the value of KL-divergence compared with the original VAE.
The objective function of VAEs consists of two terms: log-likelihood and KL-divergence. The first term tries to reconstruct the input data, while KL-divergence has the desirable effect of keeping the representation of input data sufficiently diverse. In particular, KL-divergence can affect both the reconstruction quality and diversity of generated data. If the KL-divergence is too high, it would affect the accuracy of generated samples. If it is too low, output diversity is reduced, which may be a problem in some applications such as language modeling Bowman et al. (2015) (where it is known as the KL-vanishing problem).
To mitigate KL vanishing, one promising way is to add an extra hyperparameter $\beta$ ($0 \le \beta \le 1$) in the VAE objective function to control the KL-divergence by increasing $\beta$ from 0 to 1 with a sigmoid function or a cyclic function Liu et al. (2019). These methods, however, blindly change $\beta$ without sampling the actual KL-divergence during model training. Using a similar methodology, researchers recently developed the $\beta$-VAE ($\beta > 1$) Higgins et al. (2017); Burgess et al. (2018) to learn disentangled representations by controlling the value of KL-divergence. However, $\beta$-VAE suffers from high reconstruction errors Kim and Mnih (2018), because it adds a very large $\beta$ in the VAE objective so the model tends to focus disproportionately on optimizing the KL term. In addition, its hyperparameter $\beta$ is fixed during model training, missing the chance of balancing the reconstruction error and KL-divergence.
The core technical challenge responsible for the above application problems lies in the difficulty of tuning the weight of the KL-divergence term during model training. Inspired by control systems, we address this problem using feedback control. Our controllable variational autoencoder is illustrated in Fig. 1. It samples the output KL-divergence at each training step $t$, and feeds it into an algorithm that tunes the hyperparameter, $\beta(t)$, accordingly, aiming to stabilize KL-divergence at a desired value, called the set point.
We further design a non-linear PI controller, a variant of the PID control algorithm Åström et al. (2006), to tune the hyperparameter $\beta(t)$. PID control is the basic and most prevalent form of feedback control in a large variety of industrial Åström et al. (2006) and software performance control Hellerstein et al. (2004) applications. The basic idea of the PID algorithm is to calculate an error, $e(t)$, between a set point (in this case, the desired KL-divergence) and the current value of the controlled variable (in this case, the actual KL-divergence), then apply a correction in a direction that reduces that error. The correction is applied to some intermediate directly accessible variable (in our case, $\beta(t)$) that influences the value of the variable we ultimately want to control (KL-divergence). In general, the correction computed by the controller is the weighted sum of three terms: one changes with the error (P), one changes with the integral of the error (I), and one changes with the derivative of the error (D). In a nonlinear controller, the changes can be described by nonlinear functions. Note that, since derivatives essentially compute the slope of a signal, when the signal is noisy, the slope often responds more to variations induced by noise. Hence, following established best practices in control of noisy systems, we do not use the derivative (D) term in our specific controller. Next, we introduce VAEs and our objective in more detail.
2.1 The Variational Autoencoder (VAE)
Suppose that we have a dataset $\mathbf{x}$ of i.i.d. samples that are generated by the ground-truth latent variable $\mathbf{z}$, interpreted as the representation of the data. Let $p_\theta(\mathbf{x}|\mathbf{z})$ denote a probabilistic decoder with a neural network to generate data $\mathbf{x}$ given the latent variable $\mathbf{z}$. The distribution of the representation corresponding to the dataset is approximated by the variational posterior, $q_\phi(\mathbf{z}|\mathbf{x})$, which is produced by an encoder with a neural network. The Variational Autoencoder (VAE) Rezende et al. (2014); Kingma and Welling (2013) has been one of the most popular generative models. The basic idea of the VAE can be summarized as follows: (1) the VAE encodes the input data samples $\mathbf{x}$ into a latent variable $\mathbf{z}$ as its distribution of representation via a probabilistic encoder, which is parameterised by a neural network; (2) it then adopts the decoder to reconstruct the original input data based on samples from $\mathbf{z}$. Because the posterior over the latent variables is intractable, the VAE is trained to optimize its variational lower bound of the log likelihood Kingma and Welling (2013):

$$\mathcal{L}_{vae} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) \qquad (1)$$
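where $p(\mathbf{z})$ is the prior distribution, e.g., a standard Gaussian. $p_\theta(\mathbf{x}|\mathbf{z})$ and $q_\phi(\mathbf{z}|\mathbf{x})$ denote the distributions parameterized by neural networks with the corresponding parameters $\theta$ and $\phi$, respectively. The first term in (1) is the reconstruction term, while the latter term is called the KL-divergence. In addition, a reparameterization trick is used to calculate the gradient of the lower bound with respect to $\phi$ Ivanov et al. (2018). It is defined by $\mathbf{z} = f(\phi, \boldsymbol{\epsilon})$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.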
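As a concrete illustration of the two terms in the lower bound, the following minimal sketch assumes a diagonal-Gaussian encoder and a standard Gaussian prior, for which the KL-divergence has a well-known closed form. The function names (`gaussian_kl`, `reparameterize`, `vae_lower_bound`) and the log-variance parameterization are illustrative assumptions, not taken from the paper.

```python
import math
import random


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) independently per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def vae_lower_bound(log_likelihood, mu, log_var):
    """Reconstruction term minus KL term for a single data sample."""
    return log_likelihood - gaussian_kl(mu, log_var)
```

Here `log_likelihood` stands for the decoder's $\log p_\theta(\mathbf{x}|\mathbf{z})$ evaluated at a sampled $\mathbf{z}$; how it is computed depends on the decoder and is left out of this sketch.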
However, the basic VAE models cannot explicitly control the KL-divergence to a specified value. They also often suffer from KL vanishing (in language modeling Bowman et al. (2015); Liu et al. (2019)), which means the KL-divergence becomes zero during optimization. To remedy this issue, one popular way is to add a hyperparameter $\beta$ on the KL term Bowman et al. (2015); Liu et al. (2019), and then gradually increase it from 0 to 1. However, the existing methods, such as KL cost annealing and cyclical annealing Bowman et al. (2015); Liu et al. (2019), cannot totally avert KL vanishing because they blindly vary the hyperparameter during model training.
$\beta$-VAE Higgins et al. (2017); Chen et al. (2018a) is an extension to the basic VAE framework, often used as an unsupervised method for learning a disentangled representation of the data generative factors. A disentangled representation, according to the literature Bengio et al. (2013), is defined as one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors. Compared to the original VAE, $\beta$-VAE adds an extra hyperparameter $\beta$ ($\beta > 1$) as a weight of KL-divergence in the original VAE objective (1). It can be expressed by

$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta \, D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) \qquad (2)$$
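In order to discover more disentangled factors, researchers further put a constraint on total information capacity, $C$, to control the capacity of the information bottleneck (KL-divergence) Burgess et al. (2018). Then the Lagrangian method is adopted to solve the following optimization problem:

$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta \, \big| D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) - C \big| \qquad (3)$$

where $\beta$ is a large hyperparameter (e.g., 100).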
However, one drawback of $\beta$-VAE is that it obtains good disentangling at the cost of reconstruction quality. When the weight $\beta$ is large, the optimization algorithm tends to optimize the second term in (3), leading to a high reconstruction error.
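The effect of a large weight can be seen with a few lines of arithmetic. The sketch below assumes the commonly used forms of the two objectives: a $\beta$-weighted KL term, and a weighted absolute deviation of the KL-divergence from a capacity $C$. The function names and example numbers are hypothetical.

```python
def beta_vae_objective(log_likelihood, kl, beta):
    """Lower bound with the KL term weighted by beta (to be maximized)."""
    return log_likelihood - beta * kl


def capacity_objective(log_likelihood, kl, beta, capacity):
    """Lower bound penalizing the distance of the KL term from a capacity C."""
    return log_likelihood - beta * abs(kl - capacity)
```

With `beta = 100`, a deviation of only 0.5 from the capacity costs 50 units of the objective, so the optimizer has a strong incentive to fix the KL term before improving reconstruction.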
The above background suggests that a common challenge in applying VAEs (and their extensions) lies in appropriate weight allocation between the reconstruction accuracy and the KL-divergence in the VAE objective function. As mentioned earlier, we solve this using a nonlinear PI controller that manipulates the value of the non-negative hyperparameter, $\beta(t)$. This algorithm is described next.
3 The ControlVAE Algorithm
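During model training, we sample the output KL-divergence, which we denote by $\hat{v}_{kl}(t)$, at training step $t$. The sampled KL-divergence is then compared to the set point, $v_{kl}$, and the difference, $e(t) = v_{kl} - \hat{v}_{kl}(t)$, is then used as the feedback to a controller to calculate the hyperparameter $\beta(t)$. ControlVAE can be expressed by the following variational lower bound:

$$\mathcal{L} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \beta(t) \, D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) \qquad (4)$$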
When KL-divergence drops below the set point, the controller counteracts this change by reducing the hyperparameter $\beta(t)$ (to reduce the penalty for KL-divergence in the objective function (4)). The reduced weight, $\beta(t)$, allows KL-divergence to grow, thus approaching the set point again. Conversely, when KL-divergence grows above the set point, the controller increases $\beta(t)$ (up to a certain value), thereby increasing the penalty for KL-divergence and forcing it to decrease. This effect is achieved by computing $\beta(t)$ using Equation (5), below, which is an instance of nonlinear PI control:

$$\beta(t) = \frac{K_p}{1 + \exp(e(t))} - K_i \sum_{j=0}^{t} e(j) + \beta_{min} \qquad (5)$$
where $K_p$ and $K_i$ are constants. The first term (on the right hand side) ranges between $0$ and $K_p$ thanks to the exponential function $\exp(\cdot)$. Note that when the error is large and positive (KL-divergence is below the set point), the first term approaches 0, leading to a lower $\beta(t)$ that encourages KL-divergence to grow. Conversely, when the error is large and negative (KL-divergence is above the set point), the first term approaches its maximum (which is $K_p$), leading to a higher $\beta(t)$ that encourages KL-divergence to shrink.
The second term of the controller sums (integrates) past errors with a sampling period $T$ (one training step in this paper). This creates a progressively stronger correction (until the sign of the error changes). The negative sign ensures that while errors remain positive (i.e., when KL-divergence is below the set point), this term continues to decrease, whereas while errors remain negative (i.e., when KL-divergence is above the set point), this term continues to increase. In both cases, the change forces $\beta(t)$ in a direction that helps KL-divergence approach the set point. In particular, note that when the error becomes zero, the second term (and thus the entire right hand side) stops changing, allowing the controller output, $\beta(t)$, to stay at the same value that hopefully caused the zero error in the first place. This allows the controller to "lock in" the value of $\beta(t)$ that meets the KL-divergence set point. Finally, $\beta_{min}$ is an application-specific constant. It effectively shifts the range within which $\beta(t)$ is allowed to vary. This PI controller is illustrated in Fig. 2.
3.1 PI Parameter Tuning for ControlVAE
One challenge of applying the PI control algorithm lies in how to tune its parameters, $K_p$ and $K_i$, effectively. While optimal tuning of nonlinear controllers is non-trivial, in this paper we follow a very simple rule: tune these constants to ensure that reactions to errors are sufficiently smooth to allow gradual convergence. Let us first consider the coefficient $K_p$. Observe that the maximum (positive) error occurs when the actual KL-divergence is close to zero. In this case, if $v_{kl}$ is the set point on KL-divergence, then the error, $e(t)$, is approximated by $e(t) \approx v_{kl}$. When KL-divergence is too small, the VAE does not learn useful information from input data Liu et al. (2019). We need to assign $\beta(t)$ a very small non-negative value, so that KL-divergence is encouraged to grow (when the resulting objective function is optimized). In other words, temporarily ignoring other terms in Equation (5), the contribution of the first term alone should be sufficiently small:

$$\frac{K_p}{1 + \exp(v_{kl})} \le \epsilon \qquad (6)$$
where $\epsilon$ is a small constant (e.g., in our implementation). The above (6) can also be rewritten as $K_p \le (1 + \exp(v_{kl}))\,\epsilon$. Empirically, we find that leads to good performance and satisfies the above constraint.
Conversely, when the actual KL-divergence is much larger than the desired value $v_{kl}$, the error $e(t)$ becomes a large negative value. As a result, the first term in (5) becomes close to a constant, $K_p$. If the resulting larger value of $\beta(t)$ is not enough to cause KL-divergence to shrink, one needs to gradually continue to increase $\beta(t)$. This is the job of the second term. The negative sign in front of that term ensures that when negative errors continue to accumulate, the positive output $\beta(t)$ continues to increase. Since it takes many steps to train deep VAE models, the increase per step should be very small, favoring smaller values of $K_i$. Empirically we found that a value of $K_i$ between and stabilizes the training. Note that $K_i$ should not be too small either, because it would then unnecessarily slow down the convergence.
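The tuning rule for the proportional coefficient is simple enough to evaluate directly. Assuming the constraint takes the form $K_p / (1 + \exp(v_{kl})) \le \epsilon$, where $v_{kl}$ is the set point and $\epsilon$ a small tolerance, the largest admissible coefficient is $(1 + \exp(v_{kl}))\,\epsilon$. The helper names and the example numbers below are hypothetical.

```python
import math


def max_kp(set_point, epsilon):
    """Largest K_p whose first-term contribution at zero KL stays below epsilon."""
    return (1.0 + math.exp(set_point)) * epsilon


def first_term_at_zero_kl(k_p, set_point):
    """Value of the proportional term when the sampled KL-divergence is ~0."""
    return k_p / (1.0 + math.exp(set_point))
```

For instance, with a set point of 3 and a tolerance of 0.001, `max_kp(3.0, 0.001)` is roughly 0.021, so any smaller coefficient satisfies the constraint. The bound grows quickly with the set point, so in practice the smoothness of the response is the tighter consideration.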
3.2 Set Point Guidelines for ControlVAE
For generative models, human graders are often needed to evaluate the generated samples, so it is very hard to get the optimal set point of KL-divergence for ControlVAE. Nevertheless, rules of thumb may apply from a controllability perspective. Note that, as we alluded to earlier, $\beta_{min}$ is application-specific. In general, when $\beta(t) \in [\beta_{min}, \beta_{max}]$, the upper bound of the expected KL-divergence is the value of KL-divergence as ControlVAE converges when $\beta(t) = \beta_{min}$, denoted by $V_{max}$. Similarly, its lower bound, $V_{min}$, can be defined as the KL-divergence produced by ControlVAE when $\beta(t) = \beta_{max}$. For feedback control to be most effective (i.e., not run against the above limits), the KL-divergence set point should be somewhere in the middle between these extremes. The closer it is to an extreme, the worse the controllability in one of the directions. Finally, if the set point is outside the interval $[V_{min}, V_{max}]$, then manipulating $\beta(t)$ within the interval $[\beta_{min}, \beta_{max}]$ will be ineffective at maintaining KL-divergence at that set point.
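The guideline above amounts to a range check. As a rough sketch, assume the two extreme KL-divergence values (here called `v_min` and `v_max`, obtained by training with the weight pinned at its largest and smallest allowed values) have been measured; the helper names are hypothetical.

```python
def set_point_position(set_point, v_min, v_max):
    """Relative position of the set point inside [v_min, v_max]; 0.5 is the middle."""
    if v_max <= v_min:
        raise ValueError("v_max must be larger than v_min")
    return (set_point - v_min) / (v_max - v_min)


def is_reachable(set_point, v_min, v_max):
    """True if the set point lies inside the interval the controller can reach."""
    return 0.0 <= set_point_position(set_point, v_min, v_max) <= 1.0
```

A position close to 0 or 1 signals that the controller has little room left in one direction, even though the set point is still formally reachable.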
3.3 Summary of the PI Control Algorithm
We summarize the proposed PI control algorithm in Algorithm 1. Our PI algorithm updates the hyperparameter, $\beta(t)$, with the feedback from the sampled KL-divergence, $\hat{v}_{kl}(t)$, at training step $t$. It first computes the error between the desired KL-divergence, $v_{kl}$, and the sampled $\hat{v}_{kl}(t)$. It then calculates the P term and the I term of the PI algorithm. Note that Lines 10 and 11 implement a popular constraint in PID/PI design, called anti-windup (Azar and Serrano, 2015; Peng et al., 1996). It effectively disables the integral term of the controller when the controller output gets out of range, so as not to exacerbate the out-of-range deviation. The hyperparameter $\beta(t)$ is then calculated by the PI algorithm in (5). Finally, the last lines limit $\beta(t)$ to a certain range, $[\beta_{min}, \beta_{max}]$.
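The steps summarized above can be sketched as a small stateful controller. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: the class and attribute names are hypothetical, the proportional term is assumed to be $K_p / (1 + \exp(e(t)))$, and the anti-windup rule used here is a common conditional-integration variant that skips an integral update only when the output is saturated and the update would push it further out of range.

```python
import math


class PIController:
    """Nonlinear PI controller with anti-windup and output clamping (sketch)."""

    def __init__(self, set_point, k_p, k_i, beta_min=0.0, beta_max=1.0):
        self.set_point = set_point
        self.k_p = k_p
        self.k_i = k_i
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.integral = 0.0   # accumulated I term
        self.saturation = 0   # -1: below range, 0: in range, +1: above range

    def step(self, sampled_kl):
        """Return the clamped weight for the next training step."""
        error = self.set_point - sampled_kl
        # P term; min() only protects math.exp from overflow.
        p_term = self.k_p / (1.0 + math.exp(min(error, 700.0)))
        # Candidate I-term update (negative sign, as in the text).
        delta = -self.k_i * error
        pushes_further = (self.saturation > 0 and delta > 0) or (
            self.saturation < 0 and delta < 0
        )
        if not pushes_further:  # anti-windup
            self.integral += delta
        raw = p_term + self.integral + self.beta_min
        if raw > self.beta_max:
            self.saturation = 1
        elif raw < self.beta_min:
            self.saturation = -1
        else:
            self.saturation = 0
        # Limit the output to [beta_min, beta_max].
        return min(max(raw, self.beta_min), self.beta_max)
```

In a training loop, one would call `step` once per iteration with the KL-divergence measured on the current mini-batch and use the returned value as the KL weight in the loss for that iteration.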
3.4 Applications of ControlVAE
As a preliminary demonstration of the general applicability of the above approach and as an illustration of its customizability, we apply ControlVAE to three different applications stated below.
Language modeling: We first apply ControlVAE to solve the KL vanishing problem while also improving the diversity of generated data. As mentioned in Section 2.1, VAE models often suffer from KL vanishing in language modeling. The existing methods cannot completely solve the KL vanishing problem because they blindly change $\beta$ in the VAE objective without monitoring the output KL-divergence during model training. In this paper, we adopt ControlVAE to control KL-divergence to a specified value, using the output KL-divergence as feedback, to avoid KL vanishing. Following the PI tuning strategy in Section 3.1, we set $K_p$, $K_i$ of the PI algorithm in (5) to and , respectively. In addition, $\beta_{min}$ is set to and the maximum value of $\beta(t)$ is limited to .
Disentangling: We then apply the ControlVAE model to achieve a better trade-off between reconstruction quality and disentangling. As mentioned in Section 2.2, $\beta$-VAE ($\beta > 1$) assigns a large hyperparameter to the objective function to control the KL-divergence (information bottleneck), which, however, leads to a large reconstruction error. To mitigate this issue, we adopt ControlVAE to automatically adjust the hyperparameter $\beta(t)$ based on the output KL-divergence during model training. Using a methodology similar to Burgess et al. (2018), we train a single model by gradually increasing KL-divergence from to a desired value $C$. However, different from $\beta$-VAE, which linearly increases $C$, we adopt a step function to increase $C$ for every training steps in order to stabilize model training. Since , we set $\beta_{min}$ to for the PI algorithm in (5). Following the PI tuning method above, the coefficients $K_p$ and $K_i$ are set to and , respectively.
Image generation: The basic VAE models tend to produce blurry and unrealistic samples for image generation Zhao et al. (2017a). In this paper, we leverage ControlVAE to manipulate (slightly increase) the value of KL-divergence to improve the reconstruction quality of generated images. Different from the original VAE ($\beta(t) = 1$), we extend the range of the hyperparameter, $\beta(t)$, from to in our ControlVAE model. Given a desired KL-divergence, ControlVAE can automatically tune $\beta(t)$ within that range. For this task, we use the same PI control algorithm and hyperparameters as for language modeling above.
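For the disentangling application, the set point itself follows a step-function schedule rather than a linear ramp. A minimal sketch of such a schedule is shown below; the function name and all four schedule parameters (`start`, `target`, `increment`, `interval`) are hypothetical placeholders, not the settings used in the experiments.

```python
def step_schedule(step, start, target, increment, interval):
    """Set point that rises by `increment` every `interval` steps, capped at `target`."""
    return min(target, start + increment * (step // interval))
```

The returned value would be handed to the controller as its set point at each training step, so the KL-divergence is raised in small plateaus that give the model time to settle before the next increase.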
We evaluate the performance of ControlVAE on real-world datasets in the three different applications mentioned above. Source code will be publicly available upon publication.
|Methods||Dis-1||Dis-2||Dis-3||Dis-4|
|ControlVAE-KL-35||6.27K ± 41||95.86K ± 1.02K||274.36K ± 3.02K||405.65K ± 3.94K|
|ControlVAE-KL-25||6.10K ± 60||83.15K ± 4.00K||244.29K ± 12.11K||385.46K ± 15.79K|
|Cost annealing (KL = 17)||5.71K ± 87||69.60K ± 1.53K||208.62K ± 4.04K||347.65K ± 5.85K|
|Cyclical (KL = 21.5)||5.79K ± 81||71.63K ± 2.04K||211.29K ± 6.38K||345.17K ± 11.65K|
The datasets used for our experiments are introduced below.
Language modeling: 1) Penn Tree Bank (PTB) Marcus et al. (1993): it consists of training sentences, validation sentences and testing sentences. 2) Switchboard (SW) Godfrey and Holliman (1997): it has two-sided telephone conversations with manually transcribed speech and alignment. The data is randomly split into , and dialogs for training, validation and testing.
Image generation: 1) CelebA (cropped version) Liu et al. (2015): it has RGB images of celebrity faces. The data is split into and images for training and testing.
4.2 Model Configurations
The detailed model configurations and hyperparameter settings for each model are presented in Appendix A.
4.3 Evaluation on Language Modeling
First, we compare the performance of ControlVAE with the following baselines for mitigating KL vanishing in text generation Bowman et al. (2015). Cost annealing Bowman et al. (2015): This method gradually increases the hyperparameter on KL-divergence from 0 to 1 after training steps using a sigmoid function. Cyclical annealing Liu et al. (2019): This method splits the training process into cycles, each of which increases the hyperparameter from 0 to 1 using a linear function.
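Both baselines are open-loop schedules: the KL weight depends only on the step counter, never on the measured KL-divergence. The sketch below shows one plausible form of each. The parameter names (`midpoint`, `steepness`, `ramp_fraction`) and the choice to hold the weight at 1 for the remainder of each cycle are assumptions made for illustration.

```python
import math


def sigmoid_annealing(step, midpoint, steepness):
    """Cost annealing: weight rises smoothly from ~0 to ~1 around `midpoint`."""
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))


def cyclical_annealing(step, total_steps, num_cycles, ramp_fraction=0.5):
    """Cyclical annealing: in each cycle the weight ramps linearly from 0 to 1,
    then stays at 1 until the cycle ends."""
    cycle_length = total_steps / num_cycles
    position = (step % cycle_length) / cycle_length  # in [0, 1)
    return min(1.0, position / ramp_fraction)
```

Because neither function takes the KL-divergence as input, neither can react if the KL term collapses toward zero midway through training, which is the limitation the feedback controller is meant to address.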
Fig. 3 illustrates the comparison results of KL-divergence, reconstruction loss and the hyperparameter $\beta(t)$ for different methods on the PTB dataset. Note that here ControlVAE-KL-$v$ means we set the KL-divergence to a desired value $v$ (e.g., 3) for our PI controller following the set point guidelines in Section 3.2. Cost-annealing-$v$ means we increase the hyperparameter, $\beta(t)$, from 0 to 1 after $v$ steps. We observe from Fig. 3(a) that ControlVAE (KL=1.5, 3) and Cyclical annealing ( cycles) can avert KL vanishing. However, our ControlVAE is able to stabilize the KL-divergence while cyclical annealing cannot. Moreover, our method has a lower reconstruction loss than cyclical annealing in Fig. 3(b). The cost annealing method still suffers from KL vanishing, because we use the Transformer Vaswani et al. (2017) as the decoder, which can predict the current data based on previous ground-truth data. Fig. 3(c) illustrates the tuning result of $\beta(t)$ by ControlVAE compared with other methods. We can see that our $\beta(t)$ gradually converges to around a certain value. Note that here $\beta(t)$ of ControlVAE does not converge to 1 because we slightly increase the value of KL-divergence (produced by the original VAE) in order to improve the diversity of generated data.
In order to further demonstrate that ControlVAE can improve the diversity of generated text, we apply it to dialog-response generation using the Switchboard (SW) dataset. Following Zhao et al. (2017b), we adopt a conditional VAE Zhao et al. (2017b) that generates dialog conditioned on the previous response. According to the literature Xu et al. (2018), the metric Dis-$n$, the number of distinct $n$-grams, is used to measure the diversity of generated data. Table 1 illustrates the comparison results for different approaches. We can see that ControlVAE has more distinct $n$-grams than Cost annealing and Cyclical annealing when the desired KL-divergence is set to 35 and 25. Thus, we can conclude that ControlVAE can improve the diversity of generated data by slightly increasing the KL-divergence of the original VAE. We also illustrate some examples of dialog generated by ControlVAE in Appendix B.
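The diversity metric described above, the number of distinct $n$-grams, is straightforward to compute. The following sketch assumes the generated responses are already tokenized into lists of strings; the function name `distinct_ngrams` is hypothetical, and any normalization used in the reported numbers is not reproduced here.

```python
def distinct_ngrams(sentences, n):
    """Count the distinct n-grams appearing anywhere in a collection of token lists."""
    seen = set()
    for tokens in sentences:
        # range() is empty when the sentence is shorter than n tokens.
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
    return len(seen)
```

A model that keeps producing the same few generic responses yields a small count regardless of how many responses are generated, while a more varied model yields counts that keep growing with $n$.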
Figure 4: (a) and (b) show the comparison of reconstruction error and the hyperparameter $\beta(t)$ for different models using random seeds. ControlVAE (KL=16, 18) has lower reconstruction errors and variance compared to $\beta$-VAE. (c) shows an example of the disentangled factors in the latent variable as the total KL-divergence increases from to 18 for ControlVAE (KL=18). Each curve with positive KL-divergence (except the black one) represents one factor disentangled by ControlVAE.
4.4 Evaluation on Disentangled Representations
We then evaluate the performance of ControlVAE on the learning of disentangled representations using 2D Shapes data. We compare it with two baselines: FactorVAE Kim and Mnih (2018) and $\beta$-VAE Burgess et al. (2018).
|Metric||ControlVAE (KL=16)||ControlVAE (KL=18)||$\beta$-VAE ()||FactorVAE ()|
|MIG||0.5519 ± 0.0323||0.5146 ± 0.0199||0.5084 ± 0.0476||0.5139 ± 0.0428|
Fig. 4(a) and (b) show the comparison of reconstruction error and the hyperparameter $\beta(t)$ (using random seeds) for different models. We can observe from Fig. 4(a) that ControlVAE (KL=16, 18) has lower reconstruction error and variance than the baselines. This is because our ControlVAE automatically adjusts the hyperparameter, $\beta(t)$, to stabilize the KL-divergence, while the other two methods keep the hyperparameter unchanged during model training. Specifically, for ControlVAE (KL=18), the hyperparameter $\beta(t)$ is high in the beginning in order to obtain good disentangling, and then it gradually drops to around as the training converges, as shown in Fig. 4(b). In contrast, $\beta$-VAE () has a large and fixed weight on the KL-divergence so that its optimization algorithm tends to optimize the KL-divergence term, leading to a large reconstruction error. In addition, Fig. 4(c) illustrates an example of KL-divergence per factor in the latent code as training progresses and the total information capacity (KL-divergence) increases from to 18. We can see that ControlVAE disentangles all five generative factors, starting from the positional latents ($x$ and $y$) to scale, followed by orientation and then shape.
To further demonstrate that ControlVAE can achieve better disentangling, we use a disentanglement metric, mutual information gap (MIG) Chen et al. (2018a), to compare the models, as shown in Table 2. It can be observed that ControlVAE (KL=16) has a higher MIG score and lower variance than the baselines. Besides, we show the qualitative results of different models in Fig. 5. We can observe that ControlVAE can discover all five generative factors: positional latents ($x$ and $y$), scale, orientation and shape. However, $\beta$-VAE () disentangles only four generative factors, entangling scale and shape together (in the third row), while FactorVAE () does not disentangle the orientation factor very well in the fourth row in Fig. 5. Thus, ControlVAE achieves better reconstruction quality and disentangling than the baselines.
4.5 Evaluation on Image Generation
Finally, we compare the reconstruction quality of image generation for ControlVAE and the original VAE. Fig. 6 shows the comparison of reconstruction error and KL-divergence under different desired values of KL-divergence for random seeds. We can see from Fig. 6(a) that ControlVAE-KL-200 (KL=200) has the lowest reconstruction error among them. In addition, when we set the desired KL-divergence to 170 (same as the basic VAE in Fig. 6(b)), ControlVAE has the same reconstruction error as the original VAE. At that point, ControlVAE becomes the original VAE as $\beta(t)$ finally converges to 1, as shown in Fig. 7 in Appendix C.
We further adopt two commonly used metrics for image generation, FID Lucic et al. (2018) and SSIM Chen et al. (2018b), to evaluate the performance of ControlVAE in Table 3. It can be observed that ControlVAE-KL-200 outperforms the other methods in terms of FID and SSIM. Therefore, our ControlVAE can improve the reconstruction quality of generated images by controlling the value of KL-divergence. We also show some generated images in Appendix D to verify that ControlVAE has better reconstruction quality than the basic VAE.
|Methods||FID||SSIM|
|ControlVAE-KL-200||55.16 ± 0.187||0.687 ± 0.0002|
|ControlVAE-KL-180||57.57 ± 0.236||0.679 ± 0.0003|
|ControlVAE-KL-170||58.75 ± 0.286||0.675 ± 0.0001|
|Original VAE||58.71 ± 0.207||0.675 ± 0.0001|
5 Related Work
In language modeling, the VAE often suffers from KL vanishing due to a powerful decoder, such as the Transformer Vaswani et al. (2017) or LSTM. In recent years, researchers have developed many methods, such as KL cost annealing Bowman et al. (2015), cyclical annealing, and dilated CNN decoders Yang et al. (2017), to tackle this problem. However, these methods cannot totally solve the KL vanishing issue or explicitly control the KL-divergence. In contrast, our approach can avert KL vanishing by using a PI control algorithm to automatically tune the hyperparameter in the objective based on the output KL-divergence.
Recently, researchers proposed a novel modification of the VAE, called $\beta$-VAE ($\beta > 1$) Higgins et al. (2017), to learn disentangled representations. They assigned a large value to the hyperparameter $\beta$ to disentangle the generative factors. However, $\beta$-VAE sacrifices reconstruction quality in order to obtain better disentangling. Researchers then developed other models, such as FactorVAE Kim and Mnih (2018); Kim et al. (2019) and TCVAE Chen et al. (2018a), to improve the reconstruction quality. However, the drawback of these methods is that they assign a fixed hyperparameter to the KL term or the decomposed terms in the objective. In contrast, our ControlVAE can automatically tune the hyperparameter during optimization to stabilize the KL-divergence, and it can also be used as a plug-in replacement for existing methods.
The VAE and its variants are also applied to generate synthetic images, but the generated samples are blurry and unrealistic Zhao et al. (2017a). In order to improve its performance, researchers developed a variational lossy autoencoder (VLAE) by borrowing the idea of autoregressive flow. However, VLAE is computationally expensive. Besides, researchers adopted constrained optimization for reconstruction Rezende and Viola (2018); Klushyn et al. (2019) to achieve a trade-off between reconstruction error and KL-divergence. However, these methods may suffer from posterior collapse if the inference network fails to cover the latent space. Recent studies mainly adopt generative adversarial networks (GANs) Goodfellow et al. (2014); Zhu et al. (2017); Radford et al. (2015); Arjovsky et al. (2017) to improve the quality of generated images. However, GANs are very difficult to train because they easily suffer from collapse. Different from existing methods, we add an additional hyperparameter on the KL term, and then leverage ControlVAE to manipulate the KL-divergence to reduce the reconstruction error.
In this paper, we proposed a general controllable VAE framework, ControlVAE, that combines automatic control with the basic VAE framework to improve the performance of VAE models. We designed a new non-linear PI controller to control the value of KL-divergence during model training. We then evaluated ControlVAE on three different tasks. The results show that ControlVAE attains better performance: it improves the ability to disentangle latent factors, averts KL vanishing in language modeling, and improves the reconstruction quality for image generation. Other applications are a topic of the authors’ future research.
- Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §5.
- Advanced pid control. Vol. 461, ISA-The Instrumentation, Systems, and Automation Society Research Triangle …. Cited by: §2.
- Design and modeling of anti wind up pid controllers. In Complex system modelling and control through intelligent soft computations, pp. 1–44. Cited by: §3.3.
- Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §2.2.
- Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349. Cited by: §A.1, §1, §2.1, §2, §4.3, §5.
- Understanding disentangling in beta-vae. arXiv preprint arXiv:1804.03599. Cited by: §1, §1, §2.2, §2, 2nd item, Figure 5, §4.4.
- Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620. Cited by: §2.2, §4.4, §5.
Attention-gan for object transfiguration in wild images.
Proceedings of the European Conference on Computer Vision (ECCV), pp. 164–180. Cited by: §4.5.
- Unsupervised learning of disentangled representations from video. In Advances in neural information processing systems, pp. 4414–4423. Cited by: §1.
- Switchboard-1 release 2-linguistic data consortium. SWITCHBOARD: A User’s Manual. Cited by: 1st item.
- Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §5.
- Feedback control of computing systems. John Wiley & Sons. Cited by: §2.
- Beta-vae: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6. Cited by: §A.2, §1, §2.2, §2, §5.
- Texar: a modularized, versatile, and extensible toolkit for text generation. In ACL 2019, System Demonstrations. Cited by: §A.1.
- Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1587–1596. Cited by: §1.
- Variational autoencoder with arbitrary conditioning. arXiv preprint arXiv:1806.02382. Cited by: §2.1.
- Disentangling by factorising. In International Conference on Machine Learning, pp. 2654–2663. Cited by: §1, §2, 2nd item, §4.4, §5.
- Relevance factor vae: learning and identifying disentangled factors. arXiv preprint arXiv:1902.01568. Cited by: §5.
- Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2.1.
- Learning hierarchical priors in vaes. In Advances in Neural Information Processing Systems, pp. 2866–2875. Cited by: §5.
- Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708. Cited by: §1.
- Cyclical annealing schedule: a simple approach to mitigating kl vanishing. arXiv preprint arXiv:1903.10145. Cited by: §2.1, §2, §3.1, §4.3.
- Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738. Cited by: 3rd item.
- Are gans created equal? a large-scale study. In Advances in neural information processing systems, pp. 700–709. Cited by: §4.5.
- Building a large annotated corpus of english: the penn treebank. Cited by: 1st item.
- Dsprites: disentanglement testing sprites dataset. URL https://github.com/deepmind/dsprites-dataset/. [Accessed on: 2018-05-08]. Cited by: 2nd item.
- Anti-windup, bumpless, and conditioned transfer techniques for pid controllers. IEEE Control systems magazine 16 (4), pp. 48–57. Cited by: §3.3.
- Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §5.
- Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: §2.1.
- Taming vaes. arXiv preprint arXiv:1810.00597. Cited by: §5.
- Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491. Cited by: §1.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.3, §5.
- Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137. Cited by: §1.
- Dp-gan: diversity-promoting generative adversarial network for generating informative and diversified text. arXiv preprint arXiv:1802.01345. Cited by: §4.3.
- Attribute2image: conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791. Cited by: §1.
- Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3881–3890. Cited by: §5.
- Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658. Cited by: 3rd item, §5.
- Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960. Cited by: §A.1, §4.3.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §5.
Appendix A Model Configurations and hyperparameter settings
We summarize the detailed model configurations and hyperparameter settings for ControlVAE in the three applications below. Our source code has already been submitted to the review system.
A.1 Experimental Details for Language Modeling
For text generation on PTB data, we build the ControlVAE model on the basic VAE model, as in Bowman et al. [2015]. We use a one-layer LSTM as the encoder, a three-layer Transformer with eight heads as the decoder, and a Multi-Layer Perceptron (MLP) to learn the latent variable. The maximum sequence length for LSTM and Transformer is set to , respectively. The size of the latent variable is set to . We set the dimension of the word embedding to and the batch size to . In addition, the dropout is for LSTM and Transformer. Adam optimization with the learning rate is used during training. Following the tuning guidelines above, we set the coefficients K_p and K_i of the P term and I term to and , respectively. Finally, we use the source code on the Texar platform to implement the experiments Hu et al. [2019].
For dialog-response generation, we follow the model architecture and hyperparameters of the basic conditional VAE in Zhao et al. [2017b]. We use a one-layer bi-directional GRU as the encoder, a one-layer GRU as the decoder, and two fully-connected layers to learn the latent variable. In the experiment, the size of both the latent variable and the word embeddings is set to . The maximum length of the input/output sequence for the GRU is set to with batch size . Adam with initial learning rate is used. We set the same K_p and K_i for the PI algorithm as in text generation above. The model architectures of ControlVAE for these two NLP tasks are illustrated in Tables 4 and 5.
| Encoder | Decoder |
|---|---|
| Input words | Input , |
| FC | 3-layer Transformer 8 heads |
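The P-term and I-term coefficients mentioned above belong to the PI controller that tunes the KL weight during training. A minimal sketch of such a controller follows, assuming a sigmoid-shaped P term and conditional-integration anti-windup. The class name `PIController`, the arguments `k_p`, `k_i`, `beta_min`, `beta_max`, and the exact update rule are illustrative assumptions, not the released implementation.

```python
import math


class PIController:
    """Illustrative PI controller mapping the KL error to a KL weight beta.

    All names and the update rule are assumptions made for this sketch.
    """

    def __init__(self, k_p, k_i, beta_min=0.0, beta_max=1.0):
        self.k_p = k_p            # coefficient of the P term
        self.k_i = k_i            # coefficient of the I term
        self.beta_min = beta_min  # lower bound on the KL weight
        self.beta_max = beta_max  # upper bound on the KL weight
        self.integral = 0.0       # accumulated I-term contribution

    def update(self, desired_kl, actual_kl):
        """Return the KL weight for the next training step."""
        error = desired_kl - actual_kl
        # Non-linear P term bounded in (0, k_p); the exponent is capped
        # to avoid floating-point overflow for very large errors.
        p_term = self.k_p / (1.0 + math.exp(min(error, 50.0)))
        # Candidate I term: grows while the actual KL exceeds the target.
        candidate_integral = self.integral - self.k_i * error
        beta = p_term + candidate_integral + self.beta_min
        # Anti-windup: freeze the integral while the output is saturated
        # and the error would push it further into saturation.
        pushing_high = beta > self.beta_max and error < 0
        pushing_low = beta < self.beta_min and error > 0
        if not (pushing_high or pushing_low):
            self.integral = candidate_integral
        # Clamp the output to the allowed range.
        return min(max(beta, self.beta_min), self.beta_max)
```

In a training loop, `update` would be called once per step with the target KL-divergence and the KL-divergence measured on the current batch, and the returned value would weight the KL term of the loss. Any gains passed to the constructor here are placeholders, not the settings used in the experiments.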
A.2 Experimental Details for Disentangling
Following the same model architecture as β-VAE Higgins et al. [2017], we adopt a convolutional encoder and a deconvolutional decoder for our experiments. We use Adam optimizer with , and a learning rate tuned from . We set K_p and K_i for the PI algorithm to and , respectively. For the step function, we set the step, , to per training steps as the information capacity (desired KL-divergence) increases from until for 2D Shape data. ControlVAE uses the same encoder and decoder architecture as β-VAE except for plugging in the PI control algorithm, as illustrated in Table 6.
| Encoder | Decoder |
|---|---|
| Input binary image | Input |
| conv. ReLU. stride 2 | FC. 256 ReLU. |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| FC . FC. | upconv. ReLU. stride 2 |
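The step function that raises the information capacity (the desired KL-divergence) during training can be sketched as follows. This is a minimal illustration under assumed names: `desired_kl`, `kl_start`, `kl_end`, `step_size`, and `interval` are hypothetical, and any numbers passed to it are placeholders rather than the values used in the experiments.

```python
def desired_kl(train_step, kl_start, kl_end, step_size, interval):
    """Step-wise set point for the desired KL-divergence.

    The target starts at kl_start, rises by step_size every `interval`
    training steps, and is capped at kl_end. All argument names are
    assumptions made for this sketch.
    """
    if interval <= 0:
        raise ValueError("interval must be a positive number of steps")
    # Number of completed intervals so far.
    increments = train_step // interval
    return min(kl_start + increments * step_size, kl_end)
```

The returned value would serve as the set point handed to the PI controller at each training step, so the controller tracks a slowly rising target instead of a single fixed one.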
A.3 Experimental Details for Image Generation
Similar to the architecture of β-VAE, we use a convolutional encoder with batch normalization and a deconvolutional decoder with batch normalization for our experiments. We use Adam optimizer with , and a learning rate for CelebA data. The size of the latent variable is set to , because we find it gives a better reconstruction quality than and . In addition, we set the desired value of KL-divergence to (same as the original VAE), , and . For the PI control algorithm, we set K_p and K_i to and , respectively. We also use the same encoder and decoder architecture as β-VAE above, except that we add batch normalization to improve the stability of model training, as shown in Table 7.
| Encoder | Decoder |
|---|---|
| Input RGB image | Input |
| conv. ReLU. stride 2 | FC. 256 ReLU. |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| conv. ReLU. stride 2 | upconv. ReLU. stride 2 |
| FC . FC. | upconv. ReLU. stride 2 |
Appendix B Examples of Generated Dialog by ControlVAE
In this section, we show an example that compares the diversity and relevance of the dialog generated by different methods, as illustrated in Table 8. Alice begins an open-ended conversation on choosing a college, and the model tries to predict the response from Bob. The ground-truth response is “um - hum”. We can observe from Table 8 that ControlVAE (KL=25, 35) generates diverse responses that remain relevant to the ground truth. In contrast, while cyclical annealing can also generate diverse text, some of its responses are not very relevant to the ground-truth response.
Context: (Alice) and a lot of the students in that home town sometimes unk the idea of staying and going to school across the street so to speak
Topic: Choosing a college. Target: (Bob) um - hum

| ControlVAE-KL-25 | ControlVAE-KL-35 | Cost annealing (KL=17) | Cyclical anneal (KL=21.5) |
|---|---|---|---|
| yeah | uh - huh | oh yeah | yeah that’s true do you do you do it |
| um - hum | yeah | uh - huh | yeah |
| oh that’s right um - hum | oh yeah oh absolutely | right | um - hum |
| yes | right | uh - huh and i think we have to be together | yeah that’s a good idea |
| right | um - hum | oh well that’s neat yeah well | yeah i see it too, it’s a neat place |
Appendix C β(t) of ControlVAE for Image Generation on CelebA Data
Fig. 7 compares the KL weight β(t) for different methods during model training. We can observe that β(t) finally converges to 1 when the desired value of KL-divergence is set to , the same as for the original VAE. At this point, ControlVAE becomes the original VAE. Thus, ControlVAE can be customized by users for different applications.
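The reduction to the original VAE can be made explicit. Assuming the objective takes the usual weighted form, with the KL weight written here as $\beta(t)$ and with $q_\phi(z \mid x)$, $p_\theta(x \mid z)$, and $p(z)$ denoting the encoder, decoder, and prior (this notation is assumed for illustration), a weight of one recovers the standard evidence lower bound:

```latex
\mathcal{L}_{\beta}(x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \beta(t)\, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)

\beta(t) = 1 \;\Longrightarrow\;
\mathcal{L}_{\beta}(x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
  = \mathrm{ELBO}(x)
```

Under this reading, a set point equal to the KL-divergence that the original VAE reaches leaves the controller with no error to correct once the weight settles at one.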
Appendix D Examples of Generated Images by VAE and ControlVAE
We also show some images generated by ControlVAE and the original VAE in Fig. 8. It can be observed that the images generated by ControlVAE-KL-200 (KL = 200) have better reconstruction quality than those of the original VAE. Take the woman in the first row, last column as an example. The woman does not show her teeth in the ground-truth image. However, the woman generated by the original VAE smiles with her mouth open. In contrast, the woman generated by ControlVAE-KL-200 hardly shows her teeth when smiling. In addition, the other two examples marked with blue rectangles show that ControlVAE-KL-200 better reconstructs the “smile” of the man and the “ear” of the woman than the original VAE does. Therefore, we conclude that ControlVAE can improve reconstruction quality by slightly increasing (controlling) the KL-divergence relative to the original VAE. It should be pointed out that the differences are not very pronounced because we use a simple VAE model in the experiments. In future work, we plan to adopt more advanced VAE models to improve the performance.