AdaVAE: Exploring Adaptive GPT-2s in Variational Auto-Encoders for Language Modeling

by   Haoqin Tu, et al.
Tsinghua University

Variational Auto-Encoders (VAEs) have become a de-facto learning paradigm for achieving both representation learning and generation in natural language. However, existing VAE-based language models either employ elementary RNNs, which are not powerful enough to handle complex situations, or fine-tune two pre-trained language models (PLMs) for every downstream task, which is a huge drain on resources. In this paper, we introduce the first VAE framework empowered with adaptive GPT-2s (AdaVAE). Different from existing systems, we unify both the encoder and decoder of the VAE using GPT-2s with adaptive parameter-efficient components. Experiments from multiple dimensions validate that AdaVAE is competent to better organize language in generation tasks and representation modeling, even with fewer than 15% of parameters activated in training. Our code is available at <>.




1 Introduction

As a competitive solution to miscellaneous natural language processing (NLP) tasks, the variational auto-encoder (VAE) Bowman et al. (2015) can be not only a powerful generative model but also a feature learning tool when trained properly. As the potential of employing large pre-trained language models (PLMs) has been explored, past works have sought to incorporate large-scale PLMs such as BERT Devlin et al. (2018) and GPT-2 Radford et al. (2019) into VAE models, which strengthens VAEs in various tasks Li et al. (2020); Park and Lee (2021); Fang et al. (2021) and largely avoids the KL collapse issue.

However, existing “big VAE” approaches fine-tune all parameters of the encoder and decoder, which means at least two separate PLMs are fully activated during training. This leads to prohibitive computational overhead and low training efficiency in situations such as multi-task learning. The main obstacle to overcome when tuning a VAE with PLMs therefore becomes: how to schedule the training paradigm of the VAE's encoder and decoder so that the model can be tuned in a dynamically efficient way without sacrificing performance compared with fine-tuning.

For plain PLMs, taming them with high efficiency in different situations is one of the top trends in NLP. Various lightweight alternatives have been proposed Houlsby et al. (2019); Pfeiffer et al. (2020); Li and Liang (2021); Hu et al. (2021); Lester et al. (2021); He et al. (2021); they generally share the same approach of introducing a small number of additional trainable parameters to PLMs instead of updating the original large transformer-based models. Recently, He et al. (2021) unified all these methods into a single paradigm, showing that such parameter-efficient components can be placed “parallelly” or “sequentially” with respect to the “attention” or “feed-forward” layers in transformer blocks. They further verified that these methods generally perform comparably with fine-tuned PLMs on understanding and generation tasks.
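The placement options described above can be illustrated with a minimal NumPy sketch (all names and sizes here are illustrative, not the paper's implementation): a bottleneck adapter is inserted either after ("sequentially") or alongside ("parallelly") a frozen sublayer, and only its two small projection matrices are trained.

```python
import numpy as np

def adapter(h, W_down, W_up):
    # Bottleneck adapter: down-project, ReLU, up-project.
    # Only W_down / W_up would be trained; the backbone stays frozen.
    return np.maximum(h @ W_down, 0.0) @ W_up

rng = np.random.default_rng(0)
d, r = 8, 2                                  # hidden size, bottleneck size (r << d)
W_down = rng.normal(size=(d, r)) * 0.1
W_up = rng.normal(size=(r, d)) * 0.1
h = rng.normal(size=(4, d))                  # hidden states for 4 tokens
ffn_out = h                                  # stand-in for the frozen FFN's output

# "sequential": adapter consumes the sublayer's output; "parallel": its input.
sequential = ffn_out + adapter(ffn_out, W_down, W_up)
parallel = ffn_out + adapter(h, W_down, W_up)
```

The trainable parameter count is 2·d·r per adapter, far below the d·d of a full projection, which is what makes the approach cheap.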

Our AdaVAE essentially comprises two adaptive parameter-efficient GPT-2s, which leverage powerful PLMs without excessive resource consumption. We propose the Latent Attention operation to better construct the latent space from the encoder, and we further investigate multiple latent knowledge fusion methods in the decoder for the language generation process. Experiments on three tasks produce promising results with regard to both model efficiency and automatic metrics.

Dataset:                        YELP                            YAHOO
                         LM             Repr.            LM             Repr.
Model                 PPL↓   -ELBO↓   MI↑   AU↑      PPL↓   -ELBO↓   MI↑   AU↑
AdaVAE               15.49   125.56   7.55   32     14.23   121.40   7.49   32
w/ Fine-tuning       18.59   125.02   7.49   32     15.57   121.05   7.52   32
w/ LG                31.01   129.17   3.32   32     19.64   123.16   2.32   32
w/ LG; -PSA; +AtM    27.70   129.53   4.65   32     20.09   121.92   3.61   32
w/ LG; +AtM          30.44   129.39   4.29   32     22.54   122.19   4.36   32
+Prefix              14.96   124.13   6.55   32     15.17   120.89   3.70   32
w/ parallel_attn     16.32   125.91   7.57   32     15.22   122.22   7.40   32
w/ sequential_attn   17.98   127.33   7.55   32     15.05   121.69   7.47   32
Table 1: The proposed AdaVAE with different parameter-efficient/latent generation frameworks on the language modeling task. #params. is the percentage of (additional) trainable parameters relative to the original language model. The free bits threshold is identical in all cases.

Contributions. (1) To the best of our knowledge, AdaVAE is the first big VAE model with adaptive parameter-efficient PLMs that can be optimized with a minimum of trainable parameters. (2) We propose Latent Attention as a latent space construction method and explore multiple infusion methods as well as adaptive elements. (3) AdaVAE achieves SoTA performance in language modeling and comparable performance in classification, with only a small fraction of parameters activated.

2 AdaVAE Methodologies

2.1 Adaptive GPT-2 Encoder and Decoder

The encoder of a VAE should extract features from given contents to produce a meaningful latent space, while the decoder ought to generate fluent sentences from given latent representations. To obtain a unified word embedding space, we construct both the encoder and decoder of AdaVAE from GPT-2, which frees us from connecting two distinct word embedding spaces as faced in Li et al. (2020). To make GPT-2 a qualified encoder, we borrow from strong extractors such as BERT, one of whose architectural advantages is the unmasked/bi-directional transformer layer. We therefore remove the causal mask in the GPT-2 transformer layers so that the AdaVAE encoder has full vision of the input context; this mask-free operation is widely used in the encoders of existing PLMs Raffel et al. (2019); Lewis et al. (2019); Fang et al. (2021). As the decoder, we employ GPT-2 unchanged, since it is by nature a powerful generative transformer model.
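The mask-free modification can be illustrated with a small sketch (illustrative NumPy code, not the paper's implementation): dropping the upper-triangular causal mask lets every position attend to the full input.

```python
import numpy as np

def attention_weights(q, k, causal):
    """Scaled dot-product attention weights, optionally causally masked."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Decoder-style mask: position i may only attend to positions <= i.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = k = rng.normal(size=(3, 4))                 # 3 tokens, head dim 4
dec = attention_weights(q, k, causal=True)      # standard GPT-2 decoder behaviour
enc = attention_weights(q, k, causal=False)     # mask removed: full-context encoder
```

With the mask in place, the first token can only attend to itself; without it, the same weights spread over the whole sequence, which is the bidirectional behaviour the encoder needs.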

The paradigm of fine-tuning two separate PLMs in large-scale VAEs requires far more computing resources than a single PLM, and the storage requirements become too heavy to tolerate as task loads increase. To avoid this dilemma, we propose and explore different parameter-efficient components, covering both different component types and different insertion positions in the encoder and decoder layers, so that only a minimal number of additional parameters needs to be activated for each task (see Section 3.2 for analysis). Together, these two settings make AdaVAE more elegant to construct and more efficient to train than existing large-scale VAEs Li et al. (2020); Fang et al. (2021); Park and Lee (2021).

Dataset:                  PTB                           YELP                          YAHOO
Model         PPL↓   -ELBO↓   MI↑   AU↑    PPL↓   -ELBO↓   MI↑   AU↑    PPL↓   -ELBO↓   MI↑   AU↑
M.A.         101.40  101.28  0.00    0    40.39  357.76  0.13     1    61.21  328.80  0.00    0
C.A.         108.81  102.81  1.27    5      –       –      –      –    66.93  332.68  2.77    4
SA-VAE          –       –      –     –      –    355.90  1.70     8    60.40  327.20  2.70   10
Aggressive    99.83  101.19  0.83    4    39.84  328.40  2.16    12    59.77  328.40  2.90   19
AE-BP         96.86  102.41  5.31   32    47.97     –    7.89    32    59.28  329.31  8.08   32
GPT-2         24.23     –      –     –    23.40     –      –      –    22.00     –      –     –
LSTM-LM      100.47  101.04    –     –    42.60  358.10    –      –    60.75  328.00    –     –
T5 VAE        57.69  101.17    –    11    53.05  166.15  5.55    10    54.40  140.57  5.43   28
Optimus       23.58   91.31  3.78   32    21.99  337.41  2.54    32    22.34  282.70  5.34   32
              23.66   91.60  4.29   32    21.99  337.61  2.87    32    22.56  289.88  5.80   32
              24.34   93.18  5.98   32    22.20  340.03  5.31    32    22.63  290.69  7.42   32
              26.69   96.82  7.64   32    22.79  344.10  7.67    32    23.11  293.34  8.85   32
              35.53   77.65  8.18   32    24.59  353.67  9.13    32    24.92  301.21  9.18   32
AdaVAE        23.18   89.27  1.21   32    31.22  115.74  1.07    32    26.53  109.69  1.20   32
              18.94   88.50  2.14   32    27.87  116.66  2.21    32    23.69  110.21  2.17   32
              11.97   89.52  5.54   32    18.21  116.62  6.02    32    16.04  112.39  5.88   32
              12.77   99.46  7.54   32    15.49  125.56  7.55    32    14.23  121.40  7.49   32
              27.98  110.35  7.82   32    35.92  139.46  7.62    32    31.01  136.06  7.65   32
Table 2: Language modeling ability of different VAE-based models. M.A., C.A., and SA-VAE are VAE-based language models with RNNs as both encoder and decoder. Best values of the proposed model and of all models are in blue and boldface respectively. Rows within the Optimus and AdaVAE blocks correspond to increasing free bits thresholds, top to bottom; an intermediate threshold is a good choice for AdaVAE, which then performs better in LM ability and only slightly worse in MI measurement compared with Optimus. All baseline results are from Li et al. (2020), except the T5 VAE numbers, which are from the best trade-off setting in Park and Lee (2021).

2.2 From Encoder to Latent Space

How to form the latent space from the encoder, and how to utilize it in the decoder to narrow the gap between discrete input sentences and the continuous latent embedding, is a key problem. Li et al. (2020); Park and Lee (2021) employed the pooled feature of the encoder output and passed it through a simple linear transformation to obtain the latent space, which may not be sufficient to leverage the knowledge learnt by the transformer layers. Fang et al. (2021) used the last state of the encoder as the key and value vectors to conduct an attention operation by matrix multiplication. Their model learns both the prior and the posterior of the latent space. We contend that producing the posterior and prior from the same type of source is very likely to cause the KL collapse issue. We therefore propose the improved Latent Attention operation based on these ideas to produce a meaningful latent space in AdaVAE: to get the latent vector, we adopt the last hidden states X from the encoder and assign:

    z_lat = Attention(Q, K, V) = softmax( Q K^T / sqrt(d) ) V,
    with Q = X W_Q,  K = I,  V = X,

where I is an identity matrix with the same size as X, and W_Q is a linear transformation without bias. The latent vector z_lat is then used to reparameterize the mean and variance of the posterior. Note that we only use it for the posterior of the latent space. This setting takes full advantage of the summarized information from the encoder and reduces the odds of the KL collapse problem that may occur in Fang et al. (2021).
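The reparameterization step mentioned above is the standard VAE trick; a minimal sketch under illustrative sizes (the linear map `W_stats` and all dimensions are assumptions, not the paper's exact parameterization):

```python
import numpy as np

def reparameterize(z_attn, W_stats, rng):
    """Map the attention summary to (mu, logvar), then sample with the
    reparameterization trick so gradients can flow through sampling."""
    stats = z_attn @ W_stats                 # (2 * d_z,): mean half, logvar half
    d_z = stats.shape[-1] // 2
    mu, logvar = stats[:d_z], stats[d_z:]
    eps = rng.normal(size=d_z)               # noise from N(0, I)
    return mu + np.exp(0.5 * logvar) * eps   # z ~ N(mu, diag(exp(logvar)))

rng = np.random.default_rng(0)
z_attn = rng.normal(size=16)                    # pooled Latent Attention summary
W_stats = rng.normal(size=(16, 2 * 8)) * 0.1    # hypothetical map to (mu, logvar)
z = reparameterize(z_attn, W_stats, rng)
```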

2.3 From Latent Space to Decoder

Inspired by existing methods, we investigate two different frames for adding the latent variable into the decoder layers. For a latent vector z drawn from the latent space: 1) Add to Memory (AtM) Li et al. (2020) projects z to both the attention key and value spaces with a unified linear layer W_M, and concatenates the projections with the key and value vectors in each attention layer:

    [z_K ; z_V] = z W_M,    K' = [z_K ; K],    V' = [z_V ; V],

where z_K and z_V are the projections of z into the key and value spaces respectively. 2) Pseudo Self-Attention (PSA) Fang et al. (2021) shares a similar idea with AtM, but it uses separate convolutional transformations with z as input to make sure the projected states match the dimensions of the past key/value states; PSA then concatenates them to the past states just like AtM to conduct the attention operation in the decoder layers. In practice, we find PSA is more effective in representation learning (see Sec. 3.2).
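A minimal sketch of the AtM scheme described above (sizes and names are illustrative): a single linear map produces the key and value projections of z, which are prepended to each attention layer's key/value memory.

```python
import numpy as np

def add_to_memory(z, K, V, W_m):
    """AtM sketch: one shared linear layer W_m maps z into the key and value
    spaces; the two projections are prepended to the layer's K/V memory."""
    d = K.shape[-1]
    z_kv = z @ W_m                   # shape (2d,): key half then value half
    z_k, z_v = z_kv[:d], z_kv[d:]
    return np.vstack([z_k, K]), np.vstack([z_v, V])

rng = np.random.default_rng(0)
d, T = 8, 5                          # head dim, sequence length
z = rng.normal(size=32)              # latent vector
W_m = rng.normal(size=(32, 2 * d))
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
K_aug, V_aug = add_to_memory(z, K, V, W_m)
```

After this step the decoder's attention runs unchanged over the augmented memory, so every token can attend to the latent "pseudo token" as if it were part of the context.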

                  System          WNLI    Yelp    SST-2   #params
                  Dataset size     634     44k      67k
Feature-based     BERT            0.577   0.88    0.731      -
                  Optimus (VAE)   0.563   0.92    0.789      -
                  Optimus (AE)    0.620     –     0.788      -
Fine-tuning       BERT            0.544   0.984   0.923      –
                  Optimus (VAE)   0.563   0.98    0.924      –
                  AdaVAE          0.586   0.968   0.860      –
Param.-efficient  BERT            0.524   0.965   0.902      –
                  BERT            0.531   0.973   0.911      –
                  AdaVAE          0.563   0.966   0.853      –
                  AdaVAE          0.589   0.961   0.840      –
Table 3: Latent classification accuracy on datasets with varied numbers of training examples. Fine-tuning baseline results were taken from Li et al. (2020). The paired rows in the Param.-efficient block differ in the hidden sizes of their adapters. Results are averaged over 5 runs.

3 Experimental Results and Analysis

3.1 Model and Training Details

The evidence lower bound (ELBO) of a VAE is:

    log p(x) ≥ ELBO = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL( q_φ(z|x) ‖ p(z) ),

where x is the text to be modeled and z is the latent variable sampled from the latent space. We followed previous VAE-related works Li et al. (2019, 2020); Pelsmaeker and Aziz (2019) and applied a free bits threshold λ to the entire KL term:

    L = −E_{q_φ(z|x)}[ log p_θ(x|z) ] + max( λ, KL( q_φ(z|x) ‖ p(z) ) ).
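For a diagonal Gaussian posterior and a standard normal prior, the thresholded KL term has a closed form; a minimal sketch (illustrative, not the training code):

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def free_bits_kl(mu, logvar, lam):
    """Free bits: the KL term stops shrinking below the threshold lam,
    which discourages the posterior from collapsing onto the prior."""
    return max(lam, kl_diag_gaussian(mu, logvar))
```

When the posterior equals the prior the raw KL is zero, but the objective still charges the floor value lam, so the optimizer gains nothing by collapsing the latent code further.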
For model architecture and initialization, the encoder and decoder were 8-layer and 12-layer GPT-2 transformers respectively, initialized from pre-trained weights on Huggingface. The parameter-efficient components were chosen from Adapter Houlsby et al. (2019) and Prefix Li and Liang (2021); the hidden size of the Adapter varied with the task, while the hidden size of the Prefix was 30 in the ablation study on the language modeling task. The hidden size of the latent space was set to 32 for language modeling and 768 for classification and generation (exactly the same as Optimus).

We employed cyclical annealing Fu et al. (2019) with 4 cycles for the KL weight, going from 0 to 1. We first activated the parameter-efficient components in the encoder and the parameters in the latent space for 2.5k iterations, then added the parameter-efficient components in the decoder for the rest of training, which is helpful for model training Li et al. (2019).
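The cyclical schedule can be sketched as follows (the `ramp` fraction of each cycle spent increasing is an assumption; Fu et al. (2019) make this proportion configurable):

```python
def cyclic_beta(step, total_steps, n_cycles=4, ramp=0.5):
    """Cyclical KL-weight schedule: within each cycle the weight rises
    linearly from 0 to 1 during the first `ramp` fraction of the cycle,
    then stays at 1 until the cycle restarts."""
    period = total_steps / n_cycles
    t = (step % period) / period     # relative position inside current cycle
    return min(1.0, t / ramp)
```

Restarting the weight at 0 each cycle periodically relieves KL pressure, giving the model repeated chances to pack information into the latent code before regularization tightens again.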

3.2 Language Modeling Ability

For language modeling ability, we took Perplexity (PPL), the negative evidence lower bound (-ELBO), the mutual information between the input sentence and the latent variable (MI), and the number of activated units in the latent space (AU) as measurements. All metrics were implemented exactly following Optimus for fair comparison. While PPL and ELBO measure the fluency of generated sentences, MI and AU indicate the representation learning capacity of a latent variable model. We first explore the effects of different types of parameter-efficient components as well as latent space generation and infusion methods for AdaVAE.
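Of these metrics, AU has a particularly simple form: a latent dimension counts as active when the variance of its posterior mean across the data exceeds a small threshold (0.01 is the commonly used value; this sketch is illustrative, not the paper's evaluation code):

```python
import numpy as np

def active_units(post_means, threshold=0.01):
    """Count latent dimensions whose posterior mean varies across the
    dataset; collapsed dimensions have near-constant means."""
    return int(np.sum(np.var(post_means, axis=0) > threshold))

# Two varying dimensions and one collapsed (constant) dimension:
means = np.array([[ 0.9, -1.2, 0.0],
                  [-0.7,  1.1, 0.0],
                  [ 0.4, -0.3, 0.0]])
```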

We explored 8 types of VAE models in Table 1: (1) AdaVAE: uses parallel adapters for the feed-forward layer in transformers with a hidden size of 128, Latent Attention for latent space construction, and PSA for representation infusion. (2) w/ Fine-tuning: fine-tunes the model. (3) w/ LG: uses the pooled feature from the encoder and a linear layer to form the latent space, like Optimus. (4) w/ LG; -PSA; +AtM: uses LG and replaces the PSA method with AtM for latent infusion. (5) w/ LG; +AtM: uses LG and both the PSA and AtM methods together for latent infusion. (6) +Prefix: adds Prefix Li and Liang (2021) with hidden size 30. (7) w/ parallel_attn: replaces the original adapters with parallel adapters on the attention outputs. (8) w/ sequential_attn: replaces the original adapters with sequential adapters on the attention outputs. We can tell from the statistics in Table 1 that AdaVAE achieves the best trade-off between language modeling and representation learning ability. Specifically, models with LG for latent space construction get much lower MI values than models with Latent Attention, which demonstrates that a simple linear transformation on transformer features is unfit for constructing the latent space.

In Table 2, (1) AdaVAE holds both the lowest PPL and -ELBO values on the three datasets among all baselines. (2) As the free bits threshold increases, the MI value generally improves in both Optimus and AdaVAE, but the -ELBO value commonly increases in both models as well. While the trend of the PPL value is monotonic with the free bits threshold in Optimus, AdaVAE shows a rebound, with the optimal PPL value attained at an intermediate threshold on YELP and YAHOO. The representation learning results demonstrate the appropriateness of employing adapters, Latent Attention, and PSA for big VAE models, and the LM results show that AdaVAE with unified GPT-2s produces text better than VAEs with disparate word embedding spaces for generation.

Source:  it is a very nice place to call home !
Target:  the food is so good and i seriously always feel like family .

Input:   food was good , served in large portions .
Output:  food, and especially the appetizers, always exceeded my
Input:   great experience every time at prestige animal clinic !
Output:  best experience i have ever had in any restaurant, food wise, the staff is amazing!
Input:   i was very disappointed with the quality and service .
Output:  the food was absolutely horrible, i absolutely never tasted the food and the quality of the service.
Table 4: Sentence transfer via latent arithmetic. Generated sentences are shown in blue.

3.3 Text Classification with Latent Features

To validate that the proposed model is qualified as a textual feature extractor even with a minimum of trainable parameters, we conducted full-sized as well as low-resource classification tasks on the WNLI, SST-2, and YELP datasets. In Table 3, Feature-based denotes activating only the classification layer of a model during training, and Param.-efficient denotes parameter-efficient training; fine-tuning baseline results were all taken from Li et al. (2020). In Figure 1, FT and PE denote fine-tuning and parameter-efficient training with adapters of 128 dimensions respectively; all results were averaged over 5 runs with distinct seeds. From these statistics: (1) when the number of labeled training samples is very low (full WNLI / 10 or 100 training samples from YELP), AdaVAE can achieve better classification accuracy than fine-tuned AdaVAE, and is sometimes even superior to both fine-tuned and parameter-efficient BERT or Optimus. (2) For middle-sized training data (full dataset / 1,000–10,000 training samples from YELP), AdaVAE shows competitive performance compared with BERT and Optimus, and is generally better than fine-tuned AdaVAE. (3) For the large-scale dataset (SST-2), the performance of AdaVAE is inferior to BERT and Optimus by a small margin. These statistics demonstrate that AdaVAE with few activated parameters is competent to extract textual features like a specialized PLM such as BERT or the state-of-the-art Optimus model. We ascribe this to the structural modification of unmasked GPT-2 transformers as the encoder of AdaVAE. Since we did not change the adapter structure significantly, the training time of our model is reduced compared with fine-tuning Ding et al. (2022).

Figure 1: Testing accuracy with a varying number of total labeled training samples on YELP dataset.
0.0  the location is clean and the patio is great !
 ·   the patio is in the middle of the block and the open right is better.
0.2  the patio terrace is really nice and on the menu!
 ·   the kitchen is perfect, however the menu is small on the menu option.
 ·   after the reservation is open, the place is spacious and well organized.
 ·   the restaurant is spacious with plenty of room to choose from, even inside a typical yelp.
 ·   the menu is a perfect fit for a night stand, and the waiter’s number is super friendly!
 ·   in addition to its extensive menu, the patio is absolutely a wonderful place!
 ·   if you are a fan of the service of a good restaurant, then definitely take a visit.
0.9  very attentive, especially for a non english tour.
1.0  very special place and highly recommended .
Table 5: Interpolating the latent space. Each row shows the interpolation ratio and the sentence generated from the interpolated latent vector (shown in blue).

3.4 Sentence Generation by Latent Manipulation

We conducted latent interpolation and analogy tasks for sentence generation. Given a sentence triplet (source, target, input), the analogy task generates a sentence that moves from the source style to the target style while remaining similar to the input, as shown in Table 4. Given a sentence pair, latent interpolation generates texts whose style transfers from one sentence to the other by latent space traversal, as shown in Table 5. As the tables show, the generated sentences absorb the features of all given sentences. For example, the three output texts in the analogy task all talk about food, which is relevant to the Target. When the Input turns negative, the Output also steers toward a negative emotion (the last row in Table 4).
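Both manipulations reduce to simple latent-space arithmetic; a sketch of the two operations as commonly defined (illustrative, not the paper's exact procedure):

```python
import numpy as np

def interpolate(z_a, z_b, tau):
    """Latent traversal for Table-5-style generation: slide from z_a to z_b
    as tau goes from 0 to 1, decoding a sentence at each step."""
    return (1.0 - tau) * z_a + tau * z_b

def analogy(z_src, z_tgt, z_input):
    """Latent arithmetic for Table-4-style transfer: apply the
    source-to-target offset to a new input code, then decode."""
    return z_input + (z_tgt - z_src)
```

Because the latent space is continuous, every intermediate code decodes to a fluent sentence, which is what makes the smooth style transitions in Table 5 possible.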

4 Conclusion

In this paper, we explored AdaVAE, the first large-scale VAE system with unified adaptive GPT-2s. AdaVAE is efficient to train, because it freezes both the PLM encoder and decoder while adding trainable adapters for each task. AdaVAE is elegant in construction, because its unified GPT-2 encoder and decoder share the same word embedding space. AdaVAE is also effective for language modeling: experiments validate that AdaVAE with the proposed Latent Attention has competent generative ability and strong feature extraction capacity.


  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C. Chan, W. Chen, et al. (2022) Delta tuning: a comprehensive study of parameter efficient methods for pre-trained language models. arXiv preprint arXiv:2203.06904. Cited by: §3.3.
  • L. Fang, T. Zeng, C. Liu, L. Bo, W. Dong, and C. Chen (2021) Transformer-based conditional variational autoencoder for controllable story generation. arXiv preprint arXiv:2101.00828. Cited by: §1, §2.1, §2.2, §2.3.
  • H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin (2019) Cyclical annealing schedule: a simple approach to mitigating kl vanishing. arXiv preprint arXiv:1903.10145. Cited by: §3.1.
  • J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig (2021) Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366. Cited by: §1.
  • N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790–2799. Cited by: §1, §3.1.
  • E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2021) Lora: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685. Cited by: §1.
  • B. Lester, R. Al-Rfou, and N. Constant (2021) The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691. Cited by: §1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §2.1.
  • B. Li, J. He, G. Neubig, T. Berg-Kirkpatrick, and Y. Yang (2019) A surprisingly effective fix for deep latent variable modeling of text. arXiv preprint arXiv:1909.00868. Cited by: §3.1, §3.1.
  • C. Li, X. Gao, Y. Li, B. Peng, X. Li, Y. Zhang, and J. Gao (2020) Optimus: organizing sentences via pre-trained modeling of a latent space. arXiv preprint arXiv:2004.04092. Cited by: §1, §2.1, §2.1, §2.2, §2.3, Table 2, Table 3, §3.1, §3.3.
  • X. L. Li and P. Liang (2021) Prefix-tuning: optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190. Cited by: §1, §3.1, §3.2.
  • S. Park and J. Lee (2021) Finetuning pretrained transformers into variational autoencoders. arXiv preprint arXiv:2108.02446. Cited by: §1, §2.1, §2.2, Table 2.
  • T. Pelsmaeker and W. Aziz (2019) Effective estimation of deep generative language models. arXiv preprint arXiv:1904.08194. Cited by: §3.1.
  • J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych (2020) AdapterFusion: non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247. Cited by: §1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §1.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §2.1.